The Aislelabs journey to cloud-native architecture
At Aislelabs, we have been diligently modernizing our tech stack as we transition to cloud-native technologies to run robust, scalable applications. We’ve learned a great deal about the process and invite the wider community to discover what we have learned on our journey.
What makes our transformation unique, compared to other companies that have publicly shared their stories, is that we build a B2B SaaS product as a mid-sized startup. Our compute needs, therefore, required our team to rework containerization, workload orchestration, service discovery, observability, centralized logging, end-to-end testing, and processes across our entire development and deployment lifecycle. These are very different needs from those of a consumer-focused product or a much larger company.
Aislelabs: A (not so) unique perspective
At Aislelabs, we run a large, bare-metal infrastructure in a colo. We ingest over 8 billion new data points a day, peaking at almost a million new data points per second. Data collected every month, from hundreds of millions of visits to brick and mortar locations, is ingested, processed, stored, and analyzed by employing thousands of SSDs. In addition to our primary infrastructure, we use Azure and OVH clouds for data residency around the world and also provide on-premise setup options to some of our customers. We also have strict SLAs with our customers and have averaged 99.97% availability in recent years.
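To put the availability figure in perspective, here is a rough back-of-the-envelope calculation of the downtime budget that 99.97% implies. It is only a sketch: the function name is made up for illustration, and it assumes a 365-day year and that availability is measured over wall-clock time.

```python
# Downtime budget implied by an availability percentage.
# Assumption: a 365-day year, availability measured over wall-clock time.
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the hours of downtime per year allowed at this availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


budget_hours = downtime_hours_per_year(99.97)
budget_minutes_per_month = budget_hours * 60 / 12

print(f"{budget_hours:.2f} hours per year")
print(f"{budget_minutes_per_month:.1f} minutes per month")
```

In other words, 99.97% leaves roughly 2.6 hours of downtime per year, or about 13 minutes per month.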
All of this makes our choices for the technologies we use tailor-made to meet our specific needs. We are sharing them, along with our choice rationale, to offer the broader community a perspective from a B2B SaaS product that processes billions of data points every day.
Scaling an Enterprise B2B Workload
Aislelabs sells to enterprise customers. Each new customer is onboarded after a sales cycle with purchase orders and implementation kickoffs. The chances of pulling off a Pokémon Go, where users multiply 50x overnight, are next to none. There are peaks and valleys, and external factors do impact usage of enterprise products (something video conferencing experienced with COVID-19), but demand is more predictable, and these spikes are both rare and come with a longer lead time than for consumer-focused products.
Bare Metal in the Cloud era
Everyone is moving to the cloud. We are not.
Let me rephrase it. We run most of our application on bare metal hardware, sourced and managed by our internal teams, and co-located in rented data centers. We do not use Amazon AWS. While I agree with the reasons behind Netflix’s move to AWS, we have neither their scale nor their deep pockets.
We do use third party providers like Microsoft Azure and OVH for compute resources, especially for some of our customers outside North America, but for the most part, we own, install, and manage our hardware in-house. We do this, not because we don’t like the cloud, but simply because cloud providers are expensive.
I plan to delve into the math behind the costs in a future blog post but, suffice it to say, for our use case, moving to Azure or AWS would balloon our costs by 500% to 1,000%. Compute and storage demand in our case does not fluctuate unpredictably, so we do not benefit from cloud providers’ rapid elasticity. Aislelabs has been a profitable company for many years, but we chose not to raise much venture capital, and we take great care when spending money. We have achieved 99.97% infrastructure availability over the past 3 years, we co-locate in the best data centre facility in the country, and we have grown steadily at a fast pace, all while being very cost-conscious.
We strive to meet the cloud-native definition: loosely coupled systems that are resilient, manageable, and observable, built to run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Our aim is for our application and its deployment to be completely independent of the underlying infrastructure, allowing us to build and scale our application across multiple instances.
Engineering Team and Processes
We use numerous technologies to productively build our products and serve our customer needs. But it is we, the human technologists, who are the most important determinant of our success.
Aislelabs is a mid-sized company. We are no longer the small, scrappy startup we began as, but we are also not a large company. Our entire engineering team can still fit in a large room, but it is no longer the single team working together that a small startup usually has. A large company has dedicated teams for managing infrastructure and site reliability, a luxury we currently do not have. Hence, we seek technologies that are easy to manage for a small team but can scale to support a large customer base.
In addition to the technology choices, I also plan to go into the details of our team, processes, and software life cycle, and how they have evolved over the years, in future blog posts.
To microservices or not to microservices
Microservices architecture is all the rage in the industry. This architecture style, when done right, can really help scale the engineering processes in a company with many teams. Microservices do not provide any direct technical advantages over a monolith, but they solve a big human organizational problem by breaking the application into more manageable pieces for smaller teams to take charge of each.
At Aislelabs, we operate what would be classified as a monolith. We have a single repository with plenty of Java code but multiple organized modules that can each be deployed independently. Additionally, we have a lot of React code for the frontend which uses REST APIs to communicate with the backend. We also inherited legacy code that uses JSPs and plain old HTML with jQuery, something we are slowly replacing as we move to React + APIs.
And while there are gradual paths to decompose a monolith to microservices, we are not there yet. For most companies, a well-designed monolith architecture is sufficient as microservices add a lot of complexity. Moving to microservices too early or in an incorrect manner can lead to even more pain, as experienced by Segment.
Open source DNA
Open source is part of our DNA and we favour it whenever possible assuming licenses permit. We have contributed to projects upstream and have released code of our own (with more to come). As we modernize our tech stack, we will focus exclusively on open source solutions.
A Modern Tech Stack
As our company grows, we are continually re-architecting our tech stack to meet the demand. The COVID-19 lockdown period, from March to July 2020, provided a unique opportunity for us to take a step back and double down on this process. Here’s an overview of how we operate our application and infrastructure today.
Containerization using Docker is now the de facto standard for running applications (here’s a good tutorial). It allows the application, and all required libraries and components, to be bundled as a self-contained image that can run on any host. This means a clear separation between the host (hardware and operating system) and the application.
When we first started moving to Docker we wondered “is it really all that useful?” Aislelabs runs a pure Java application and has virtually no operating system dependency, other than a compatible JRE. In hindsight, we should have containerized our application sooner because:
- When we moved from JDK 8 to OpenJDK 11, our transition lasted a few weeks. Managing two versions of Java on the same host required additional deployment steps (which can be prone to errors).
- With containers, our CI/CD pipeline is cleaner. We always had scripts to deploy Java processes remotely, but it’s much simpler now with containers.
- Use of containers also helped improve and simplify our end-to-end testing process.
We will detail our migration journey to containers, including bugs encountered with Docker engine, nuances of setting resource quotas, and security aspects with privileged Docker vs rootless daemon vs Podman, in a future blog post.
Workload orchestration is an important tool to clearly separate the application from the underlying infrastructure.
Kubernetes is in vogue in the industry these days. Kubernetes was born out of Google and is based on their internal cluster-management system, Borg. Mesos was created to replicate Google’s proprietary architecture and is used by Twitter, Airbnb, and others. Docker created its own version, Docker Swarm, to manage containers across hosts.
After a lot of deliberation, we chose HashiCorp Nomad as best suited for Aislelabs. Nomad provides more than all the features we need without the added complexity of Kubernetes. With Nomad, we can declaratively define our workload and move away from the traditional tight coupling of application and operating system. We do not have a dedicated SRE team that can run and manage Kubernetes (even after accounting for simplifications brought by tools like Rancher), nor do we need the additional power to run thousands of hosts with complex architecture. Nomad is used by companies like Trivago, PagerDuty, Cloudflare, and Roblox, and supports everything most companies need while being considerably simpler and more lightweight to operate.
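To illustrate what "declaratively define our workload" can look like, here is a minimal sketch that builds a job in the JSON shape Nomad's HTTP API accepts. It is not one of our actual job files: the job ID, image name, datacenter name, and resource numbers are all hypothetical, and in practice jobs are more commonly written as HCL files than built in code.

```python
import json


def build_job(job_id: str, image: str, count: int = 2) -> dict:
    """Build a Nomad service job in the JSON form used by the HTTP API.

    All values here are illustrative assumptions, not a real deployment.
    """
    return {
        "Job": {
            "ID": job_id,
            "Name": job_id,
            "Type": "service",          # long-running workload
            "Datacenters": ["dc1"],     # hypothetical datacenter name
            "TaskGroups": [
                {
                    "Name": "app",
                    "Count": count,     # desired number of instances
                    "Tasks": [
                        {
                            "Name": "server",
                            "Driver": "docker",
                            "Config": {"image": image},
                            # CPU is in MHz, memory in MB
                            "Resources": {"CPU": 500, "MemoryMB": 1024},
                        }
                    ],
                }
            ],
        }
    }


payload = build_job("example-api", "registry.example.com/example-api:1.0.0")
print(json.dumps(payload, indent=2))
```

The point of the declarative form is that the job states what should run (image, instance count, resources) and leaves it to the scheduler to decide which host runs it.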
If you are interested, keep reading our full blog post covering our deliberations and eventual decision to move to Nomad, and why Kubernetes is not for everyone.
Service Discovery and Service Mesh
The two main things we wanted the workload orchestrator to solve were declarative definitions of our application processes and improvements to our CI/CD and deployment processes. Both were achieved with Nomad. HashiCorp, the company behind Nomad, also makes Consul, a service discovery and service mesh solution. Consul works with Kubernetes and is one of the three most popular open source software packages of its kind, along with Linkerd and Istio.
We chose Consul to run alongside Nomad, as:
- Consul is easy to run and is tightly integrated with Nomad
- Consul provides service discovery and we integrate it with Nginx for automated deployment of every pull request as a test instance
- Consul provides health checks for Nomad tasks
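As a rough illustration of the pull-request test instances mentioned above, the sketch below turns service entries, shaped like the response of Consul's health endpoint for a service, into Nginx server blocks. It is an assumption-laden example rather than our actual setup: the `pr-<number>` tag convention, the `test.example.com` domain, the sample addresses, and the function name are all hypothetical, and the sample data is hard-coded instead of being fetched from Consul.

```python
def render_server_blocks(entries: list, base_domain: str) -> str:
    """Render one Nginx server block per service instance tagged 'pr-<n>'."""
    blocks = []
    for entry in entries:
        service = entry["Service"]
        # Consul leaves Service.Address empty when the node address applies.
        address = service["Address"] or entry["Node"]["Address"]
        pr_tags = [tag for tag in service["Tags"] if tag.startswith("pr-")]
        if not pr_tags:
            continue  # not a pull-request test instance
        blocks.append(
            "server {\n"
            "    listen 80;\n"
            f"    server_name {pr_tags[0]}.{base_domain};\n"
            "    location / {\n"
            f"        proxy_pass http://{address}:{service['Port']};\n"
            "    }\n"
            "}\n"
        )
    return "\n".join(blocks)


# Hard-coded sample in the shape of a Consul health query response.
sample = [
    {"Node": {"Address": "10.0.0.5"},
     "Service": {"Address": "", "Port": 23456, "Tags": ["pr-123"]}},
    {"Node": {"Address": "10.0.0.6"},
     "Service": {"Address": "10.0.0.6", "Port": 8080, "Tags": ["production"]}},
]

config = render_server_blocks(sample, "test.example.com")
print(config)
```

Because the orchestrator assigns ports dynamically, something has to keep the proxy configuration in sync with where each instance actually runs, and service discovery is what provides that mapping.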
We are a monolith, with no immediate plans to break it down to microservices. While we don’t really need service mesh, Consul is still an important part of our tech stack now.
Deploying an application is only the first step. Managing the application and associated underlying infrastructure is sometimes an underappreciated aspect. This is where observability comes in to monitor both host nodes and individual application processes, in granular detail, so that the application can be managed proactively.
We started with New Relic, many years ago, and then migrated to Datadog. Both these tools are proprietary and can get very expensive at scale. This year, we migrated to Prometheus + Grafana + Loki, allowing us to get better environment visibility while saving a ton on costs.
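For readers new to Prometheus, the sketch below shows the plain-text exposition format that Prometheus scrapes from an application's metrics endpoint. It is a hand-rolled illustration only: the metric name, label, and value are made up, label values are not escaped, and a real application would normally use an official client library rather than format this text itself.

```python
def render_gauge(name: str, help_text: str, samples: list) -> str:
    """Render one gauge metric in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    Assumption: label values contain no characters that need escaping.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric: items waiting in a processing queue on one host.
text = render_gauge(
    "app_queue_depth",
    "Items waiting to be processed.",
    [({"host": "node1"}, 42)],
)
print(text)
```

Prometheus pulls this text on a schedule and stores each labelled sample as a time series, which Grafana can then query and chart.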
In the future, we’ll write a detailed blog post on the observability stack at Aislelabs, covering how we orchestrate Prometheus and Grafana with Nomad, centralize all logging with Loki, and plan to use Jaeger and OpenTelemetry. We will also open source our Nomad job files for deploying these components at the same time and provide the GitHub link here.
Databases and Storage at Scale
The largest component of our infrastructure is storage. We collect billions of new data points every single day, ingesting, processing, and storing 1,000,000 data points per second at peak. We store trillions of data points in thousands of terabytes of SSD storage, and our storage array always keeps hundreds of terabytes available with 100,000+ IOPS, ready for complex analytics.
Suffice it to say, we run storage at massive scale and do so with off-the-shelf open-source tools. We probably operate one of the largest MySQL setups for a company of our size. We have looked at Vitess, which has quite an interesting history out of YouTube and Google’s Borg, but it is not fully MySQL compatible. We are also experimenting with TiDB, but it has its own nuances, like a 6 MB limit on columns. As Nomad adds support for the Container Storage Interface (CSI), we will also consider using Ceph for non-database storage. Our experiences with storage will be documented here in the future.
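A quick calculation helps relate the daily and per-second figures quoted in this post. This is only a sketch: it treats "over 8 billion" as exactly 8 billion, so the average it computes is a lower bound, and the variable names are chosen for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

points_per_day = 8_000_000_000   # "over 8 billion" a day, used as a floor
peak_per_second = 1_000_000      # peak ingest rate quoted in this post

# Average ingest rate if the daily volume were spread evenly.
average_per_second = points_per_day / SECONDS_PER_DAY

# How much higher the peak is than the daily average.
peak_to_average = peak_per_second / average_per_second

# Days needed to accumulate one trillion data points at this daily volume.
days_per_trillion = 1_000_000_000_000 / points_per_day

print(f"average: {average_per_second:,.0f} points/second")
print(f"peak is {peak_to_average:.1f}x the average")
print(f"one trillion points every {days_per_trillion:.0f} days")
```

So the average works out to a little over 92,000 data points per second, the peak is roughly 11 times that, and a trillion new data points accumulate about every 125 days. The gap between peak and average is what the ingest and storage layers have to be sized for.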
We have always used Ansible and Cobbler for provisioning and configuring our servers, and used Git to store all configurations. But a lot of what we did was still procedural and not directly reproducible. This creates extra work as we deploy our application out of standard SaaS models (as hosted or on-premise) or make changes. We want to move to full declarative DevOps in the future but we are not at a stage or scale to fully adopt GitOps.
We are continuing to modernize our processes, with simple steps like switching from Cobbler to MAAS, or using only Ansible for configuration changes. We will document our progress and learnings on our blog.
End-to-End Continuous Testing
We love writing code. We love building features. And when we write new code, sometimes we break stuff. Our customers don’t like it if we break stuff. We have put a lot of processes and QA steps in place to ensure we can ship 100+ new enhancements and product updates every month while ensuring the absolute highest quality, but this came at a cost: time.
This year, we started a major push for end-to-end testing and are in the process of moving from Selenium to Cypress. We are still adding more test cases every day, but we can now reduce some of the repetitive work our QA team was doing and focus even more on new updates and new features.
Final Thoughts and Additional Reading
With a combination of continuous integration with Jenkins, containerization with Docker, orchestration with Nomad, service discovery with Consul, and testing with Cypress, we are now able to run the entire test suite against every single commit to the Git master branch. Everything is automated for peace of mind, and I can say that I, at least, sleep better at night.
This blog post is the first of many to be published by the Aislelabs engineering team. You can continue reading about our engineering and tech stack at:
- Hashicorp Nomad: Workload Orchestration at Aislelabs
- Prometheus, Grafana, and Loki: Observability at Aislelabs
- Migration Journey: Oracle Java 8 to OpenJDK 11 and Beyond
- Cloud or Colo, Scaling without High Cost: Learnings from Running Aislelabs’ B2B Product Stack
You can also read all posts tagged engineering.