Aislelabs is an engineering-driven company. We collect and process trillions of data points per year, providing a suite of analytics and marketing products to enterprise customers. As part of our journey to a cloud-native architecture, the Aislelabs engineering team adopted the HashiCorp stack, including Nomad as the workload orchestrator, after considering a number of alternatives, including vanilla Kubernetes, Rancher, DC/OS, Mesos, Docker Swarm, and others. This blog post will dive into:
- Why we chose Nomad
- Learnings in migrating our entire workload to Nomad
- How we manage lean day 2 operations
- Future deployments and associated technologies for our stack
Context: B2B SaaS Product by a Fast-growing Startup
To fully understand this post, we recommend you first read Modernizing the Tech Stack for a B2B SaaS Product Running on Bare Metal, which outlines who we are, how our products operate, and our needs. The linked post explains:
- We provide B2B software used by enterprise customers. Usage does not fluctuate nearly as much as consumer apps and we can predict our compute and storage requirements weeks in advance.
- We primarily deploy our products in a multi-tenant SaaS architecture in colocated premises. We own, manage, and operate bare metal servers.
- In addition to the primary deployment, we also have a few deployments in the Azure cloud or on customer premises for data residency requirements. We therefore need to make application deployment declarative and repeatable.
- We ingest, process, and store a lot of data. At peak, we ingest 1,000,000+ data points per second and can analyze billions of data points for interactive queries. Our infrastructure uses thousands of terabytes of SSDs and considerable RAM to achieve this.
- We are a mid-sized company that has enjoyed organic growth. We run a lean operation and do not want an army of SREs managing the infrastructure.
With our unique needs and constraints in mind, we evaluated different orchestrators.
Kubernetes is by far the most popular container orchestration system, born out of Google and now under the auspices of the Cloud Native Computing Foundation (CNCF). An entire ecosystem of technologies has grown around Kubernetes, some of it fueled by projects within CNCF. Many software solutions, including HashiCorp’s Consul, put Kubernetes compatibility front and centre on their websites, even though HashiCorp has its own competing orchestrator, Nomad.
Kubernetes focuses on container orchestration. It is the most advanced orchestrator available and can be deployed in complex enterprise environments with thousands of nodes. Its modular design lets you choose many of the ancillary services yourself. For example, it uses etcd as a highly available key-value store and CoreDNS for service discovery, and it offers a choice between Consul, Istio, Linkerd, and many others as a service mesh. This can be incredibly powerful, but also overwhelming.
Kubernetes, somewhat like Linux, has several distributions that take a more opinionated approach. We looked at Rancher as an option that simplifies installation and management of a Kubernetes cluster. But even with these distributions, onboarding is overwhelming unless you have a dedicated team.
Apache Mesos was originally made famous by solving Twitter’s fail whale issues, and later projects like DC/OS, which is built on Mesos, try to solve the same problem. However, it doesn’t have the same momentum as Kubernetes, yet it is just as complex.
Docker Swarm is Docker’s take on packaging and deploying applications across nodes. While it is good for small deployments and for getting started quickly, it does not offer a full-featured orchestrator. With the breakup of Docker into two organizations after Mirantis acquired Docker Enterprise, and even with Mirantis’ commitment to Swarm, it’s hard to see a bright future for Swarm.
HashiCorp Nomad is a lesser-known but powerful and simpler alternative. It is open source with an active community and is used in production by Roblox, CircleCI, Cloudflare, Trivago, Pandora, and many more. Importantly, it is simple to install, manage, and learn. It lets us write declarative job definitions in JSON or HCL and deploy them across data centers, which is useful as we have a mix of on-premises, cloud, and colocated infrastructure. With Consul, it supports health checks and self-healing. While we don’t need the scale, it is comforting to know that it can scale to launch a million containers.
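To give a feel for what a declarative job definition looks like, here is a minimal sketch in Python that builds a job in Nomad's JSON form. This is not one of our production job files: the job name, image path, datacenter names, and resource numbers are all hypothetical placeholders, and the function name `build_service_job` is ours for illustration only.

```python
import json


def build_service_job(job_id, image, datacenters, count=1, cpu_mhz=500, memory_mb=512):
    """Return a minimal Nomad service job as a JSON-serializable dict.

    The top-level "Job" wrapper and the field names follow Nomad's JSON job
    format; every value passed in below is a made-up example.
    """
    return {
        "Job": {
            "ID": job_id,
            "Name": job_id,
            "Type": "service",
            # The same definition can target several data centers.
            "Datacenters": list(datacenters),
            "TaskGroups": [
                {
                    "Name": "app",
                    "Count": count,  # number of instances to keep running
                    "Tasks": [
                        {
                            "Name": "app",
                            "Driver": "docker",
                            "Config": {"image": image},
                            "Resources": {"CPU": cpu_mhz, "MemoryMB": memory_mb},
                        }
                    ],
                }
            ],
        }
    }


if __name__ == "__main__":
    # Hypothetical job: two instances spread over an on-prem and a cloud datacenter.
    job = build_service_job(
        "web-api",
        "registry.example.internal/web-api:1.0.0",
        ["colo-1", "azure-1"],
        count=2,
    )
    print(json.dumps(job, indent=2))
```

Because the definition is plain data, it can be kept in version control and applied unchanged to each environment, which is what makes deployments repeatable.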
After careful deliberations, we decided to give Nomad a try.
Kubernetes vs Nomad
Which one should one choose? Based on our experience over the past months:
Nomad provides everything needed to orchestrate all common use scenarios and is a great choice for the majority of teams. It works well for anyone from the smallest teams to mid-sized companies. It may work for large teams as well, as shown by Cloudflare and Roblox, but I have no first-hand experience with that. HashiCorp’s documentation is great: I was able to install and run Nomad on my laptop in the first hour, and set up a dev cluster with multiple nodes in the first couple of weeks of learning.
Kubernetes is the Swiss Army knife with an endless ecosystem. As of writing, there were 1,487 technologies in the CNCF landscape that work with or extend Kubernetes. With Kubernetes, if you need a service mesh you can choose between Istio, Linkerd, Consul, Kuma, Zuul, OpenServiceMesh, Traefik Mesh, and many others. With Nomad, your service mesh choice is limited to Consul. If you want cloud-native on-prem storage with Kubernetes you can choose between Portworx, Ceph, Linstor, Longhorn, Cinder, and 50 other choices; with Nomad, in addition to Azure/AWS, the only on-prem choices are Ceph and Portworx at this point.
Kubernetes provides you endless choices. Nomad, on the other hand, integrates with one or two of the most used technologies in each category, making the choice simpler but also more limited.
Watch HashiCorp’s official take on the matter in this video starting at the 19:00 mark.
Migrating to Nomad
We formed a small team to handle the migration, and good documentation allowed us to do so painlessly. In total, it took us just 3 months, from start to finish, to migrate everything while still maintaining everyday operations and shipping new product enhancements. That’s pretty fast and reinforces that we made the right choice by selecting Nomad.
Containerization
Nomad offers a workload orchestration platform (as opposed to just container orchestration) and has a plugin-based model that allows orchestrating Docker containers, native executables, Java processes, or even Windows VMs. In our case, we were running Java-based applications, so we started by exploring running Java directly with Nomad. We soon realized that it’s better to containerize the application, even if Nomad doesn’t need it.
We recommend sticking to containers unless there is a legacy application with no viable path to containerize it. With our Java-based workload, we can package the JRE and Tomcat inside the container, which makes upgrades such as JDK 8 to OpenJDK 11, or a new Tomcat version, easy. It also allowed us to simplify the continuous integration pipeline with Jenkins and our e2e testing.
We had to learn how resource allocation works, including both soft and hard limits, and ran into some issues with automated restarts by Nomad/Consul. We quickly learned how it operated and were able to make everything stable within a couple of weeks. We did run into a few bugs along the way for which we had to find workarounds, including JDK 11 LTS’ incorrect memory settings, Nomad’s job file validation, and the Docker daemon locking up (fixed by updating the Docker version).
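One way to keep a JVM inside a container's hard memory limit is to derive the heap size from the limit explicitly instead of relying on JVM defaults. The sketch below illustrates that arithmetic only; it is not the workaround we actually used, and the 75% heap fraction, the 256 MB headroom, and the function name `jvm_heap_flag` are assumptions chosen for the example.

```python
def jvm_heap_flag(memory_limit_mb, heap_fraction=0.75, min_headroom_mb=256):
    """Return a -Xmx flag that leaves room for non-heap JVM memory.

    memory_limit_mb is the hard limit given to the container. The heap is
    capped at heap_fraction of that limit, and always leaves at least
    min_headroom_mb for metaspace, thread stacks, and native buffers.
    Both defaults are illustrative assumptions, not tuned values.
    """
    if memory_limit_mb <= min_headroom_mb:
        raise ValueError("memory limit is too small to leave the requested headroom")
    heap_mb = min(int(memory_limit_mb * heap_fraction), memory_limit_mb - min_headroom_mb)
    return f"-Xmx{heap_mb}m"


if __name__ == "__main__":
    for limit in (512, 2048, 8192):
        print(limit, jvm_heap_flag(limit))
```

If the process exceeds the hard limit it is killed and restarted by the orchestrator, so a heap setting that ignores the limit can look like unexplained automated restarts.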
Disk Mounts and Security
We found Docker’s default security policies unsettling. It runs the daemon as root, and any container can arbitrarily mount any disk location as root. We were, however, able to configure Nomad to disallow privileged containers and to restrict disk access to pre-defined volume mounts only. We recommend that everyone do the same until Docker’s rootless daemon comes out of experimental mode or Podman becomes more popular.
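The restrictions above are enforced by Nomad itself. As a complementary idea, a pipeline could also lint job definitions before they are submitted. The following is a minimal sketch of such a check, assuming jobs in Nomad's JSON form with Docker tasks that use the driver's `privileged` and `volumes` config keys; the function name `find_policy_violations` and the allowlist of host paths are hypothetical.

```python
def find_policy_violations(job, allowed_host_paths):
    """Return human-readable policy violations for Docker tasks in a job.

    A task violates the policy if it asks for a privileged container or
    mounts a host path that is not in the pre-approved allowlist.
    """
    violations = []
    for group in job.get("Job", {}).get("TaskGroups", []):
        for task in group.get("Tasks", []):
            if task.get("Driver") != "docker":
                continue
            config = task.get("Config", {})
            label = f"{group.get('Name')}/{task.get('Name')}"
            if config.get("privileged"):
                violations.append(f"{label}: privileged container requested")
            for volume in config.get("volumes", []):
                # Docker volume strings look like "host_path:container_path[:mode]".
                host_path = volume.split(":", 1)[0]
                if host_path not in allowed_host_paths:
                    violations.append(f"{label}: host path {host_path} is not pre-approved")
    return violations


if __name__ == "__main__":
    # Hypothetical job that breaks both rules.
    job = {
        "Job": {
            "TaskGroups": [
                {
                    "Name": "app",
                    "Tasks": [
                        {
                            "Name": "tomcat",
                            "Driver": "docker",
                            "Config": {
                                "image": "registry.example.internal/app:1.0.0",
                                "privileged": True,
                                "volumes": ["/etc:/host-etc", "/data/app:/data"],
                            },
                        }
                    ],
                }
            ]
        }
    }
    for problem in find_policy_violations(job, allowed_host_paths={"/data/app"}):
        print(problem)
```

A check like this only gives earlier feedback; the orchestrator-level restriction remains the actual safeguard.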
CI/CD Pipeline
Now, with every code commit to every development branch, Jenkins runs tests, builds the image, uploads it to a local Sonatype Nexus3 repository, and deploys it to a staging environment. Every commit to the master branch triggers a full e2e test suite. Production deployments across data centers are also triggered via Jenkins using Nomad’s HTTP APIs.
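As an illustration of what a deployment step driven by the HTTP API can look like, here is a minimal sketch that registers a job by POSTing its JSON definition to Nomad's `/v1/jobs` endpoint. It is not our Jenkins code: the server address, the job contents, and the helper names `build_register_request` and `deploy` are assumptions for the example, and the ACL token is optional.

```python
import json
import urllib.request


def build_register_request(nomad_addr, job, token=None):
    """Build the HTTP request that registers (or updates) a Nomad job."""
    headers = {"Content-Type": "application/json"}
    if token:
        # Nomad reads ACL tokens from the X-Nomad-Token header.
        headers["X-Nomad-Token"] = token
    return urllib.request.Request(
        url=nomad_addr.rstrip("/") + "/v1/jobs",
        data=json.dumps(job).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def deploy(nomad_addr, job, token=None, timeout=30):
    """Send the job to Nomad and return the decoded JSON response."""
    request = build_register_request(nomad_addr, job, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical job and address; only the request is built and printed here,
    # so running this file does not contact any server.
    job = {"Job": {"ID": "web-api", "Name": "web-api", "Datacenters": ["colo-1"]}}
    request = build_register_request("http://nomad.example.internal:4646", job)
    print(request.get_method(), request.full_url)
```

Keeping request construction separate from the network call makes the step easy to test in CI without a live cluster.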
Consul for Service Discovery
Consul integrates nicely with Nomad. Installation is a breeze with almost no configuration needed, and the documentation is decent. Today we use Consul only for service discovery, together with NGINX and Consul Template. Since we prefer open source, and NGINX is putting more and more of its features behind its commercial NGINX Plus offering, we plan to experiment with Envoy and Consul’s ingress gateways in the future.
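The idea behind pairing service discovery with NGINX is that the list of healthy service instances is turned into an `upstream` block whenever it changes. The sketch below shows that transformation in Python rather than in template syntax, assuming input shaped like the response of Consul's health endpoint for a service (`Node.Address`, `Service.Address`, `Service.Port`). The function name `render_upstream`, the service name, and the addresses are hypothetical.

```python
def render_upstream(name, health_entries):
    """Render an NGINX upstream block from Consul health entries.

    Each entry is expected to carry "Node" and "Service" objects. When the
    service did not register its own address, the node address is used.
    """
    servers = []
    for entry in health_entries:
        service = entry["Service"]
        address = service.get("Address") or entry["Node"]["Address"]
        servers.append(f"    server {address}:{service['Port']};")
    if not servers:
        # NGINX rejects an empty upstream, so point at a dead port instead;
        # requests then fail fast with a 502 until an instance is healthy.
        servers.append("    server 127.0.0.1:65535;")
    lines = [f"upstream {name} {{"] + sorted(servers) + ["}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical healthy instances of a service called "web_api".
    entries = [
        {"Node": {"Address": "10.0.0.5"}, "Service": {"Address": "", "Port": 8080}},
        {"Node": {"Address": "10.0.0.6"}, "Service": {"Address": "10.0.1.6", "Port": 8081}},
    ]
    print(render_upstream("web_api", entries), end="")
```

Sorting the server lines keeps the rendered file stable between runs, so NGINX only needs a reload when the set of instances really changes.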
This migration also allowed us to go back and look at old scripts and deployment practices we had adopted over the years. Some of them were poorly written and a few others were not as stable as we would like. These included custom scripts to restart processes based on heuristics or to periodically clean up disk space. We were able to eliminate most of them, either with Consul’s health checks (using Prometheus’s Java checks inside our application codebase) or by adding scripts inside the container images themselves to run alongside the Java processes. Overall, this was a great spring cleaning opportunity to reduce our technical debt.
Our Wish List
In addition to our own application, we now also run a number of other software packages using their containerized versions with Nomad. This includes our observability stack with Prometheus and Grafana, centralized logging with Loki, and message queues with Kafka and Zookeeper. While Nomad provides a good orchestration system, documentation on running these common software applications with Nomad is limited.
Take Loki for example: it has an example Kubernetes deployment configuration and a Helm chart. No such example is available for Nomad. We spent a good four weeks writing a Nomad job file that deploys Loki in a microservices architecture with a separate query frontend and ingester. This is partly because of the lack of good documentation from Loki itself and partly because all available documentation is Kubernetes focused. We will contribute our learnings and sample Nomad files upstream in the future.
We are looking at testing Ceph for block storage since Nomad now supports CSI and can mount Ceph storage volumes. Still, almost all documentation and sample configurations are about Kubernetes and not Nomad. Even the official Nomad documentation incorrectly says Openstack in the header.
My (unsolicited) advice to HashiCorp: Invest in creating Nomad-specific documentation for popular open source projects like Grafana, Loki, and Kafka and contribute it upstream. Contribute a documentation page on running XYZ with Nomad and Consul for top open-source projects. This will both increase Nomad’s visibility and make life easier for those of us running these services on Nomad.
Overall, we are happy with Nomad and are building even more tooling internally to streamline our deployment. Day 2 operations, which are often overlooked, are also simple: Nomad and Consul version upgrades have been straightforward. We are excited to embark on the Nomadic journey.
This blog post is part of a series of posts from Aislelabs’ engineering on modernizing the tech stack. If you are interested, keep reading (links to be added as posts become available):
- Aislelabs’ journey to cloud native architecture
- Centralized logging at scale with Grafana Loki with Nomad