Is Kubernetes (K8s) a meme, or should you stop what you are doing right now and pick it up? Read on for my recommendations, a couple of years into using K8s in production.
Over the last decade I have deployed software in just about every way possible. In the beginning it was good old SFTP and phpMyAdmin; then it moved on to administering individual Linux servers with shell-script glue. I experienced the PaaS phenomenon of Heroku during its glory days, and then the second coming of low-config PaaS services like Vercel and Netlify. But I never really got to use Kubernetes beyond tutorials and short books.
In my current role I work across a couple of K8s clusters; we haven't needed to venture beyond the GKE platform. Today we're going to dive into that experience: what you should expect, and what to do (and not do).
Any serious deployment of Kubernetes is going to involve multiple clusters. I believe this to be an absolute certainty. You need isolation between a dev and a prod environment, and you may even run a local environment, although that's something we have avoided thus far. Given the reality that you will be running multiple clusters, you should make sure to introduce some kind of infrastructure-as-code tool. I have found Terraform to do a good job, and I prefer HCL to YAML, but YMMV!
There is relatively low friction to deploying new services. I would say it is equivalent to the ease of CapRover or Dokku. Beyond that, it works very well for ad-hoc things, e.g. you just want to temporarily run a pod so that you can get a shell and test some things in the infrastructure.
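For an ad-hoc shell, a throwaway pod manifest can be as small as the sketch below; the name and image are illustrative, not anything from our actual setup:

```yaml
# Throwaway pod: it just sleeps so you can exec into it, then gets deleted.
apiVersion: v1
kind: Pod
metadata:
  name: debug-shell            # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: shell
      image: alpine:3.19       # any small image with a shell works
      command: ["sleep", "3600"]
```

`kubectl apply -f` it, `kubectl exec -it debug-shell -- sh` to poke around, and `kubectl delete pod debug-shell` when you're done.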
Kind-of-suck is a mild way to put it. Job coordination is non-existent in the platform, and there is no facility for automated alerting; monitoring gives you simple healthy/not-healthy status messages. At our SaaS we introduced Dagster as an orchestration layer for jobs, but there are a few other good options that could be considered. If you're raw-dogging the K8s CronJob, then you need to do things like have application code log specific messages so you can set up a metric on that log and alert on it.
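To make that concrete, here is a rough sketch of a bare CronJob where the application itself has to emit the log line that a log-based metric alerts on; every name here is hypothetical:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-export            # hypothetical job name
spec:
  schedule: "0 2 * * *"           # 02:00 every day
  concurrencyPolicy: Forbid       # don't stack runs if one overruns
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: export
              image: registry.example.com/export:latest  # placeholder image
              # the app must log something matchable, e.g. "EXPORT_OK",
              # so a log-based metric can fire an alert when it is absent
```

Note that everything around success and failure signalling lives in your application code, not the platform.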
Managing YAML is a weird Sisyphean job. It is by far my least favourite language that I need to work with, and it makes up 90% of the lines in our infrastructure repos. I would LOVE to see better tools here than pure YAML or Kustomize. There are so many YAML footguns I could riff on, but I will keep it short and spare you the rant.
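For reference, the kind of Kustomize layout you end up maintaining looks roughly like this; the paths and names are illustrative:

```yaml
# overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                # shared Deployment/Service manifests
patches:
  - path: replica-patch.yaml  # e.g. bump replicas for prod
    target:
      kind: Deployment
      name: api               # hypothetical service name
```

It works, but every environment-specific tweak is yet another YAML file, which is exactly the complaint.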
A declarative deployment manifest is obviously good for deployment, and it's been really easy to manage this in CI/CD. At work we used GitHub Actions, and these days there is a plethora of good CI/CD tools. Not Google Cloud Build though; I found it functional but not very good for sharing common patterns and actions.
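A GitHub Actions workflow for this can stay pleasingly short. A hedged sketch, assuming GKE and a service-account secret named `GCP_SA_KEY` (the cluster name, region, and paths below are all placeholders):

```yaml
# .github/workflows/deploy.yaml
name: deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}   # assumed secret name
      - uses: google-github-actions/get-gke-credentials@v2
        with:
          cluster_name: prod             # placeholder cluster
          location: us-central1          # placeholder region
      - run: kubectl apply -k overlays/prod   # apply the declarative manifests
```

The declarative manifests do the heavy lifting; the pipeline is just authentication plus one `kubectl apply`.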
Kubernetes can burn through all your cloud credit if you don't take a cautious approach. Spinning up new nodes is so easy to do, and they can sit underutilised, burning the runway. Workloads go wild with no memory limits, and the controller helpfully kills another important process midway through a distributed transaction. Managing this is a NEW job that exists in the Kubernetes world. You get good reporting out of GKE to manage it, but a human is unfortunately still needed to check on things weekly. Being responsible for the platform still requires good knowledge of which workloads run when, so that you can keep your cluster efficient and optimised.
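The single cheapest defence is making resource requests and limits mandatory on every container. A sketch of the relevant fragment; the numbers are illustrative and should come from real usage data, not be copied:

```yaml
# per-container fragment inside a pod spec
resources:
  requests:
    cpu: "250m"        # what the scheduler reserves on a node
    memory: "256Mi"
  limits:
    memory: "512Mi"    # hard ceiling: this pod gets OOM-killed itself,
                       # rather than a neighbour being evicted for it
```

Requests keep nodes from sitting half-empty or overpacked; memory limits make a runaway workload the casualty of its own appetite.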
There is so much that you just don't need to think about when you're bought in to a good K8s platform like GKE. Things like the managed control plane, node upgrades, and load balancer provisioning are insanely easy and hardly need to be thought about.
We haven't needed to implement our own custom operator, although you could totally take that path in a larger organisation; at our current size (three-ish teams), it's just not needed. We have integrated a few off-the-shelf components into our cluster, like Dagster and MinIO, which provide an orchestration platform and fast, redundant object storage respectively. I wouldn't strictly recommend these unless you are doing a lot of data-engineering and AI-model work in your business.
I can't overstate how awesome it is to have an environment where you can easily spin up a temporary Postgres deployment and start messing with it in the infrastructure. This has been a huge boon for proof-of-concept work. Ease of experimentation is a virtue of infrastructure that every platform team should be focusing on.
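As a flavour of how little it takes, a throwaway Postgres for a proof of concept is roughly the sketch below; deliberately no persistence and a junk password, because it exists to be deleted:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: poc-postgres               # hypothetical PoC name
spec:
  replicas: 1
  selector:
    matchLabels: { app: poc-postgres }
  template:
    metadata:
      labels: { app: poc-postgres }
    spec:
      containers:
        - name: postgres
          image: postgres:16
          env:
            - name: POSTGRES_PASSWORD
              value: "throwaway"   # fine for a PoC, never for prod
          ports:
            - containerPort: 5432
```

One `kubectl apply`, a quick port-forward, and you're experimenting; one `kubectl delete` and it's gone.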
There are three levels you can work at with K8s, and I think it's worth considering them as tiers of needed complexity. Using a cluster as a developer, you can ignore a lot of the things in this list. Most of this list, in fact, is relevant to people in the middle category: people who are operating a cluster while also using it to deliver solutions, and supporting about ten other devs in using the cluster.
Dropping down a level to making a cluster requires bringing a few other things: auto-scaling groups, VM images with K8s, configuration management. At the current SaaS we haven't needed to drop to this level, and in fact I think we get huge benefit from just sticking to the GKE platform. Giving up the 'make a cluster' flexibility is an intentional decision to depend on best practice and out-of-the-box solutions.
You probably shouldn't do this for critical workloads, but I've found it fantastic for development and prototype purposes. It's super easy to destroy when the experiment is done, and you haven't needed to faff around with any Terraform to get it done; you can do it all within a dev environment without involving any synchronised process, and that is a huge benefit (see "It's easy to experiment").
Working with K8s has introduced all kinds of weird new terminology, but I think it's a good abstraction to have. See past rants about cattle vs. pets. It's a lot more fruitful to talk about the components of your system in terms of Deployments, StatefulSets and Jobs. The learning curve is steep, as you need to grok a lot, but it does make sense. Stick with it through the painful start; it's worth it!
For development, it's really nice to be able to just run locally in a container, or pull something from the registry to get it working alongside your code. I would always reach for containers on any deployment, even a single $5 DigitalOcean droplet project (a note to the reader: that's actually what you're reading this website from right now).
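Locally, the same images drop straight into a Compose file, so the service you're hacking on talks to the same Postgres it will see in the cluster; the tag, credential, and port mapping here are illustrative:

```yaml
# docker-compose.yml
services:
  db:
    image: postgres:16             # same image the cluster runs
    environment:
      POSTGRES_PASSWORD: devonly   # local-only credential
    ports:
      - "5432:5432"                # expose to code running on the host
```

`docker compose up db` and your local code connects to `localhost:5432` exactly as it would connect to the in-cluster service.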
Kubernetes is like a buffet with every tasty option available. You know you need to fill yourself up, but there are way too many options, and it's not always obvious which of the many solutions is 'right'. Many solutions are often recommended but overrated, 'service mesh' being the biggest culprit. We have managed to get a long way on our AI SaaS without bringing in a service mesh. Maybe your mileage is different here, but we found the existing abstractions over the network acceptable out of the box. Working with K8s, you can feel how long the platform has been around: some of the APIs are starting to show their age, and multiple ways to do the same thing often exist. It's worthwhile coming up with an 'example' service in your system so that devs know the right configuration to apply and don't need to duplicate the work of deciding.
If you have a couple of teams of developers, you want to deploy a lot of services, and you want to keep a fairly independent release schedule, then Kubernetes might be a really good option for your team. If you are already on Google Cloud Platform and just spinning things up, then it's kind of a no-brainer. The number of built-in services is really nice to have and will save you a huge amount of non-product work at the start, allowing you to focus on delivering value. When your workload gets massive, you know you can scale and migrate to better-suited services, and you are not locked into a PaaS that holds your business by the neck with an invoice.
That's not to say K8s has no warts or pain. Don't use Kubernetes if you are a solo founder or a bootstrapper, or even if you work in an organisation that will only ever afford one or two teams. I also wouldn't recommend that agencies or consultancies focus on K8s unless they are going to double down on managed services and sell Kubernetes operators as a business. It's a very advanced, complicated tool to run in production, and you really need to spend focused time operating the system, even with everything the GKE platform gives you.