I run a personal fleet of several servers and a few old computers in my living room, which run services like
It also hosts all the infrastructure required to support that deployment. Learning to create resilient, low-maintenance infrastructure is also a goal -- the means is an end in itself!
I made it thinking about how I would build a company's platform as an SRE. It has a mostly open-source stack with:
On this page I write up a technical overview of how I have architected the cluster. For findings and milestones I have achieved over time, you can check out my blog (which is hosted here too!).
I aim to keep this page up to date as the fleet evolves (I started in April 2023). Some notable milestones, in order:
Inspired by Monzo's monorepo, my selfhosted repo includes a services/
directory where I develop Go microservices.
These are tiny (a few Go files each); they communicate with each other over gRPC and use the CockroachDB cluster as a backing store.
They are all built on CI on every push to master. You can find a blog post about how the automation around the builds works here.
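To make this concrete, here is a minimal sketch of what one of these services looks like in spirit; the listen port, the CockroachDB DSN, and the fact that it only registers a health endpoint are illustrative placeholders, not code from the repo:

```go
// Minimal sketch of a tiny gRPC service backed by CockroachDB
// (port, DSN and registered services are placeholders).
package main

import (
	"database/sql"
	"log"
	"net"

	_ "github.com/jackc/pgx/v5/stdlib" // database/sql driver; CockroachDB speaks the Postgres wire protocol
	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	"google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Open a connection to the CockroachDB cluster via the standard database/sql API.
	db, err := sql.Open("pgx", "postgresql://svc@cockroach.internal:26257/app?sslmode=verify-full")
	if err != nil {
		log.Fatalf("opening cockroachdb: %v", err)
	}
	defer db.Close()
	if err := db.Ping(); err != nil {
		log.Fatalf("pinging cockroachdb: %v", err)
	}

	lis, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}

	srv := grpc.NewServer()
	// A real service would register its generated gRPC servers here;
	// the standard health service stands in for them in this sketch.
	grpc_health_v1.RegisterHealthServer(srv, health.NewServer())

	log.Fatal(srv.Serve(lis))
}
```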
The networking is not too complicated once you get past the VPN abstraction, which is the tricky bit to achieve without a SPOF. Some of my nodes have public IPv4 and IPv6 addresses, some have no public IPv4 address, and one has neither: it sits behind CGNAT and has no IPv6 address.
Networking setup (simplified). NATs in orange, ingress in green.
Using Tailscale as a mesh VPN has the huge advantage of a flat network topology once inside the VPN, even when some nodes get disconnected.
On a completely unnecessary impulse, I also deployed Consul to operate an Envoy service mesh.
Note that mTLS between containers was not strictly required as all traffic is encrypted via Tailscale anyway.
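As a rough sketch of what the mesh buys you (the local upstream port is a placeholder, and the gRPC health check stands in for a real RPC): each task's Envoy sidecar exposes its upstreams on localhost, the application dials plaintext locally, and the sidecars handle mTLS between themselves.

```go
// Sketch of one service calling another through its local Envoy sidecar
// (the upstream port 8081 is a placeholder chosen in the job spec).
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Plaintext to localhost only; the sidecars encrypt and authenticate the hop between nodes.
	conn, err := grpc.NewClient("127.0.0.1:8081",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dialing sidecar upstream: %v", err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Standard gRPC health check as a stand-in for a real service call.
	resp, err := grpc_health_v1.NewHealthClient(conn).Check(ctx, &grpc_health_v1.HealthCheckRequest{})
	if err != nil {
		log.Fatalf("health check: %v", err)
	}
	log.Printf("upstream status: %s", resp.GetStatus())
}
```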
Ingress is structured as follows:
Ingress setup
Here Traefik is able to do mTLS with the Consul service mesh.
Container orchestration is done with Nomad. All nodes are clients (meaning a container can get scheduled on any node) and 3 nodes are Nomad servers (i.e., they host the Nomad 'control plane', in k8s lingo).
The nodes that host Nomad servers happen to be the datacenter-hosted ones, because they have better uptime than the nodes hosted in London or Madrid.
Orchestration setup
All containers in the cluster are managed by Nomad. All non-containerised workloads (Nomad itself and Vault) are configuration-managed via NixOS.
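For a feel of what talking to that control plane looks like, here is a small snippet (not part of the platform itself) using the official Nomad Go API client to list nodes and jobs; it assumes NOMAD_ADDR and, if ACLs are on, NOMAD_TOKEN are set in the environment:

```go
// Peek at the Nomad control plane: list client nodes and running jobs.
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// DefaultConfig reads NOMAD_ADDR / NOMAD_TOKEN from the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatalf("creating nomad client: %v", err)
	}

	nodes, _, err := client.Nodes().List(nil)
	if err != nil {
		log.Fatalf("listing nodes: %v", err)
	}
	for _, n := range nodes {
		fmt.Printf("node %-20s dc=%-10s status=%s\n", n.Name, n.Datacenter, n.Status)
	}

	jobs, _, err := client.Jobs().List(nil)
	if err != nil {
		log.Fatalf("listing jobs: %v", err)
	}
	for _, j := range jobs {
		fmt.Printf("job  %-20s type=%-8s status=%s\n", j.ID, j.Type, j.Status)
	}
}
```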
All of this is monitored with a Grafana stack (Loki, Mimir, Grafana, etc). I do believe it is a bit of an anti-pattern to run the monitoring on top of the infrastructure it is meant to monitor, but I chose to be pragmatic.
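Roughly, a service exposes a /metrics endpoint that gets scraped and remote-written into Mimir; something like the sketch below (the metric name and port are illustrative, not the actual instrumentation):

```go
// Sketch of a service exposing Prometheus-style metrics for the Grafana stack
// (metric name and port are placeholders).
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical counter incremented on every handled request.
var requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
	Name: "myservice_requests_total",
	Help: "Total requests handled by this service.",
})

func main() {
	http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
		requestsTotal.Inc()
		w.Write([]byte("ok"))
	})

	// A scraper collects /metrics and remote-writes the samples into Mimir.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9100", nil))
}
```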
Overall, the setup is overkill for what I am running. I could achieve most of this with SSH and docker compose alone, but I learnt a great deal of SRE skills by trying to build an industry-grade platform that could scale to 10 or 1000 services!
If I were to found a startup tomorrow and had to develop its platform, I would reuse much of the technology I used for this side-project.