Selfhosted Infra


I run a personal fleet of several servers and a few old computers in my living room, which together run services like

  • Personal DB
  • Personal storage and backups
  • VPN, with adblocking
  • ... and hosting for some of the projects you see on this website!

It also hosts all the infrastructure required to support that deployment. Learning to create resilient, low-maintenance infrastructure is also a goal -- the means is an end in itself!

I made it thinking about how I would build a company's platform as an SRE. It has a mostly open-source stack with:

  • Container orchestration through Nomad (an alternative to Kubernetes)
  • Metrics, performance monitoring, and log management through Grafana, Prometheus and Loki
  • Secure private networking through Tailscale
  • Reproducible, declarative Linux deployments through NixOS (although I sometimes install other OSes on some machines to experiment!)
  • Service discovery and a service mesh thanks to Consul
  • Infra-as-code thanks to Terraform
  • Service-to-service communication via gRPC

On this page I write up a technical overview of how I have architected the cluster. For findings and milestones achieved over time, you can check out my blog (which is hosted here too!).

I aim to keep this page up to date as the fleet evolves (I started in April 2023). Some notable milestones, in order:

  • from docker-compose to Nomad
  • from plaintext orchestration to the mTLS Consul service mesh
  • from hardcoded secrets to an HA Vault cluster
  • from a single Postgres node to an HA CockroachDB cluster
  • from plain Nomad HCL jobs to Nix-templated jobs
  • from no-code containers to a proper Go microservices monorepo
  • from a WireGuard mesh to a Tailscale network

Microservices

Inspired by Monzo's monorepo, my selfhosted repo includes a services/ directory where I develop Go microservices.

These are tiny (a few Go files each); they use gRPC to communicate with each other, with the CockroachDB cluster as their backing store.
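
To give a flavour, here is a minimal sketch of what one of these services looks like: a gRPC server plus a CockroachDB connection pool. It is illustrative rather than lifted from the repo: the standard gRPC health service stands in for the real generated protobuf stubs, and the COCKROACH_DSN environment variable is made up.

```go
// Minimal sketch of a tiny service: a gRPC server exposing only the standard
// health service (stand-in for the real generated stubs), plus a connection
// pool to CockroachDB over the Postgres wire protocol.
package main

import (
	"database/sql"
	"log"
	"net"
	"os"

	_ "github.com/jackc/pgx/v5/stdlib" // Postgres-wire driver; CockroachDB speaks the same protocol
	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Hypothetical DSN; in the cluster it would point at the CockroachDB service.
	db, err := sql.Open("pgx", os.Getenv("COCKROACH_DSN"))
	if err != nil {
		log.Fatalf("opening db: %v", err)
	}
	defer db.Close()

	lis, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatalf("listening: %v", err)
	}

	s := grpc.NewServer()
	healthpb.RegisterHealthServer(s, health.NewServer())
	// A real service would also register its generated gRPC handlers here.

	log.Printf("serving gRPC on %s", lis.Addr())
	if err := s.Serve(lis); err != nil {
		log.Fatalf("serving: %v", err)
	}
}
```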

They are all built in CI on every push to master. You can find a blog post about how the build automation works here.

Networking

Getting around NATs

The networking is not too complicated once you get past the VPN abstraction, which is the tricky bit to achieve without a SPOF. Some of my nodes are IPv6-capable and have public IPv4 addresses, some have no public IPv4, and one has neither: it sits behind a CGNAT and has no IPv6 address.

Networking setup (simplified): the nodes Cosmo, Maco, Ari, Ziggy, and Bianco are spread across Contabo, my London home, and Madrid. NATs in orange, ingress in green.

Using Tailscale as a mesh VPN has the huge advantage of a flat network topology once inside the VPN, even when some nodes get disconnected.
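
"Flat" here means a workload can reach any node by its tailnet address without caring what NAT sits in front of it. A trivial, hypothetical illustration (using maco as a MagicDNS hostname on the tailnet):

```go
// Dial a peer over the tailnet by its MagicDNS name. Whether the target is in
// a datacenter, behind home NAT, or behind CGNAT makes no difference here:
// Tailscale presents a flat network, so this is a plain TCP dial.
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	// "maco:5432" is illustrative; any tailnet hostname or address works the same way.
	conn, err := net.DialTimeout("tcp", "maco:5432", 5*time.Second)
	if err != nil {
		log.Fatalf("dialing over the tailnet: %v", err)
	}
	defer conn.Close()
	log.Printf("connected to %s", conn.RemoteAddr())
}
```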

Service Mesh

On a completely unnecessary impulse, I also deployed Consul to operate an Envoy service mesh.

Service mesh (simplified): workload A and workload B, running on different nodes (Maco and Miki), each talk to their local sidecar in plaintext; the sidecars carry the traffic between the nodes over mTLS.

Note that mTLS between containers is not strictly required, as all traffic is encrypted via Tailscale anyway.
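
Nomad takes care of registering services and their sidecars in Consul, but the shape of that registration is easy to show with the Consul Go API. A hedged sketch, with made-up service names and ports:

```go
// Rough equivalent of what a Nomad connect { sidecar_service { ... } } stanza
// registers in Consul: workload A, its Envoy sidecar, and a local plaintext
// port that the sidecar forwards to workload B over mTLS.
package main

import (
	"log"

	capi "github.com/hashicorp/consul/api"
)

func main() {
	client, err := capi.NewClient(capi.DefaultConfig())
	if err != nil {
		log.Fatalf("creating consul client: %v", err)
	}

	reg := &capi.AgentServiceRegistration{
		Name: "workload-a", // hypothetical service name
		Port: 8080,
		Connect: &capi.AgentServiceConnect{
			SidecarService: &capi.AgentServiceRegistration{
				Proxy: &capi.AgentServiceConnectProxyConfig{
					Upstreams: []capi.Upstream{{
						// Workload A reaches workload B via localhost:9191;
						// the sidecars handle the mTLS hop between nodes.
						DestinationName: "workload-b",
						LocalBindPort:   9191,
					}},
				},
			},
		},
	}

	if err := client.Agent().ServiceRegister(reg); err != nil {
		log.Fatalf("registering service: %v", err)
	}
}
```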

Ingress

Ingress is structured as follows:

Ingress setup: HTTPS from the Internet lands on a traefik container on either of two non-CGNAT nodes; traefik then forwards the request over tailscale to the target container's sidecar (and on to the service's HTTP port) on any node.

Traefik is able to do mTLS with the Consul service mesh directly.
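
Traefik discovers routes through its Consul catalog provider, so exposing a service publicly comes down to tags on its Consul registration. Another hedged sketch (the service name and hostname are invented; in my cluster these tags live in the Nomad job's service stanza):

```go
// Sketch of tagging a Consul service so Traefik's Consul catalog provider
// picks it up and routes HTTPS traffic to it.
package main

import (
	"log"

	capi "github.com/hashicorp/consul/api"
)

func main() {
	client, err := capi.NewClient(capi.DefaultConfig())
	if err != nil {
		log.Fatalf("creating consul client: %v", err)
	}

	err = client.Agent().ServiceRegister(&capi.AgentServiceRegistration{
		Name: "blog", // hypothetical service
		Port: 8080,
		Tags: []string{
			"traefik.enable=true",
			"traefik.http.routers.blog.rule=Host(`blog.example.com`)", // made-up hostname
			"traefik.http.routers.blog.tls=true",
		},
	})
	if err != nil {
		log.Fatalf("registering service: %v", err)
	}
}
```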

Orchestration

Container orchestration is done with Nomad. All nodes are clients (meaning a container can get scheduled on any node) and 3 nodes are Nomad servers (ie, they host the Nomad 'control plane', in k8s lingo).

The nodes that host Nomad servers happen to be the datacenter-hosted nodes, because they have more uptime than the ones hosted in London or Madrid.

Orchestration setup: every node (house-hosted or DC-hosted) runs a Nomad client that manages its local container workloads; the Nomad servers on the DC-hosted nodes schedule workloads onto all of the clients.

All containers in the cluster are managed by Nomad. All non-containerised workloads (Nomad and Vault) are configuration-managed via NixOS.

Nomad Job page for this website
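
My real jobs are Nix-templated HCL, but the structure of a Nomad job is easy to see through the Nomad Go API. A minimal, hypothetical service job that schedules a single container:

```go
// Minimal sketch of what Nomad schedules: a service job with one task group
// running one Docker container. This is only the Go-API view of the same
// structure my Nix-templated HCL jobs produce.
package main

import (
	"log"

	nomad "github.com/hashicorp/nomad/api"
)

func main() {
	client, err := nomad.NewClient(nomad.DefaultConfig())
	if err != nil {
		log.Fatalf("creating nomad client: %v", err)
	}

	// Hypothetical job: one instance of the whoami container.
	job := nomad.NewServiceJob("whoami", "whoami", "global", 50)
	job.Datacenters = []string{"dc1"}

	task := nomad.NewTask("whoami", "docker").
		SetConfig("image", "traefik/whoami:latest")

	job.AddTaskGroup(nomad.NewTaskGroup("web", 1).AddTask(task))

	// The Nomad servers then pick a client node and schedule the container there.
	if _, _, err := client.Jobs().Register(job, nil); err != nil {
		log.Fatalf("registering job: %v", err)
	}
}
```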

Monitoring

All of this is covered by a Grafana monitoring stack (Loki, Mimir, Grafana, etc). I do think it is a bit of an anti-pattern to run the monitoring on top of the very infrastructure you are trying to monitor, but I chose to be pragmatic.

Screenshot of a Grafana monitoring dashboard
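
On the services side, metrics collection typically amounts to exposing a /metrics endpoint for the stack to scrape. A minimal sketch with the Prometheus Go client (the counter name is made up):

```go
// Minimal sketch of how a Go service might expose metrics for the monitoring
// stack to scrape: a /metrics endpoint plus one (made-up) counter.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical counter; a real service would track whatever matters to it.
var requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
	Name: "myapp_requests_total",
	Help: "Total number of handled requests.",
})

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		requestsTotal.Inc()
		w.Write([]byte("ok"))
	})

	// The scrape endpoint for the monitoring stack.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```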

Looking back

Overall, the setup is overkill for what I am running. I could achieve most of this with SSH and docker compose alone, but I learnt a great deal of SRE skills by trying to build an industry-grade platform that could scale to 10 or 1000 services!

If I were to found a startup tomorrow and had to develop its platform, I would reuse much of the technology I used for this side project.