I run a personal fleet of several servers and a few old computers in my living room, which run services like
It also hosts all the infrastructure required to support that deployment. Learning to create resilient, low-maintenance infrastructure is also a goal -- the means is an end in itself!
I made it thinking about how I would build a company's platform as an SRE. It has a mostly open-source stack with:
On this page I write up a technical overview of how I have architected the cluster. For findings and milestones I have achieved over time, you can check out my blog (which is hosted here too!).
I aim to keep this page up to date as the fleet evolves (I started in April 2023). Some notable milestones, in order:
Inspired by Monzo's monorepo, my selfhosted repo includes a services/
directory where I develop Go microservices.
These are tiny (a few Go files each); they communicate with each other over gRPC and use the CockroachDB cluster as a backing store.
They are all built on CI on every push to master. You can find a blog post about how the automation around the builds works here.
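To make this concrete, here is a minimal sketch of what one of these services looks like in spirit; the listen port, the CockroachDB DSN, and the fact that it only registers a health endpoint are illustrative placeholders, not code from the repo:

```go
// Minimal sketch of a tiny gRPC service backed by CockroachDB
// (port, DSN and registered services are placeholders).
package main

import (
	"database/sql"
	"log"
	"net"

	_ "github.com/jackc/pgx/v5/stdlib" // database/sql driver; CockroachDB speaks the Postgres wire protocol
	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	"google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Open a connection to the CockroachDB cluster via the standard database/sql API.
	db, err := sql.Open("pgx", "postgresql://svc@cockroach.internal:26257/app?sslmode=verify-full")
	if err != nil {
		log.Fatalf("opening cockroachdb: %v", err)
	}
	defer db.Close()
	if err := db.Ping(); err != nil {
		log.Fatalf("pinging cockroachdb: %v", err)
	}

	lis, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}

	srv := grpc.NewServer()
	// A real service would register its generated gRPC servers here;
	// the standard health service stands in for them in this sketch.
	grpc_health_v1.RegisterHealthServer(srv, health.NewServer())

	log.Fatal(srv.Serve(lis))
}
```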
The networking is not too complicated once you get past the VPN abstraction, which is the tricky bit to achieve without a SPOF. Some of my nodes have public IPv4 and IPv6 addresses, some have no public IPv4 address, and one has neither: it sits behind CGNAT and has no IPv6 address.
Networking setup (simplified). NATs in orange, ingress in green.
Using Tailscale as a mesh VPN has the huge advantage of a flat network topology once inside the VPN, even when some nodes get disconnected.
On a completely unnecessary impulse, I also deployed Consul to operate an Envoy service mesh.
Note that mTLS between containers was not strictly required as all traffic is encrypted via Tailscale anyway.
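As a rough sketch of what the mesh buys you (the local upstream port is a placeholder, and the gRPC health check stands in for a real RPC): each task's Envoy sidecar exposes its upstreams on localhost, the application dials plaintext locally, and the sidecars handle mTLS between themselves.

```go
// Sketch of one service calling another through its local Envoy sidecar
// (the upstream port 8081 is a placeholder chosen in the job spec).
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Plaintext to localhost only; the sidecars encrypt and authenticate the hop between nodes.
	conn, err := grpc.NewClient("127.0.0.1:8081",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dialing sidecar upstream: %v", err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Standard gRPC health check as a stand-in for a real service call.
	resp, err := grpc_health_v1.NewHealthClient(conn).Check(ctx, &grpc_health_v1.HealthCheckRequest{})
	if err != nil {
		log.Fatalf("health check: %v", err)
	}
	log.Printf("upstream status: %s", resp.GetStatus())
}
```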
Ingress is structured as follows:
Ingress setup
Here Traefik is able to do mTLS with the Consul service mesh.
Container orchestration is done with Nomad. All nodes are clients (meaning a container can get scheduled on any node) and 3 nodes are Nomad servers (i.e., they host the Nomad 'control plane', in k8s lingo).
The nodes that host Nomad servers happen to be the datacenter-hosted ones, because they have better uptime than the nodes hosted in London or Madrid.
Orchestration setup
All containers in the cluster are managed by Nomad. All non-containerised workloads (Nomad itself and Vault) are configuration-managed via NixOS.
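For a feel of what talking to that control plane looks like, here is a small snippet (not part of the platform itself) using the official Nomad Go API client to list nodes and jobs; it assumes NOMAD_ADDR and, if ACLs are on, NOMAD_TOKEN are set in the environment:

```go
// Peek at the Nomad control plane: list client nodes and running jobs.
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// DefaultConfig reads NOMAD_ADDR / NOMAD_TOKEN from the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatalf("creating nomad client: %v", err)
	}

	nodes, _, err := client.Nodes().List(nil)
	if err != nil {
		log.Fatalf("listing nodes: %v", err)
	}
	for _, n := range nodes {
		fmt.Printf("node %-20s dc=%-10s status=%s\n", n.Name, n.Datacenter, n.Status)
	}

	jobs, _, err := client.Jobs().List(nil)
	if err != nil {
		log.Fatalf("listing jobs: %v", err)
	}
	for _, j := range jobs {
		fmt.Printf("job  %-20s type=%-8s status=%s\n", j.ID, j.Type, j.Status)
	}
}
```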
All of this is monitored with a Grafana stack (Loki, Mimir, Grafana, etc). I do believe it is a bit of an anti-pattern to run the monitoring on top of the infrastructure it is meant to monitor, but I chose to be pragmatic.
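Roughly, a service exposes a /metrics endpoint that gets scraped and remote-written into Mimir; something like the sketch below (the metric name and port are illustrative, not the actual instrumentation):

```go
// Sketch of a service exposing Prometheus-style metrics for the Grafana stack
// (metric name and port are placeholders).
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical counter incremented on every handled request.
var requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
	Name: "myservice_requests_total",
	Help: "Total requests handled by this service.",
})

func main() {
	http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
		requestsTotal.Inc()
		w.Write([]byte("ok"))
	})

	// A scraper collects /metrics and remote-writes the samples into Mimir.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9100", nil))
}
```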
Overall, the setup is overkill for what I am running. I could achieve most of this with SSH and docker compose alone, but I learnt a great deal of SRE skills by trying to build an industry-grade platform that could scale to 10 or 1000 services!
If I were to found a startup tomorrow and had to develop its platform, I would reuse much of the technology I used for this side-project.