Three Days in the Field, One Quiet SSL Bug

We just got back from three days off-grid with the rig. The trip surfaced something we had not seen in the lab: Farwatch was about to lose its SSL certificate, and the renewal we thought we had wired up had quietly stopped working.

The whole point of getting the platform out of the lab and into a real campground is to find the failures that only happen when you stop watching. This is one of those.

What we found

Farwatch is the cloud companion to Headwaters. It serves the dashboard, brokers MQTT from the rig over the public internet, and secures every connection with TLS using a Let's Encrypt certificate. Let's Encrypt certificates expire after 90 days, so a cloud server you do not touch needs a renewal process that runs whether you are looking at it or not.

The original design renewed certs from a host-side cron job. A script on the host would shell into Docker, run certbot, copy the renewed files into the bind-mounted keys directory, and restart the services that consume them. It worked the first time. We assumed it would keep working.

While we were parked under pine trees with no cell signal, the cron job stopped firing on the cloud host. We caught it by accident on the way home, on a fuel stop with one bar of LTE, when the Outbound app threw a TLS error trying to reach Farwatch. The cert had not expired yet, but it was inside the renewal window and nothing was renewing it. Another two weeks and the whole cloud bridge would have gone dark.

Why a host cron is fragile

The cron job lived outside the Docker stack. That is the problem. Every other piece of Farwatch is declared in docker-compose.yml: when you bring the stack up, you get the broker, the database, the backend, and the frontend in one command. The renewal lived in a script in scripts/ with a line of installation instructions in the README. Three categories of failure follow from that:

It can drift away from the stack. Rebuild the host, restore from backup, move providers, and the cron entry is gone unless someone remembers to re-add it. Nothing in the project tells you it is missing.
It can fail silently. If the script errors, the only place it shows up is a host log file most operators never check. Certbot itself is happy. The stack is happy. The renewal just never happens.
It assumes the host can speak Docker the same way every time. Different distros, different docker compose plugins, different paths to the binary. Each one is a chance for the script to break.

What we changed

Renewal is now a service inside docker-compose.yml. It is built from a small certbot image that bundles the certbot binary and the Docker CLI together. The container runs a renewal loop every twelve hours, calls certbot renew in webroot mode through the running nginx, and when a renewal actually happens, fires a deploy-hook that copies the new files into the shared keys volume and restarts the services that need them: frontend, mosquitto, and backend.
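
A minimal sketch of what such a renewal loop could look like, not the actual Farwatch entrypoint. The webroot path, the hook path, and the variable names are assumptions for illustration. The certbot options used (renew, --webroot, -w, --deploy-hook) are standard, and certbot only runs the deploy hook when a certificate was actually renewed.

```shell
#!/bin/sh
# Hypothetical renewal loop for a certbot sidecar container.
# All paths and variable names below are assumptions for illustration.

WEBROOT="${WEBROOT:-/var/www/certbot}"             # assumed webroot served by nginx
DEPLOY_HOOK="${DEPLOY_HOOK:-/opt/deploy-hook.sh}"  # assumed hook script path
INTERVAL="${INTERVAL:-43200}"                      # twelve hours, in seconds
CERTBOT="${CERTBOT:-certbot}"                      # overridable for dry runs

renew_once() {
    # certbot skips certificates that are not yet due for renewal and
    # runs the deploy hook only after a successful renewal.
    "$CERTBOT" renew --webroot -w "$WEBROOT" --deploy-hook "$DEPLOY_HOOK"
}

renew_loop() {
    while true; do
        # Failures go to stderr, so they appear in the container logs
        # instead of a host log file nobody reads.
        if ! renew_once; then
            echo "renewal attempt failed at $(date -u)" >&2
        fi
        sleep "$INTERVAL"
    done
}

# Only start looping when explicitly asked, so the file can be sourced safely.
if [ "${1:-}" = "run" ]; then
    renew_loop
fi
```

Run as the container command with the argument run. Setting CERTBOT=echo prints the command that would be executed without contacting Let's Encrypt.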

The whole renewal subsystem now boots and lives with the rest of the stack. Bring Farwatch up with docker compose up -d and you get the renewer alongside everything else. Tear it down, move it, redeploy it, and the renewer comes with you.

A subtle detail about file writes

One thing is worth calling out for anyone running a similar setup. The deploy-hook copies new cert files into a directory that is bind-mounted into multiple containers. Replacing a file outright (mv, cp --remove-destination, or any write-then-rename pattern) creates a fresh inode, and any process that opened the old file before the swap keeps reading the old file until it reopens it. So the hook does the writes with shell redirection instead: cat new.pem > /keys/server.crt. The redirect truncates the existing file in place and writes the new contents into the same inode, so the running containers see the change. A plain cp onto an existing path normally overwrites in place as well, but the redirect makes that behavior explicit. Then a controlled restart of the three TLS-consuming services picks up the new material cleanly.
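
The difference between overwriting in place and swapping a new file into the path is easy to see in a scratch directory. This is a small demonstration with made-up file names, not part of the Farwatch hook; it touches only a temporary directory.

```shell
#!/bin/sh
# Demonstrates in-place overwrite vs. replace-by-rename in a scratch directory.
# File names are made up; nothing here touches real certificates.

inode_of() {
    # `ls -i` prints "<inode> <name>"; keep the first field.
    ls -i "$1" | awk '{print $1}'
}

demo_dir=$(mktemp -d)
printf 'old cert\n' > "$demo_dir/server.crt"
printf 'new cert\n' > "$demo_dir/new.pem"

before=$(inode_of "$demo_dir/server.crt")

# In-place overwrite: the redirect truncates and rewrites the existing file.
cat "$demo_dir/new.pem" > "$demo_dir/server.crt"
after_redirect=$(inode_of "$demo_dir/server.crt")

# Replace-by-rename: the path now points at a different file entirely.
cp "$demo_dir/new.pem" "$demo_dir/server.crt.tmp"
mv "$demo_dir/server.crt.tmp" "$demo_dir/server.crt"
after_rename=$(inode_of "$demo_dir/server.crt")

echo "before=$before redirect=$after_redirect rename=$after_rename"
```

The inode number stays the same after the redirect and changes after the rename. A process holding the old file open would still be looking at the pre-rename inode.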

Mosquitto does not hot-reload TLS at all. The backend reads the CA at startup for MQTT verification. Nginx can reload, but a full restart takes inode handling out of the picture entirely, so we restart it too. That is what the deploy-hook does, in order, every time a renewal lands.
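
A deploy hook along these lines might look like the sketch below. It is an illustration under stated assumptions, not the hook that ships in the repo: the server.key file name, the KEYS_DIR and SERVICES variables, and the restart command are all hypothetical. RENEWED_LINEAGE is the environment variable certbot sets for deploy hooks, pointing at the live directory of the renewed certificate, and fullchain.pem and privkey.pem are the standard file names inside it.

```shell
#!/bin/sh
# Hypothetical deploy hook. certbot sets RENEWED_LINEAGE to the live directory
# of the certificate that was just renewed.
# KEYS_DIR, the target file names, and the restart command are assumptions.

KEYS_DIR="${KEYS_DIR:-/keys}"
SERVICES="${SERVICES:-frontend mosquitto backend}"
RESTART_CMD="${RESTART_CMD:-docker compose restart}"

install_certs() {
    # Overwrite in place with redirection so the existing inodes are reused.
    cat "$RENEWED_LINEAGE/fullchain.pem" > "$KEYS_DIR/server.crt"
    cat "$RENEWED_LINEAGE/privkey.pem"   > "$KEYS_DIR/server.key"
}

restart_consumers() {
    # Restart each TLS-consuming service in a fixed order; report failures
    # on stderr so they land in the container logs.
    for svc in $SERVICES; do
        $RESTART_CMD "$svc" || echo "restart of $svc failed" >&2
    done
}

# Do nothing unless certbot invoked us after a renewal.
if [ -n "${RENEWED_LINEAGE:-}" ]; then
    install_certs && restart_consumers
fi
```

Keeping the restart command in a variable makes the hook easy to dry-run: setting RESTART_CMD to "echo restart" prints the restarts instead of performing them.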

What this means for future deployments

Operationally, less. The README is shorter. You no longer have to remember to add a cron entry or create a log directory or know which path certbot lives at on your host. You bring the stack up, and it renews itself.

Architecturally, more. This is the model we want every long-lived service in the platform to follow: if it has a maintenance task, that task belongs inside the same compose file as the service itself, declared and version-controlled with the rest. Headwaters already works this way for OTA, for log rotation, for tile updates. Farwatch is now consistent with that.

And the broader takeaway is the one that every field trip seems to deliver. Things you assume work because they worked once need a second pass with the question "what happens if nobody touches this for three months?" We have a list of things to check against that question now. Cert renewal was the first one.

The changes are live in the Farwatch repo. If you are running your own Farwatch instance, the upgrade is a git pull and a docker compose up -d --build. The old scripts/renew-certs.sh is gone. If you had it wired up in cron, you can remove the entry. The container will take it from here.