Certificates are a fundamental part of the Internet’s security. Ever since Let’s Encrypt, a free and automated Certificate Authority, started its service, SSL/TLS has been used almost everywhere. To avoid certificate issues and possible service outages, it’s a good idea to monitor the SSL certificates used by your services, especially as Let’s Encrypt certificates have a short lifetime of 90 days.
I’m using Prometheus to monitor my infrastructure, and with Prometheus there are multiple ways to get started. Most of the tutorials and posts on the internet cover the case of expired certificates, which is pretty easy to handle. Instead of dedicated Prometheus exporters, I prefer to use Telegraf, a plugin-based metrics collector that also provides Prometheus-compatible outputs. To monitor SSL certificates, I’m using Telegraf’s x509_cert input plugin, which provides a metric called x509_cert_expiry that can be used to write simple alerting rules. That’s actually pretty cool already, as Prometheus will send out alerts a few weeks before the certificates expire if there is a problem with the automatic renewal process.
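To make the idea behind such an expiry alert concrete, here is a minimal Python sketch using only the standard library. It is not how Telegraf implements x509_cert_expiry; the function names and the 21-day threshold are my own assumptions for illustration. It reads the notAfter date of a server certificate and checks whether the remaining time has dropped below a warning threshold.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect to host:port and return the certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def seconds_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Seconds left until the certificate expires (negative if already expired)."""
    if now is None:
        now = time.time()
    return ssl.cert_time_to_seconds(not_after) - now


def should_alert(not_after: str, warn_days: int = 21,
                 now: Optional[float] = None) -> bool:
    """True if the certificate expires within warn_days (assumed threshold)."""
    return seconds_until_expiry(not_after, now) < warn_days * 86400
```

Calling should_alert(fetch_not_after("example.org")) would then tell you whether that host needs attention. Note that a check like this only looks at the validity dates of the certificate, nothing else.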
A week ago, Let’s Encrypt informed affected users that certificates issued and validated with the TLS-ALPN-01 challenge were faulty and had to be revoked. Even though I’m using the DNS-01 challenge for almost all of my certificates, I also received a mail and started to look into it. Sadly, the notification mail only contained a “random” ACME registration ID, and I was not able to find the matching client. As mentioned, I don’t really use TLS-ALPN-01, so I decided to stop the research and leave it to my monitoring to tell me which forgotten service was the evil one once the certificates had been revoked. Nothing happened after the revocation, and the monitoring did not complain. Good - well, no: a user reported that one of the services was not reachable anymore, and of course this was the one missing client that was using TLS-ALPN-01 verified certificates - dang. While the issue itself was easy to resolve by force-renewing the certificate, I was still wondering why the monitoring had not caught it.
Well, this was the first time that I had to deal with revoked certificates instead of expired ones. To be honest, I had never thought about detecting revoked certificates in my monitoring setup before, and therefore this case wasn’t covered. But it looks like a fix is not as straightforward as expected either. The Telegraf input x509_cert that I use is not able to detect revoked certificates yet, and the common Prometheus blackbox_exporter does not want to handle this case either. The only way I have found so far is to use the ssl_exporter, which provides some revocation information about the certificates using OCSP. If you are already running multiple exporters, that might be the way to go for you. Personally, I prefer to handle as much as possible using Telegraf, so I might look into a fix for the x509_cert plugin during the next weeks. However, lessons learned 📘