I build a tool that monitors servers so I don't have to think about them. Last Thursday, I spent an hour thinking about a server because my tool told me something was wrong — and then told me everything was fine.
Three sources. Three different answers.
The alert came in around midday. Server landing-app-club-prod. Message: Service open-vm-tools down. I put down my coffee and went to check.
First thing I always do: /healthcheck. The response came back clean — uptime 32 days, load 0.00, disk at 2%, memory at 4%. Then the summary line I built specifically for moments like this: All good. Then I ran /services. Twenty-four active, zero failed. Summary: All services running.
So I had an alert saying something was down. I had healthcheck saying everything was fine. I had the service list saying the same. Three surfaces, three answers, and they couldn't all be right. In my experience, when monitoring tells you contradictory things, one of two situations is happening: either something is genuinely broken and the monitoring is confused, or nothing is broken and the monitoring is confused. Either way, the monitoring is confused.
The irony of building an on-call tool and then not being able to trust your own on-call tool is not lost on me.
Every step produced a new contradiction.
I tried the natural language interface next. I typed: "what is the status of open-vm-tools?" — which feels like a completely reasonable thing to ask when you're trying to understand what's happening with a service.
The bot didn't show me a status. It offered me a restart.
There was a 60-second confirmation timeout. I pressed No. Not because I was sure restart was wrong, but because I had no idea whether it was right. You don't restart things you don't understand. That's how you turn a false alarm into an actual incident.
So I went manual. /exec systemctl status open-vm-tools. The response came back marked with ❌ — which in my UI means the command failed. Except the output showed something completely different:
Active: inactive (dead)
Condition: start condition unmet at Thu 2026-02-26
Type=simple
The command ran fine. The service wasn't crashed. It was inactive because a start condition was unmet. That's a completely different thing. But my UI showed ❌ anyway, because systemctl status exits with a non-zero code when a service is inactive — and I was treating non-zero exit = error, without looking at why.
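For the record, systemctl status uses LSB-style exit codes, so "non-zero" covers several very different situations. Here's a minimal sketch of the gap between my old interpretation and what the exit code actually says. Python is just for illustration here, and the names are mine, not the tool's:

```python
import subprocess

# LSB-style exit codes used by `systemctl status` (see systemctl(1)):
#   0 = unit is active, 3 = unit is not running, 4 = no such unit.
STATUS_EXIT_MEANING = {
    0: "active",
    3: "not running (not necessarily broken)",
    4: "no such unit",
}

def check(unit: str) -> None:
    proc = subprocess.run(
        ["systemctl", "status", unit],
        capture_output=True, text=True,
    )
    # What I was doing: any non-zero exit code == "command failed" == ❌
    naive_verdict = "✅" if proc.returncode == 0 else "❌"
    # What the exit code actually tells you:
    meaning = STATUS_EXIT_MEANING.get(proc.returncode, "other failure")
    print(f"exit={proc.returncode} naive={naive_verdict} actual={meaning}")
```

An inactive service returns 3: the command worked, the answer was just "not running."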
I tried to disable the service to stop the alerts. Permission error. I tried through natural language. The bot suggested restart again. The session ended without resolution. I had learned a lot and fixed nothing.
inactive doesn't mean down. This took me an hour to internalize.
Here's the actual root cause, and it's embarrassingly simple once you see it.
open-vm-tools is a VMware integration daemon. It's designed to run inside VMware virtual machines and communicate with the hypervisor. My server is not in a VMware environment. So when systemd evaluated the start conditions for this service, it concluded: not applicable here, and left the service in inactive (dead) state — not because it crashed, but because the conditions for starting it were never met.
This is correct, documented behavior. The service is doing exactly what it should do on a non-VMware host: nothing.
My monitoring logic didn't know about this. Its model of the world was: active means okay, inactive means down, down means incident. There was no third state: inactive because not applicable. The system saw an inactive daemon-type service and fired an alert. Technically correct within its own rules. Completely wrong in practice.
I had fixed this problem twice before. I just hadn't finished fixing it.
This is the part that stings a little. I've run into similar situations with packagekit and e2scrub_reap — services that appear inactive on many servers but aren't actually broken. I handled those by excluding them based on service type: they're dbus and oneshot units, types that aren't expected to stay running. I had learned the lesson, but only partially.
open-vm-tools uses type simple — a standard persistent daemon. That's why it wasn't excluded. The monitoring logic saw "daemon, inactive" and treated it the same as a web server that had crashed. The fix for packagekit didn't generalize to open-vm-tools because they looked structurally different, even though the underlying situation was identical: a service inactive not because it failed, but because it was never supposed to run here.
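Roughly, and with illustrative names rather than the real code, the rule that fired looked like this:

```python
# Simplified sketch of the old alerting rule (names are illustrative).
# Types that never stay "active" by design, added after the
# packagekit / e2scrub_reap false alarms:
EXCLUDED_TYPES = {"oneshot", "dbus"}

def should_alert(service_type: str, active_state: str) -> bool:
    if service_type in EXCLUDED_TYPES:
        return False                  # known "not meant to stay running" types
    return active_state != "active"   # everything else: inactive == down == incident

# open-vm-tools is Type=simple and ActiveState=inactive, so the rule fires,
# even though the unit is inactive only because its start condition was never met.
print(should_alert("simple", "inactive"))   # True -- the false alarm
```

The exclusion list solved the instances I had seen, not the category they belonged to.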
Every time I think I've solved a class of problem, I've usually solved one instance of it.
The deeper issue: action before diagnosis.
The restart offer bothered me more than anything else. When I asked "what is the status of this service," the system heard "this service might be broken" and jumped straight to mitigation. Which sounds helpful, until you realize that restarting a service you don't understand on a production server is one of the fastest ways to cause the actual outage you were worried about.
Good on-call tooling should give you diagnosis before it gives you options. The sequence should be: here's what's happening, here's why we think that, here's what you might want to do. Not: something might be wrong, want to restart it?
The ❌ on the exec output was a smaller version of the same problem. A non-zero exit code is a fact. What it means depends on context. Showing a red X without that context doesn't inform — it misleads.
What's getting fixed.
The monitoring now needs to parse ConditionResult from systemd output. If a service is inactive because a start condition was unmet — environment check failed, dependency not present, platform not matching — that is not an outage. No incident should open. No alert should fire.
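The direction looks roughly like this. ConditionResult, ActiveState, and SubState are real systemd properties exposed by systemctl show; the surrounding Python is only a sketch of how I plan to use them, not the shipped code:

```python
import subprocess

def unit_props(unit: str) -> dict[str, str]:
    """Read a few unit properties via `systemctl show` (key=value lines)."""
    out = subprocess.run(
        ["systemctl", "show", unit,
         "--property=ActiveState,SubState,ConditionResult"],
        capture_output=True, text=True,
    ).stdout
    return dict(line.split("=", 1) for line in out.splitlines() if "=" in line)

def should_alert(unit: str) -> bool:
    props = unit_props(unit)
    if props.get("ActiveState") == "active":
        return False
    # Inactive because a start condition was not met: not an outage, no incident.
    if props.get("ConditionResult") == "no":
        return False
    return True
```

On a non-VMware host, open-vm-tools reports ConditionResult=no, and the check stays quiet.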
I'm also adding a /status <service> command that returns something actually readable: what the service is, why it's in its current state, what the reasonable next steps are. And NL routing needs to distinguish between "tell me about X" and "fix X." Those are different requests and they should produce different responses.
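The routing sketch below is deliberately naive keyword matching; the real classifier will be smarter. The point is the contract, not the mechanism: a read intent never gets an action prompt as its answer.

```python
# Illustrative only: the contract is "tell me about X" routes to diagnosis,
# "fix X" routes to a confirm-before-acting flow, and the default is read-only.
READ_WORDS = {"status", "what", "why", "show", "explain"}
FIX_WORDS = {"restart", "stop", "disable", "fix"}

def route(message: str) -> str:
    words = set(message.lower().split())
    if words & FIX_WORDS:
        return "confirm_action"   # mutating intent: show a plan, then ask
    if words & READ_WORDS:
        return "diagnose"         # read intent: status + reason, no buttons
    return "diagnose"             # when unsure, take the safe, read-only path

print(route("what is the status of open-vm-tools?"))   # diagnose
```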
The ❌/⚠️ distinction matters too. A failed command is ❌. An informative result that happened to exit non-zero is ⚠️. I'll stop conflating them.
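For systemctl status specifically, the marker logic becomes something like this (hypothetical helper, same caveat as above):

```python
def marker_for_systemctl_status(returncode: int) -> str:
    """Non-zero exit is a fact; the marker is an interpretation of that fact."""
    if returncode == 0:
        return "✅"   # unit is active
    if returncode == 3:
        return "⚠️"   # informative: unit is not running, but the command itself worked
    return "❌"       # the command genuinely failed
```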
The uncomfortable conclusion.
The alert wasn't wrong to fire. The monitoring saw an inactive daemon and did what it was configured to do. The problem was that the configuration didn't model reality well enough. inactive has more than one meaning, and I had only accounted for one.
Monitoring that generates noise trains you to ignore it. And an ignored alert is worse than no alert — because it's still consuming your attention, just not productively. The goal isn't fewer alerts. The goal is alerts where signal and noise are actually different things.
If you're running servers without a dedicated ops team and want monitoring that tries to explain before it acts, mttrly connects to your servers via Telegram and the Watchdog tier is free. You can see what it covers at /use-cases/on-call.
I'll report back when I've stopped getting paged by my own software.