Website is down: how to diagnose and fix it
A fixed diagnosis order — reachability, ports, web server, config, SSL, logs — and how the mttrly website-down recipe runs the same checks in one pass.
Direct answer
When a production website goes down, diagnose in a fixed order instead of guessing: is the server reachable, are ports 80/443 listening, is the web server process running, does the nginx config pass nginx -t, is the SSL certificate valid, and what do the last 100 lines of the error log say. mttrly packages this sequence as the website-down diagnostic recipe; restarts and reloads that would change server state run only after explicit human approval.
What you see
The browser rarely tells you the real cause. The same outage can surface as any of these:
Browser: "This site can’t be reached" / ERR_CONNECTION_REFUSED or ERR_CONNECTION_TIMED_OUT
502 Bad Gateway or 503 Service Unavailable — the web server is up but its upstream is not answering
Uptime monitor alert (UptimeRobot, Pingdom, healthcheck) while SSH still works fine
curl -I https://yoursite.com fails or hangs
nginx error log: "nginx: [emerg] unexpected ‘}’ in /etc/nginx/sites-enabled/default:47" — config syntax error after an edit or deploy
SSL warning in the browser — an expired certificate can look like "the site is down" to users
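To turn a failing curl into a first hint, the exit code can be mapped to the symptoms above. This is a minimal sketch: the exit codes come from the curl manual, while the function name classify_curl_exit and the wording of each hint are illustrative assumptions, not part of mttrly.

```shell
# Map a curl exit code to the symptom it usually indicates.
# Exit codes follow the curl manual; the hint text is illustrative.
classify_curl_exit() {
  case "$1" in
    0)  echo "ok: server answered (check the HTTP status separately)" ;;
    6)  echo "dns: could not resolve host" ;;
    7)  echo "refused: nothing accepting connections on that port" ;;
    28) echo "timeout: host unreachable or packets dropped" ;;
    35) echo "tls: handshake failed" ;;
    60) echo "tls: certificate not trusted or expired" ;;
    *)  echo "other: see 'man curl' for exit code $1" ;;
  esac
}

# Usage (needs network access, so it is left commented out):
#   curl -sS -o /dev/null -I --max-time 10 https://yoursite.com
#   classify_curl_exit "$?"
```

Exit code 0 only means curl received a response, so a 502 or 503 still counts as "answered" unless you also inspect the status line.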
How to fix it manually
The classic SSH triage, ordered so each step narrows the cause. Steps 1–6 are read-only; only the restart changes state.
Confirm the server itself is reachable
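Rule out a provider outage or network issue before debugging the box. Some hosts drop ICMP, so a failed ping is not conclusive on its own: if SSH still connects, the server is up.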
ping -c 3 yourserver.example.com
Check that something is listening on 80/443
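If nothing is listening, the web server is down or bound to the wrong address or port. The pattern below matches ports 80 and 443 exactly, so a listener on 8080 does not produce a false positive.
ss -tlnp | grep -E ':(80|443)[[:space:]]'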
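If you want a one-line verdict instead of reading the ss table, the output can be parsed. The sketch below assumes the usual ss -tln column layout (Local Address:Port in column 4); the function name report_web_ports is hypothetical.

```shell
# Read `ss -tln` output on stdin and report whether ports 80 and 443
# have a listener. The port is the text after the last ':' in column 4,
# so 8080 or 4430 are not mistaken for 80 or 443.
report_web_ports() {
  awk '
    $1 == "LISTEN" {
      n = split($4, parts, ":")
      seen[parts[n]] = 1
    }
    END {
      printf "80:%s 443:%s\n", (seen["80"] ? "up" : "down"), (seen["443"] ? "up" : "down")
    }'
}

# Usage:
#   ss -tln | report_web_ports
```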
Check the web server process
Is nginx (or your web server) actually running, and since when?
systemctl status nginx
Test the nginx configuration
A config syntax error after an edit or deploy is one of the most common causes — nginx refuses to start or reload until it is fixed.
nginx -t
Check the SSL certificate
Browsers put an expired certificate behind a full-page warning, so for most visitors the site is effectively unreachable.
echo | openssl s_client -connect yoursite.com:443 2>/dev/null | openssl x509 -noout -dates
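The dates are easier to act on as "days left". Here is a small sketch that converts the notAfter line into whole days remaining. It assumes GNU date (for date -d), and the function name cert_days_left is hypothetical.

```shell
# Convert the "notAfter=" line printed by `openssl x509 -noout -enddate`
# into whole days remaining (negative once the certificate has expired).
# Assumes GNU date. $1: the notAfter line; $2: "now" in epoch seconds
# (optional, defaults to the current time).
cert_days_left() {
  local expiry_text now expiry
  expiry_text="${1#notAfter=}"
  now="${2:-$(date +%s)}"
  expiry=$(date -u -d "$expiry_text" +%s) || return 1
  echo $(( (expiry - now) / 86400 ))
}

# Usage (needs network access, so it is left commented out):
#   line=$(echo | openssl s_client -connect yoursite.com:443 2>/dev/null \
#            | openssl x509 -noout -enddate)
#   cert_days_left "$line"
```

Using -enddate instead of -dates prints only the notAfter line, which keeps the parsing trivial.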
Read the last 100 lines of the error log
The error log usually names the exact failure: upstream refused, emerg, permission denied.
tail -n 100 /var/log/nginx/error.log
journalctl -u nginx -n 100
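When 100 lines are too many to skim, counting entries per severity shows where to look first. This sketch relies on nginx writing the level in square brackets, as in the [emerg] example earlier; the function name summarize_nginx_levels is hypothetical.

```shell
# Count nginx error-log lines per severity level, reading the log on stdin.
# Only the first bracketed level on each line is counted.
summarize_nginx_levels() {
  awk '
    match($0, /\[(emerg|alert|crit|error|warn)\]/) {
      level = substr($0, RSTART + 1, RLENGTH - 2)
      count[level]++
    }
    END {
      printf "emerg=%d alert=%d crit=%d error=%d warn=%d\n",
        count["emerg"], count["alert"], count["crit"], count["error"], count["warn"]
    }'
}

# Usage:
#   tail -n 100 /var/log/nginx/error.log | summarize_nginx_levels
```

Any non-zero emerg count points back to the configuration test; a run of error entries usually points to the upstream or to file permissions.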
Fix the cause, then restart or reload
Fix what the evidence points to (config typo, dead upstream, expired cert) and only then restart. Restarting without a diagnosis often just resets the timer on the same failure.
systemctl restart nginx # or: nginx -s reload after a config fix
If nginx -t reports an error, do not restart nginx — a running instance keeps serving the old config, but a restart with a broken config takes the site fully down. Fix the config first, verify with nginx -t, then reload.
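That rule can be enforced with a small wrapper that refuses to touch the running instance unless the config test passes. This is a sketch, not a hardened script: the function name reload_if_config_ok and the NGINX_BIN override are assumptions added for illustration, and it normally needs root.

```shell
# Reload nginx only when `nginx -t` succeeds. On failure the running
# instance is left alone and keeps serving the old config.
# NGINX_BIN is a hypothetical override for the nginx binary path.
reload_if_config_ok() {
  local nginx_bin
  nginx_bin="${NGINX_BIN:-nginx}"
  if "$nginx_bin" -t; then
    "$nginx_bin" -s reload && echo "reloaded"
  else
    echo "config test failed; nginx left untouched" >&2
    return 1
  fi
}

# Usage (as root):
#   reload_if_config_ok
```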
How mttrly handles it
mttrly ships a diagnostic recipe named website-down that runs this checklist in one pass — from Telegram or from an AI IDE (Claude Code, Cursor, Codex) connected over MCP — and returns a structured answer instead of raw terminal output.
You ask, mttrly runs the website-down recipe
From chat or via the mttrly_run_diagnostic MCP tool. The recipe is a defined sequence in the product, not an ad-hoc shell session.
Steps 1-2 — check_port playbook (read-only)
Port 80 and port 443 availability: is anything listening where traffic arrives?
Step 3 — nginx_test playbook (read-only)
Runs the nginx configuration test and captures any [emerg] errors verbatim.
Step 4 — cert_check playbook (read-only)
SSL certificate validity and expiry dates.
Step 5 — log_tail playbook (read-only)
The last 100 lines of nginx logs, scanned for errors.
Step 6 — healthcheck + AI root-cause summary
Overall server health, then the AI condenses all signals into a likely root cause and a suggested next step.
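The recipe's internals are not shown on this page. As a rough local analogue of running a checklist in one pass, the sketch below runs a list of check functions in order and prints one status line per check. The name run_checks_in_order and the example check names are hypothetical, not mttrly identifiers.

```shell
# Run the named check functions or commands in order and print one
# status line per check. Returns non-zero if any check failed.
run_checks_in_order() {
  local failed=0 check
  for check in "$@"; do
    if "$check" >/dev/null 2>&1; then
      echo "$check: ok"
    else
      echo "$check: FAILED"
      failed=1
    fi
  done
  return "$failed"
}

# Usage, with checks you define yourself:
#   config_ok() { nginx -t; }
#   service_up() { systemctl is-active --quiet nginx; }
#   run_checks_in_order service_up config_ok
```

Every check runs even after an earlier one fails, which matches the idea of collecting all signals before deciding on a cause.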
Example workflow
A realistic session — the same scenario shown on the mttrly homepage (values illustrative):
You -> Claude Code (or Telegram):
"My website is down, can you check why?"
Claude -> mttrly:
calls mttrly_run_diagnostic { recipe: "website-down" }
mttrly returns:
- check_port: 80 open, 443 open
- nginx_test: FAILED — "nginx: [emerg] unexpected '}' in
/etc/nginx/sites-enabled/default:47"
- cert_check: certificate valid, 71 days left
- log_tail: nginx error log confirms the emerg since the last config edit
- healthcheck: server healthy otherwise
Claude -> You:
"Your nginx config has a syntax error at line 47 — likely from the last edit.
Fix the typo, then I can reload nginx via the nginx_reload playbook.
That's an approve-required action — confirm and I'll create the request."
You: fix the line, approve the reload from Telegram
mttrly: reloads nginx, verifies the site responds, records it all in the audit log
Remediation, gated by approval
nginx_reload playbook — applies a corrected config. Approve-required: invoking it via mttrly_run_playbook creates a pending action that you confirm from Telegram, the dashboard, or your IDE before anything runs.
restart_service playbook — restarts nginx or an upstream app service when the process is dead or wedged. Also approve-required, with the result verified and written to the audit log.
mttrly does not edit your nginx config. The nginx_test playbook reads and reports the exact error; the fix in the file is yours to make — then the reload can go through the approval flow.
Diagnostics are read-only. Fixes need your approval.
- mttrly investigates on its own: diagnostic recipes and read-only playbooks inspect the server without changing anything.
- Any state-changing fix creates a pending action. Nothing executes until a human (you, not the AI) approves it from Telegram, the dashboard, or the IDE confirmation flow.
- Every diagnostic, pending action, approval, and result is recorded in the audit log.
- mttrly does not auto-fix production. The agent diagnoses and suggests a remediation path; the decision stays with you.
Detection tools tell you something is wrong. mttrly is the agent on your server that diagnoses the incident and prepares the fix — through scoped MCP tools, diagnostic recipes, and remediation playbooks, with human approval in front of every state change. It complements monitoring like Grafana, Datadog, or UptimeRobot; it does not replace it.
When mttrly is not enough
The website-down recipe covers what is visible from the server itself. Outside that boundary:
- DNS and registrar problems: if the domain does not resolve, the fix lives at your DNS provider, not on the server.
- Provider-level outages: when the host or its network is down, use the provider status page and console; the agent cannot report from an unreachable machine.
- Application code bugs: mttrly surfaces process state and log signals, but it does not debug your application code.
- Agent offline: there are no live diagnostics until you restore connectivity via your provider console or SSH.
Frequently asked questions
Can mttrly restart nginx automatically when the site goes down?
No. The website-down diagnostic recipe is read-only. Restarting or reloading nginx goes through restart_service or nginx_reload — approve-required playbooks that create a pending action and wait for a human to confirm from Telegram, the dashboard, or an IDE. Every action lands in the audit log.
Will mttrly edit my nginx config to fix the error?
No. The nginx_test playbook runs the config test and reports the exact error line, but mttrly does not edit the config file. You make the fix; mttrly can then apply it through an approval-gated reload.
What if the whole server is unreachable, not just the website?
If the server or the agent is offline, mttrly cannot run live diagnostics on it. mttrly can alert you that the agent went silent, but recovery starts at your provider console or SSH. Once the server is back, the website-down recipe can take over the diagnosis.
Can I run this from Claude Code or Cursor?
Yes. mttrly exposes 40 MCP tools; mttrly_run_diagnostic runs the website-down recipe, and mttrly_run_playbook requests remediation. The AI investigates with read-only tools on its own, and state-changing playbooks always stop at a human approval.
Next time the site goes down, diagnose it in one message
Connect a server in a few minutes. Free monitoring and read-only diagnostics first; approval-gated restarts when you are ready.