Protecting Self‑Hosted Services During Big Provider Outages: Monitoring & Alerting Cookbook
Concrete Prometheus + synthetic check recipes to detect Cloudflare/AWS outages and trigger safe failover for self‑hosted services.
Your service is up — until a CDN or cloud provider goes down
Large-scale outages at Cloudflare and AWS are no longer once-in-a-blue-moon headlines; late 2025 incidents taught operations teams a hard lesson: your self‑hosted apps can be unreachable even though your origin VM is healthy. If you rely on CDNs, DNS, or cloud-managed networking, you need targeted monitoring and automated failover playbooks that distinguish provider failures from real application faults.
This cookbook gives you concrete, production-ready recipes — Prometheus + blackbox probes, Grafana dashboards and synthetic checks, Alertmanager automation, and failover scripts for DNS/Kubernetes/Proxmox/systemd — tuned specifically to detect Cloudflare outages and AWS incidents and trigger safe, auditable failover steps.
Why this matters in 2026
Since late 2024 and through 2025 the industry consolidated around a few large CDNs and cloud providers. By 2026 multi‑CDN, edge compute, and multi‑cloud patterns are mature, but outages still happen — sometimes affecting DNS, TLS termination, or edge routing. That means your service can be functionally fine at the origin but totally unreachable to real users. The monitoring approach here focuses on detecting provider‑specific failure signals and performing controlled, reversible failover.
Detection strategy — principles
- Synthetic checks from multiple vantage points are your primary signal — uptime consoles and internal metrics are insufficient when the provider sits between users and you.
- Compare proxied vs origin: detect when the CDN/DNS path fails but the origin is healthy.
- Multi‑probe consensus: require N of M probes and cross‑region checks to avoid false positives from single‑region ISP issues.
- Automate safe failover: prefer reversible changes (DNS TTL, CNAME swap, ingress reconfiguration) over destructive actions.
Key signals to monitor
- Probe HTTP(S) status and body from proxied domain (example.com).
- Probe the direct origin hostname or IP (origin.example.internal or origin.example.com with Cloudflare disabled).
- DNS resolution failures or unexpected CNAME changes from Route53/Cloudflare APIs.
- Increased 5xx rate or latency from requests that go through AWS-managed services (S3, ALB, RDS).
- Provider status API indicators (AWS Personal Health Dashboard, Cloudflare status) as secondary signals.
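Provider status pages are corroboration, not a primary trigger. A minimal sketch for polling one of them, assuming the public Cloudflare Statuspage JSON endpoint and a node_exporter textfile collector directory at /var/lib/node_exporter/textfile (both the URL and the path are assumptions to adjust for your setup):

#!/bin/bash
# Cron-driven sketch: record the Cloudflare status-page indicator as a Prometheus metric
# via the node_exporter textfile collector. Endpoint and paths are assumptions.
set -euo pipefail
INDICATOR=$(curl -fsS https://www.cloudflarestatus.com/api/v2/status.json | jq -r '.status.indicator')
cat > /var/lib/node_exporter/textfile/cloudflare_status.prom <<EOF
# HELP cloudflare_status_indicator Cloudflare status page indicator (none/minor/major/critical)
# TYPE cloudflare_status_indicator gauge
cloudflare_status_indicator{indicator="${INDICATOR}"} 1
EOF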
Recipe 1: Prometheus + blackbox_exporter to detect Cloudflare outages
Architecture: run blackbox_exporter from at least three vantage points (home region, cloud provider A, cloud provider B) and probe both the proxied domain and the direct origin. Store metrics in Prometheus and configure alerting rules that fire only when the proxied endpoint fails across multiple probes but the direct origin stays healthy.
1) Deploy blackbox_exporter
Simple Docker example to run on a small VPS or a Kubernetes pod:
docker run --rm -p 9115:9115 --name blackbox \
  -v $(pwd)/blackbox.yml:/etc/blackbox_exporter/config.yml \
  prom/blackbox-exporter:latest
blackbox.yml (HTTP module):
modules:
  http_2xx:
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      method: GET
      tls_config:
        insecure_skip_verify: false
      preferred_ip_protocol: "ip4"
2) Prometheus scrape_config
scrape_configs:
  - job_name: "blackbox-probes"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com               # proxied (Cloudflare)
          - https://origin.example.internal   # origin (bypass Cloudflare)
        labels:
          site: "probe-a"   # repeat the job per vantage point, each with its own site label and blackbox address
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
3) Prometheus alert rule: suspected Cloudflare outage
This rule fires only when the proxied endpoint fails from at least two vantage points (distinct site labels) while the origin stays healthy.
groups:
  - name: provider-outage.rules
    rules:
      - alert: CloudflarePathFailure
        expr: |
          (
            count(probe_success{job="blackbox-probes", instance="https://example.com"} == 0) >= 2
          )
          and
          (
            count(probe_success{job="blackbox-probes", instance="https://origin.example.internal"} == 1) >= 1
          )
        for: 90s
        labels:
          severity: critical
        annotations:
          summary: "Possible Cloudflare outage affecting example.com"
          description: "Proxied endpoint failing from two or more probe sites but the origin is healthy. Investigate the CDN/DNS path."
Why this works: if the proxied domain fails everywhere but origin responds, the problem is almost certainly at the CDN, DNS, or edge layer — not your app.
Recipe 2: Detect AWS service degradation that affects your stack
AWS incidents often surface as increased latencies, 5xx responses from S3/ALB, or failing DNS resolution for Route53. Use synthetic checks to target the cloud‑managed components your app depends on.
Example probes
- GET a small object from your S3 bucket through the public URL (and via an S3 origin behind a CDN).
- TCP connect to an RDS instance endpoint and run a lightweight probe (mysqladmin ping for MySQL, pg_isready for Postgres); see the blackbox module sketch after this list.
- Watch ALB/ELB responses for unexpected 5xx status codes and increased HTTP/2 connection resets.
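A minimal blackbox_exporter module sketch for the S3 and RDS probes above (the module names tcp_rds and http_s3_object are illustrative; the tcp prober only confirms that the endpoint accepts connections, so pair it with mysqld_exporter or postgres_exporter for deeper health):

# blackbox.yml additions (sketch)
modules:
  tcp_rds:
    prober: tcp
    timeout: 5s
    tcp:
      preferred_ip_protocol: "ip4"
  http_s3_object:
    prober: http
    timeout: 10s
    http:
      method: GET
      valid_status_codes: [200]
      preferred_ip_protocol: "ip4"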
Prometheus pattern
Use blackbox for HTTP/TCP checks and exporters (mysqld_exporter, postgres_exporter) for DB health. Alert when the cloud dependency is flaky while your on‑prem alternative is healthy:
- alert: AWS_S3_Degraded
  expr: |
    (
      max(probe_success{job="blackbox-probes", instance="https://s3.amazonaws.com/your-bucket/health-check"}) == 0
    )
    and
    (
      max(probe_success{job="blackbox-probes", instance="https://your-secondary-storage.example/health"}) == 1
    )
  for: 2m
Actionability: on this alert you should failover read traffic to your MinIO replica or switch object reads to a cached tier, not reboot your app servers.
Failover automation — safe, reversible actions
Automated failover must be reversible, auditable, and minimally disruptive. Common safe actions:
- DNS swap (low TTL): change CNAME/A records to point to a bypass domain or secondary provider via API (see the Route53 sketch after this list).
- Toggle CDN proxy: Cloudflare lets you switch proxy off for a hostname via API, exposing the origin directly.
- Ingress switch in Kubernetes: update Ingress or change ExternalDNS pointers to a different cluster.
- Start standby VM in Proxmox and update DNS to the new IP.
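For the DNS-swap action above, a hedged sketch using the AWS CLI against Route 53; the hosted zone ID, record name, and bypass target are placeholders:

# Repoint www.example.com at a bypass CNAME with a 60s TTL; the UPSERT is reversible
# by running the same command again with the original value.
aws route53 change-resource-record-sets \
  --hosted-zone-id "${HOSTED_ZONE_ID}" \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "bypass.example.com"}]
      }
    }]
  }'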
Alertmanager webhook + failover script
Configure Alertmanager to call a webhook that performs the DNS swap via your DNS provider API. Here's an example Alertmanager receiver and a minimal bash script to toggle a Cloudflare hostname proxy off (as a bypass) or swap a CNAME to a secondary domain.
# alertmanager.yml (snippet)
receivers:
  - name: 'dns-failover'
    webhook_configs:
      - url: 'https://monitoring.example.com/alert-hook'
#!/bin/bash
# /opt/bin/cf-bypass.sh -- toggle the Cloudflare proxy flag on a single DNS record.
set -euo pipefail

CF_ZONE_ID=""                 # zone ID from the Cloudflare dashboard
CF_RECORD_ID=""               # DNS record ID for the hostname to toggle
CF_TOKEN="${CF_API_TOKEN}"    # scoped API token with DNS edit permission
MODE="$1"                     # "bypass" or "restore"

if [ "$MODE" = "bypass" ]; then
  curl -s -X PATCH "https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/dns_records/${CF_RECORD_ID}" \
    -H "Authorization: Bearer ${CF_TOKEN}" \
    -H "Content-Type: application/json" \
    --data '{"proxied":false}'
  echo "Cloudflare bypass requested"
else
  curl -s -X PATCH "https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/dns_records/${CF_RECORD_ID}" \
    -H "Authorization: Bearer ${CF_TOKEN}" \
    -H "Content-Type: application/json" \
    --data '{"proxied":true}'
  echo "Cloudflare restore requested"
fi
Important: protect this webhook with mTLS or HMAC verification, require human confirmation for irreversible operations, and log every change to an audit trail (Syslog/ELK or a ticketing system). Consider integrating device identity and approval workflows for stronger verification and sign-off.
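As one way to do the HMAC verification, a shell sketch assuming the caller posts the raw alert payload (here in BODY) and an X-Signature header carrying a hex SHA-256 HMAC computed with a shared WEBHOOK_SECRET; a production handler should also use a constant-time comparison:

# Reject the webhook call unless the HMAC signature matches the shared secret.
EXPECTED=$(printf '%s' "${BODY}" | openssl dgst -sha256 -hmac "${WEBHOOK_SECRET}" | awk '{print $2}')
if [ "${EXPECTED}" != "${X_SIGNATURE}" ]; then
  echo "signature mismatch, dropping request" >&2
  exit 1
fi
/opt/bin/cf-bypass.sh bypass   # only runs after the signature check passes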
Kubernetes recipes: Emergency ingress and ExternalDNS
In Kubernetes environments you can prepare an emergency Ingress (or second cluster) that exposes the app without the CDN. Use ExternalDNS or an automation job to flip DNS records when a probe determines the CDN path is down.
Emergency Ingress manifest (example)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: emergency-ingress
  namespace: emergency
spec:
  ingressClassName: nginx
  rules:
    - host: bypass.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-service
                port:
                  number: 80
Workflow: on a CloudflarePathFailure alert, the webhook creates/updates this ingress and ExternalDNS maps bypass.example.com into DNS. Clients that resolve bypass.example.com will hit the cluster directly (bypassing Cloudflare). For small-scale or edge-hosted clusters, see guidance on edge field kits and micro-edge VPS patterns.
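If ExternalDNS manages your zone, a hedged sketch of annotations to add to the emergency Ingress so the bypass record is published with a short TTL (the ttl annotation works only with providers that support it):

metadata:
  annotations:
    external-dns.alpha.kubernetes.io/hostname: bypass.example.com
    external-dns.alpha.kubernetes.io/ttl: "60"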
Proxmox & systemd: on‑prem failover playbook
If you host VMs in Proxmox, maintain a preseeded VM template for critical services and a script that clones + starts a standby VM. Tie the cloning script to Alertmanager via webhook. For simple setups use systemd to run the cloning and DNS update operations.
[Unit]
Description=Start standby VM on provider outage
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/opt/bin/proxmox-start-standby.sh

[Install]
WantedBy=multi-user.target
Make sure the script performs health checks after boot and reports back to Prometheus (pushgateway or node_exporter metric) so you can revert DNS only when the standby is ready. For runbooks combining cloud and on‑prem recovery, consult an incident response playbook for cloud recovery teams to design safe automation and human checkpoints.
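A minimal sketch of /opt/bin/proxmox-start-standby.sh, assuming it runs on the Proxmox host, that VMID 9000 is the preseeded template and 9100 is free, and that the standby boots with a fixed IP via cloud-init; all values are illustrative:

#!/bin/bash
# Clone the preseeded template, start it, wait for it to serve traffic, then report readiness.
set -euo pipefail
TEMPLATE_ID=9000        # assumption: VMID of the preseeded template
STANDBY_ID=9100         # assumption: unused VMID for the standby clone
STANDBY_IP="192.0.2.10" # assumption: fixed IP assigned via cloud-init in the template
qm clone "${TEMPLATE_ID}" "${STANDBY_ID}" --name standby-web
qm start "${STANDBY_ID}"
# Only report ready once the guest answers its health endpoint.
until curl -fsS --max-time 5 "http://${STANDBY_IP}/healthz" >/dev/null; do sleep 5; done
# Tell Prometheus (via Pushgateway) that the standby is serving, so DNS can be flipped.
echo "standby_ready 1" | curl -s --data-binary @- http://pushgateway:9091/metrics/job/standby_failover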
Testing and drills — don’t wait for a real outage
Practice is everything. Schedule quarterly failover drills and automate them so your team becomes confident with the playbooks. Test both detection and remediation steps:
- Simulate CDN failure by adding a blackhole firewall rule on a probe machine or by using iptables to block Cloudflare's published IP ranges (see the drill sketch after this list).
- Run synthetic checks that intentionally fail to verify alert routing and webhook handling.
- Use chaos testing to randomize probe locations and latency to ensure your multi‑probe consensus logic stands up to noisy networks.
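A sketch of the firewall drill from the first bullet, assuming the probe VM uses iptables and that Cloudflare's published IPv4 ranges are available from https://www.cloudflare.com/ips-v4; run it only on the probe machine and revert afterwards:

# Blackhole outbound traffic to Cloudflare's edge from this probe VM.
for net in $(curl -fsS https://www.cloudflare.com/ips-v4); do
  iptables -I OUTPUT -d "${net}" -j DROP
done
# Revert: replace -I with -D and rerun the loop once the drill is over.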
Tuning alerts and avoiding noisy failover
Key tuning knobs:
- Consensus size: require multiple probe failures across different regions before triggering failover.
- Grace windows: hold alerts for 60–180s to avoid reacting to transient blips.
- Error budgets: align automatic failover to SLO thresholds — if you’re within your error budget, prefer manual escalation.
- Escalation flows: use Alertmanager routes and silence windows to coordinate human approval for non‑trivial actions.
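To wire the consensus alert into controlled escalation, a hedged alertmanager.yml sketch; the dns-failover receiver matches the earlier snippet, while oncall-default and the env label are assumptions about your setup:

route:
  receiver: 'oncall-default'
  routes:
    - matchers: ['alertname="CloudflarePathFailure"', 'severity="critical"']
      receiver: 'dns-failover'
      group_wait: 1m        # extra grace before the failover webhook fires
inhibit_rules:
  # Silence per-service noise while the provider-level alert is active.
  - source_matchers: ['alertname="CloudflarePathFailure"']
    target_matchers: ['severity="warning"']
    equal: ['env']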
2026 trends and how to future‑proof your strategy
As of 2026 several trends should shape your monitoring and failover design:
- Multi‑CDN and multi‑cloud by default: more teams use multiple CDNs or cloud providers for resilience; design probes and automation to operate across providers and consider governance models like community cloud co‑ops for cooperative resilience.
- Observability convergence: OpenTelemetry and distributed tracing are mainstream — add trace sampling to synthetic checks to speed root cause analysis and combine traces with an observability‑first risk lakehouse for real-time analysis.
- Synthetic as code: treat synthetic checks like unit tests in CI; version them and run them in CI pipelines. See practices in modular workflows and templates-as-code to help integrate synthetic tests into CI/CD.
- AI‑assisted Ops: incident detection and remediation are increasingly aided by AI; ensure playbooks are explicit and data‑driven before automating ML signals.
Practical checklist: what to implement this week
- Deploy blackbox_exporter in at least three locations and probe both proxied and origin hostnames.
- Create Prometheus rules that detect proxied-failure + origin-ok conditions and route them to an Alertmanager receiver named "provider-outage".
- Implement a protected webhook that can toggle DNS or CDN proxy mode; log all changes and require 2FA or human approval for destructive actions.
- Prepare an emergency ingress or bypass domain and a documented runbook to test switching traffic.
- Schedule quarterly failover drills and record the results to improve thresholds and reduce blast radius.
“Your origin being healthy isn't the same as being reachable. Detect the middleman.”
Actionable takeaways
- Detect provider faults: use multi‑site synthetic checks that compare proxied and direct origin endpoints.
- Automate safe switches: prefer DNS or proxy toggles with short TTLs and audit trails.
- Keep standby capacity: Proxmox templates, Kubernetes emergency ingress, or a secondary cloud for quick failover.
- Practice regularly: scheduled drills reveal false positives and fix runbook gaps before real incidents.
Final notes
Large CDN and cloud outages will keep happening. In 2026, the teams that win are those that instrument the edge-to-origin path and automate conservative, reversible actions when the middle layer fails. The recipes above are built to reduce noise, accelerate diagnosis, and keep your customers connected while you investigate the provider incident.
Call to action: Implement the blackbox probes and the Prometheus alert rules this week; run a dry‑run failover in a staging window, and iterate your thresholds. If you want a starter repo with Prometheus, blackbox, Alertmanager, and example failover scripts tuned to these recipes, check the monitoring templates in our community repo or contact us to run a 90‑minute workshop with your on‑call team.
Related Reading
- The Evolution of Cloud VPS in 2026: Micro‑Edge Instances for Latency‑Sensitive Apps
- How to Build an Incident Response Playbook for Cloud Recovery Teams (2026)
- Observability‑First Risk Lakehouse: Cost‑Aware Query Governance & Real‑Time Visualizations
- Future-Proofing Publishing Workflows: Modular Delivery & Templates-as-Code (useful for synthetic-as-code)