Protecting Self‑Hosted Services During Big Provider Outages: Monitoring & Alerting Cookbook

2026-02-02
9 min read

Concrete Prometheus + synthetic check recipes to detect Cloudflare/AWS outages and trigger safe failover for self‑hosted services.

Your service is up — until a CDN or cloud provider goes down

Large-scale outages at Cloudflare and AWS are no longer once-in-a-blue-moon headlines; late 2025 incidents taught operations teams a hard lesson: your self‑hosted apps can be unreachable even though your origin VM is healthy. If you rely on CDNs, DNS, or cloud-managed networking, you need targeted monitoring and automated failover playbooks that distinguish provider failures from real application faults.

This cookbook gives you concrete, production-ready recipes — Prometheus + blackbox probes, Grafana dashboards and synthetic checks, Alertmanager automation, and failover scripts for DNS/Kubernetes/Proxmox/systemd — tuned specifically to detect Cloudflare outages and AWS incidents and trigger safe, auditable failover steps.

Why this matters in 2026 (short)

Since late 2024 and through 2025 the industry consolidated around a few large CDNs and cloud providers. By 2026 multi‑CDN, edge compute, and multi‑cloud patterns are mature, but outages still happen — sometimes affecting DNS, TLS termination, or edge routing. That means your service can be functionally fine at the origin but totally unreachable to real users. The monitoring approach here focuses on detecting provider‑specific failure signals and performing controlled, reversible failover.

Detection strategy — principles

  • Synthetic checks from multiple vantage points are your primary signal — uptime consoles and internal metrics are insufficient when the provider sits between users and you.
  • Compare proxied vs origin: detect when the CDN/DNS path fails but the origin is healthy.
  • Multi‑probe consensus: require N of M probes and cross‑region checks to avoid false positives from single‑region ISP issues.
  • Automate safe failover: prefer reversible changes (DNS TTL, CNAME swap, ingress reconfiguration) over destructive actions.

Key signals to monitor

  • Probe HTTP(S) status and body from proxied domain (example.com).
  • Probe the direct origin hostname or IP (origin.example.internal or origin.example.com with Cloudflare disabled).
  • DNS resolution failures or unexpected CNAME changes from Route53/Cloudflare APIs.
  • Increased 5xx rate or latency from requests that go through AWS-managed services (S3, ALB, RDS).
  • Provider status API indicators (AWS Personal Health Dashboard, Cloudflare status) as secondary signals.
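
Provider status pages can be polled into a metric for that secondary signal. A minimal sketch in bash, assuming Cloudflare's Statuspage JSON endpoint (cloudflarestatus.com/api/v2/status.json) and a node_exporter textfile collector directory at /var/lib/node_exporter/textfile (both are assumptions to adapt):

#!/bin/bash
# Scrape the Cloudflare status indicator and expose it as a gauge for Prometheus.
set -euo pipefail

INDICATOR=$(curl -fsS https://www.cloudflarestatus.com/api/v2/status.json \
  | jq -r '.status.indicator')   # "none", "minor", "major" or "critical"

VALUE=0
if [ "$INDICATOR" != "none" ]; then
  VALUE=1                        # 1 = Cloudflare reports some level of degradation
fi

cat > /var/lib/node_exporter/textfile/cloudflare_status.prom <<EOF
# HELP cloudflare_status_degraded 1 if the Cloudflare status page reports an incident
# TYPE cloudflare_status_degraded gauge
cloudflare_status_degraded ${VALUE}
EOF

Keep this strictly as a secondary signal: status pages lag real incidents, so gate failover on your own probes and use this metric only for context in dashboards and alert annotations.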

Recipe 1: Prometheus + blackbox_exporter to detect Cloudflare outages

Architecture: run blackbox_exporter from at least three vantage points (home region, cloud provider A, cloud provider B) and probe both the proxied domain and the direct origin. Store metrics in Prometheus and configure alerting rules that fire only when the proxied endpoint fails across multiple probes but the direct origin stays healthy.

1) Deploy blackbox_exporter

Simple Docker example to run on a small VPS or a Kubernetes pod:

docker run --rm -p 9115:9115 --name blackbox \
  -v "$(pwd)/blackbox.yml:/etc/blackbox_exporter/config.yml" \
  prom/blackbox-exporter:latest

blackbox.yml (HTTP module):

modules:
  http_2xx:
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2"]
      method: GET
      tls_config:
        insecure_skip_verify: false
      preferred_ip_protocol: "ip4"
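
Before wiring it into Prometheus, you can hit the exporter's /probe endpoint directly to confirm the module behaves (example.com is the proxied hostname used throughout this post):

# Ask the exporter to probe the proxied domain with the http_2xx module;
# expect a line like "probe_success 1" in the output.
curl -s "http://localhost:9115/probe?target=https://example.com&module=http_2xx" | grep "probe_success"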

2) Prometheus scrape_config

scrape_configs:
  - job_name: "blackbox-probes"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - https://example.com      # proxied (Cloudflare)
        - https://origin.example.internal  # origin (bypass Cloudflare)
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
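
The scrape config above covers a single exporter. For the N-of-M consensus to work, each vantage point needs its own scrape job and a distinguishing label. One hedged pattern (the job name, probe_site value, and exporter hostname below are placeholders) is to repeat the job per exporter:

# Repeat one scrape job per vantage point, each pointing at that site's exporter.
scrape_configs:
  - job_name: "blackbox-probes-us-east"        # unique job name per vantage point
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - https://example.com
        - https://origin.example.internal
        labels:
          probe_site: us-east                  # tells vantage points apart in queries
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-us-east.example.net:9115   # that site's exporter

With one probe_success series per site and per URL, counting failing series literally counts failing vantage points, which is why the alert below matches jobs with job=~"blackbox-probes.*".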

3) PromQL alert rule: suspect Cloudflare outage

This rule fires only when the proxied endpoint is failing from at least two vantage points while at least one probe still reaches the origin. Note that the relabeling above writes the probed URL into the instance label (not a target label), and the job regex matches the per-site scrape jobs described earlier.

groups:
- name: provider-outage.rules
  rules:
  - alert: CloudflarePathFailure
    expr: |
      (
        count(probe_success{job=~"blackbox-probes.*", instance="https://example.com"} == 0)
        >= 2
      )
      and
      (
        count(probe_success{job=~"blackbox-probes.*", instance="https://origin.example.internal"} == 1)
        >= 1
      )
    for: 90s
    labels:
      severity: critical
    annotations:
      summary: "Possible Cloudflare outage affecting example.com"
      description: "Proxied endpoint failing across probes but origin is healthy. Investigate CDN/DNS path."

Why this works: if the proxied domain fails everywhere but origin responds, the problem is almost certainly at the CDN, DNS, or edge layer — not your app.

Recipe 2: Detect AWS service degradation that affects your stack

AWS incidents often surface as increased latencies, 5xx responses from S3/ALB, or failing DNS resolution for Route53. Use synthetic checks to target the cloud‑managed components your app depends on.

Example probes

  • GET a small object from your S3 bucket through the public URL (and via an S3 origin behind a CDN).
  • TCP connect to the RDS instance endpoint and run a lightweight probe (mysqladmin ping for MySQL, pg_isready for Postgres); a minimal blackbox tcp_connect module for the plain connectivity check is sketched after this list.
  • Watch ALB/ELB responses for unexpected 5xx status codes and increased HTTP/2 connection resets.
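
For the plain TCP reachability part of that check, a minimal blackbox_exporter module (added alongside http_2xx in blackbox.yml) looks like this; point a scrape job with module: [tcp_connect] at the RDS endpoint's host:port:

modules:
  tcp_connect:
    prober: tcp
    timeout: 5s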

Prometheus pattern

Use blackbox for HTTP/TCP checks and exporters (mysqld_exporter, postgres_exporter) for DB health. Alert when the cloud dependency is flaky while your on‑prem alternative is healthy:

- alert: AWS_S3_Degraded
  expr: |
    (
      count(probe_success{instance="https://s3.amazonaws.com/your-bucket/health-check"} == 0) >= 1
    )
    and
    (
      count(probe_success{instance="https://your-secondary-storage.example/health"} == 1) >= 1
    )
  for: 2m

Actionability: on this alert you should fail over read traffic to your MinIO replica or switch object reads to a cached tier, not reboot your app servers.
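
What that switch looks like depends on your stack. A minimal sketch in bash, assuming a MinIO replica at minio.internal.example and an application that reads its object-store endpoint from an env file (the endpoint, bucket name, env-file path, and service name are all placeholders):

#!/bin/bash
# Verify the MinIO replica answers S3-style reads, then flip the endpoint the app uses.
set -euo pipefail

MINIO_ENDPOINT="https://minio.internal.example:9000"

# MinIO speaks the S3 API, so the AWS CLI works against it with --endpoint-url
# (credentials for the replica must already be configured).
aws --endpoint-url "$MINIO_ENDPOINT" s3 cp "s3://your-bucket/health-check" - >/dev/null

# Replica is healthy: point the application's object reads at it and reload.
echo "OBJECT_STORE_ENDPOINT=${MINIO_ENDPOINT}" > /etc/myapp/object-store.env
systemctl reload myapp.service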

Failover automation — safe, reversible actions

Automated failover must be reversible, auditable, and minimally disruptive. Common safe actions:

  • DNS swap (low TTL): change CNAME/A records to point to a bypass domain or secondary provider via API (a Route53 sketch follows this list).
  • Toggle CDN proxy: Cloudflare lets you switch proxy off for a hostname via API, exposing the origin directly.
  • Ingress switch in Kubernetes: update Ingress or change ExternalDNS pointers to a different cluster.
  • Start standby VM in Proxmox and update DNS to the new IP.
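
For the DNS swap, a hedged Route53 example (the hosted zone ID and hostnames are placeholders; the Cloudflare equivalent follows below) upserts a short-TTL CNAME pointing at the bypass target:

#!/bin/bash
# Repoint www.example.com at the bypass host with a 60s TTL. The change is
# reversible: run the same UPSERT with the original target to restore.
set -euo pipefail
ZONE_ID="ZXXXXXXXXXXXXX"   # placeholder hosted zone ID

aws route53 change-resource-record-sets \
  --hosted-zone-id "$ZONE_ID" \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "bypass.example.com"}]
      }
    }]
  }'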

Alertmanager webhook + failover script

Configure Alertmanager to call a webhook that performs the DNS swap via your DNS provider API. Here's an example Alertmanager receiver and a minimal bash script to toggle a Cloudflare hostname proxy off (as a bypass) or swap a CNAME to a secondary domain.

# alertmanager.yml (snippet)
receivers:
  - name: 'dns-failover'
    webhook_configs:
      - url: 'https://monitoring.example.com/alert-hook'

# /opt/bin/cf-bypass.sh
#!/bin/bash
set -euo pipefail
CF_ZONE_ID=""                 # Cloudflare zone ID (from the dashboard or API)
CF_RECORD_ID=""               # DNS record ID of the proxied hostname
CF_TOKEN="${CF_API_TOKEN}"    # scoped API token with DNS edit permission
MODE="$1"                     # "bypass" (proxied=false) or "restore" (proxied=true)

if [ "$MODE" = "bypass" ]; then
  curl -s -X PATCH "https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/dns_records/${CF_RECORD_ID}" \
    -H "Authorization: Bearer ${CF_TOKEN}" \
    -H "Content-Type: application/json" \
    --data '{"proxied":false}'
  echo "Cloudflare bypass requested"
else
  curl -s -X PATCH "https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/dns_records/${CF_RECORD_ID}" \
    -H "Authorization: Bearer ${CF_TOKEN}" \
    -H "Content-Type: application/json" \
    --data '{"proxied":true}'
  echo "Cloudflare restore requested"
fi
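
To make sure only these provider-outage alerts reach the webhook, and that a firing alert does not re-invoke the hook every few minutes, pair the receiver with a dedicated route. A minimal sketch, assuming the alert names from the recipes above and a 'default' receiver defined elsewhere:

# alertmanager.yml (snippet)
route:
  receiver: 'default'                    # normal notification path
  routes:
    - matchers:
        - alertname =~ "CloudflarePathFailure|AWS_S3_Degraded"
      receiver: 'dns-failover'
      group_wait: 30s                    # let flapping probes settle before the hook fires
      repeat_interval: 30m               # avoid re-triggering the failover webhook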

Important: protect this webhook with mTLS or HMAC verification, require human confirmation for irreversible operations, and log every change to an audit trail (Syslog/ELK or a ticketing system). Consider integrating device identity and approval workflows for stronger verification and sign-off.

Kubernetes recipes: Emergency ingress and ExternalDNS

In Kubernetes environments you can prepare an emergency Ingress (or second cluster) that exposes the app without the CDN. Use ExternalDNS or an automation job to flip DNS records when a probe determines the CDN path is down.

Emergency Ingress manifest (example)

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: emergency-ingress
  namespace: emergency
spec:
  ingressClassName: nginx
  rules:
  - host: bypass.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-service
            port:
              number: 80

Workflow: on a CloudflarePathFailure alert, the webhook creates/updates this ingress and ExternalDNS maps bypass.example.com into DNS. Clients that resolve bypass.example.com will hit the cluster directly (bypassing Cloudflare). For small-scale or edge-hosted clusters, see guidance on edge field kits and micro-edge VPS patterns.
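
A hedged sketch of what that webhook job might run (the manifest path, namespace, and TTL value are assumptions for illustration):

#!/bin/bash
# Apply the emergency ingress and give ExternalDNS a short TTL so the bypass
# record propagates quickly and can be reverted just as quickly.
set -euo pipefail

kubectl apply -f /opt/failover/emergency-ingress.yaml

kubectl -n emergency annotate ingress emergency-ingress \
  external-dns.alpha.kubernetes.io/ttl="60" --overwrite

# Wait (up to ~2 minutes) for the ingress to receive an address before
# declaring the bypass path ready.
for i in $(seq 1 24); do
  ADDR=$(kubectl -n emergency get ingress emergency-ingress \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null || true)
  [ -n "$ADDR" ] && break
  sleep 5
done
echo "emergency ingress address: ${ADDR:-<none>}"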

Proxmox & systemd: on‑prem failover playbook

If you host VMs in Proxmox, maintain a preseeded VM template for critical services and a script that clones + starts a standby VM. Tie the cloning script to Alertmanager via webhook. For simple setups use systemd to run the cloning and DNS update operations.

[Unit]
Description=Start standby VM on provider outage
After=network-online.target

[Service]
Type=oneshot
ExecStart=/opt/bin/proxmox-start-standby.sh

[Install]
WantedBy=multi-user.target

Make sure the script performs health checks after boot and reports back to Prometheus (pushgateway or node_exporter metric) so you can revert DNS only when the standby is ready. For runbooks combining cloud and on‑prem recovery, consult an incident response playbook for cloud recovery teams to design safe automation and human checkpoints.
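
A hedged sketch of /opt/bin/proxmox-start-standby.sh, assuming a preseeded template with VM ID 9000, a standby VM ID of 210, a health endpoint on the standby, and a Pushgateway at pushgateway:9091 (all placeholders):

#!/bin/bash
# Clone the preseeded template, start it, then report readiness to Prometheus via
# the Pushgateway so DNS is only switched once the standby actually answers.
set -euo pipefail

TEMPLATE_ID=9000            # preseeded Proxmox template (placeholder)
STANDBY_ID=210              # VM ID for the standby clone (placeholder)
STANDBY_IP="192.0.2.10"     # documentation-range IP; replace with the real address

qm clone "$TEMPLATE_ID" "$STANDBY_ID" --name web-standby --full
qm start "$STANDBY_ID"

# Poll the app's health endpoint on the standby before reporting it ready.
READY=0
for i in $(seq 1 30); do
  if curl -fsS "http://${STANDBY_IP}/healthz" >/dev/null; then
    READY=1
    break
  fi
  sleep 10
done

# Push a gauge that Prometheus can alert on and gate the DNS revert with.
cat <<EOF | curl -fsS --data-binary @- "http://pushgateway:9091/metrics/job/standby_failover/instance/web-standby"
# TYPE standby_ready gauge
standby_ready ${READY}
EOF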

Testing and drills — don’t wait for a real outage

Practice is everything. Schedule quarterly failover drills and automate them so your team becomes confident with the playbooks. Test both detection and remediation steps:

  • Simulate CDN failure by adding a blackhole firewall rule on a probe machine, for example using iptables to block Cloudflare ranges (a sketch follows this list).
  • Run synthetic checks that intentionally fail to verify alert routing and webhook handling.
  • Use chaos testing to randomize probe locations and latency to ensure your multi‑probe consensus logic stands up to noisy networks.
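
A hedged example of the firewall-based drill, run on a probe machine (not the origin); it assumes Cloudflare's published IPv4 range list at www.cloudflare.com/ips-v4 and plain iptables:

#!/bin/bash
# Blackhole Cloudflare's IPv4 ranges on this probe host so the proxied path fails
# while the direct origin path keeps working. Re-run with -D instead of -I to undo.
set -euo pipefail

for RANGE in $(curl -fsS https://www.cloudflare.com/ips-v4); do
  iptables -I OUTPUT -d "$RANGE" -j DROP
done

echo "Cloudflare ranges blackholed on this probe host"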

Tuning alerts and avoiding noisy failover

Key tuning knobs:

  • Consensus size: require multiple probe failures across different regions before triggering failover.
  • Grace windows: hold alerts for 60–180s to avoid reacting to transient blips.
  • Error budgets: align automatic failover to SLO thresholds — if you’re within your error budget, prefer manual escalation.
  • Escalation flows: use Alertmanager routes and silence windows to coordinate human approval for non‑trivial actions.

As of 2026 several trends should shape your monitoring and failover design:

  • Multi‑CDN and multi‑cloud by default: more teams use multiple CDNs or cloud providers for resilience; design probes and automation to operate across providers and consider governance models like community cloud co‑ops for cooperative resilience.
  • Observability convergence: OpenTelemetry and distributed tracing are mainstream — add trace sampling to synthetic checks to speed root cause analysis and combine traces with an observability‑first risk lakehouse for real-time analysis.
  • Synthetic as code: treat synthetic checks like unit tests in CI; version them and run them in CI pipelines. See practices in modular workflows and templates-as-code to help integrate synthetic tests into CI/CD.
  • AI‑assisted Ops: incident detection and remediation are increasingly aided by AI; ensure playbooks are explicit and data‑driven before automating ML signals.

Practical checklist: what to implement this week

  1. Deploy blackbox_exporter in at least three locations and probe both proxied and origin hostnames.
  2. Create Prometheus rules that detect proxied-failure + origin-ok conditions and route them to an Alertmanager receiver named "provider-outage".
  3. Implement a protected webhook that can toggle DNS or CDN proxy mode; log all changes and require 2FA or human approval for destructive actions.
  4. Prepare an emergency ingress or bypass domain and a documented runbook to test switching traffic.
  5. Schedule quarterly failover drills and record the results to improve thresholds and reduce blast radius.

“Your origin being healthy isn't the same as being reachable. Detect the middleman.”

Actionable takeaways

  • Detect provider faults: use multi‑site synthetic checks that compare proxied and direct origin endpoints.
  • Automate safe switches: prefer DNS or proxy toggles with short TTLs and audit trails.
  • Keep standby capacity: Proxmox templates, Kubernetes emergency ingress, or a secondary cloud for quick failover.
  • Practice regularly: scheduled drills reveal false positives and fix runbook gaps before real incidents.

Final notes

Large CDN and cloud outages will keep happening. In 2026, the teams that win are those that instrument the edge-to-origin path and automate conservative, reversible actions when the middle layer fails. The recipes above are built to reduce noise, accelerate diagnosis, and keep your customers connected while you investigate the provider incident.

Call to action: Implement the blackbox probes and the Prometheus alert rules this week; run a dry‑run failover in a staging window, and iterate your thresholds. If you want a starter repo with Prometheus, blackbox, Alertmanager, and example failover scripts tuned to these recipes, check the monitoring templates in our community repo or contact us to run a 90‑minute workshop with your on‑call team.


Related Topics

#Monitoring#Incident Response#Cloud

selfhosting

Contributor

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.
