Detecting Provider Outages from Your Side: Create Synthetic Tests That Mirror Real Users

Design synthetic tests that mimic login, API, and CDN user journeys to detect Cloudflare/AWS outages fast and trigger safe failover.

When Cloudflare or AWS fails, your users don’t care about provider blame — they care that your app stops working. Build synthetic checks that mirror real user journeys (login, API calls, CDN fetches) so you detect provider outages fast and switch to backups automatically.

Executive summary: In 2026, multi-cloud and edge adoption has grown, but so has the blast radius when core providers suffer partial outages (see the Cloudflare/AWS incidents of late 2025–early 2026). The best defense is synthetic monitoring written as code: tests that reproduce real user journeys, expose metrics, fire alerts, and trigger safe failover runbooks. This article walks step by step through design principles, implementation patterns (Docker, Kubernetes, Proxmox, systemd), observability integrations (Prometheus, Grafana, Alertmanager), and automated failover examples (Cloudflare API, Route53) to minimize user impact.

Why synthetic monitoring matters in 2026

Real User Monitoring (RUM) shows what users experience, but RUM only reports after users are impacted. Synthetic monitoring lets you detect provider degradations before a user reports an incident and — crucially — lets you validate failover paths. Since late 2025 and into early 2026, high-profile outages involving CDNs and major cloud providers repeatedly produced partial failures: TLS handshake errors, 5xx spikes, and DNS anomalies. These incidents reinforced a core truth for DevOps teams: observability plus proactive synthetic tests equals faster MTTD (mean time to detect) and MTTM (mean time to mitigate).

  • Edge and CDN failure modes: 520/521/525 (Cloudflare), TLS negotiation errors, and cache fill delays.
  • Regional cloud issues: S3 503s, API Gateway throttling, and cross‑region DNS delays in AWS.
  • Multi‑cloud and hybrid topologies: origin servers exposed through provider networks — need tests that can hit both the proxied path and direct origin.
  • Shift to synthetic monitoring as code and GitOps: tests live in repos, reviewed like infrastructure.

Design principles: make tests mirror real users

Not all health checks are equal. A TCP probe, or a bare HTTP check that returns 200 OK, doesn’t prove a login flow or CDN fetch works. Use these principles when designing synthetic checks:

  • Journey fidelity: Model the path users take. If your app requires authentication then loads API calls and assets from CDN, your synthetic test should authenticate, call the API, and fetch a CDN asset.
  • Dual-path checks: Probe both via the provider (e.g., Cloudflare) and direct to origin. This isolates whether the provider layer is the failure point.
  • Observable outputs: Emit structured metrics, traces, and HAR snapshots — not just pass/fail. Integrate these outputs with your observability stack.
  • Idempotent and safe: Tests should not create side effects. Use read-only or test accounts with scoped tokens (or an authorization service like NebulaAuth for club-style deployments).
  • Fast failure and backoff: Keep timeouts tight (2–8s depending on operation), with exponential backoff on retries to avoid congesting the system during provider incidents.
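To make the last point concrete, here is a minimal curl-based probe with a tight per-attempt timeout and exponential backoff between retries. It is only a sketch: the URL, attempt count, and timeout values are placeholders to tune per check.

#!/usr/bin/env bash
# probe-with-backoff.sh -- sketch: tight per-attempt timeout, exponential backoff between retries.
# URL, MAX_ATTEMPTS, TIMEOUT, and DELAY are placeholders; tune them per check.
URL="https://api.example.com/healthz"
MAX_ATTEMPTS=3
TIMEOUT=5          # seconds per attempt (keep tight: 2-8s)
DELAY=2            # initial backoff in seconds

for attempt in $(seq 1 "$MAX_ATTEMPTS"); do
  if curl -fsS -m "$TIMEOUT" "$URL" -o /dev/null; then
    echo "CHECK_OK attempt=$attempt"
    exit 0
  fi
  echo "CHECK_RETRY attempt=$attempt sleeping=${DELAY}s" >&2
  sleep "$DELAY"
  DELAY=$((DELAY * 2))   # back off exponentially to avoid hammering a degraded provider
done
echo "CHECK_FAIL after $MAX_ATTEMPTS attempts" >&2
exit 2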

Which user journeys to prioritize

Start with the journeys that — if they break — have critical business impact. Typically that list looks like this:

  1. Authentication flow: Login page + token exchange. Detect SSO, OIDC, or cookie failures.
  2. API call with business logic: Authenticated API endpoint that drives the UI (e.g., /v1/user/dashboard).
  3. CDN asset fetch: Important JS/CSS/image fetched through the CDN that can break page rendering.
  4. Background jobs / webhooks: 3rd party webhooks or outgoing API dependencies (e.g., payments) that can hang if upstream is down.

Example failure modes each journey reveals

  • Login: OIDC provider unreachable, cookie domain misconfiguration, TLS error via CDN.
  • API: Cloudflare WAF blocking legitimate requests, or an AWS ALB returning 502 while the origin itself is healthy because the proxy layer in front of it is failing.
  • CDN: Cache miss storms, origin fetch timeouts, or provider edge misconfiguration.
  • Webhooks: Upstream rate limiting or DNS failures for third‑party endpoints.

Implementation patterns

Below are practical ways to run synthetics in different environments. Pick one (or more) and keep tests in Git, run them from at least two locations, and feed results into your observability pipeline.

Lightweight: systemd timer on a Proxmox VM

For on‑prem teams using Proxmox, run a small VM (or two for geo redundancy) that executes checks with a systemd timer. This is simple, resilient, and keeps control in your network.

# synthetic-run.service
[Unit]
Description=Synthetic checks (login, api, cdn)

[Service]
Type=oneshot
ExecStart=/usr/local/bin/synthetic-run.sh

[Install]
WantedBy=multi-user.target

# synthetic-run.timer
[Unit]
Description=Run synthetic checks every minute

[Timer]
OnBootSec=30s
OnUnitActiveSec=60s

[Install]
WantedBy=timers.target

Inside /usr/local/bin/synthetic-run.sh, call small Node/k6/curl scripts and write Prometheus metrics to the node_exporter textfile collector, or push them to a Pushgateway.
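Here is a minimal sketch of what that wrapper could look like, assuming the node_exporter textfile collector reads /var/lib/node_exporter/textfile_collector; the individual check scripts under /opt/synthetics/ are placeholders for your own login/API/CDN checks, and the metrics are written as simple gauge-style last-run values.

#!/usr/bin/env bash
# /usr/local/bin/synthetic-run.sh -- sketch: run each check, write last-run results
# for the node_exporter textfile collector. Paths and check scripts are placeholders.
set -u
TEXTFILE_DIR=/var/lib/node_exporter/textfile_collector
TMP="$(mktemp "$TEXTFILE_DIR/.synthetics.XXXXXX")"   # same directory so the final mv is atomic

run_check() {
  local name="$1"; shift
  local start end ok
  start=$(date +%s%N)
  if "$@" >/dev/null 2>&1; then ok=1; else ok=0; fi
  end=$(date +%s%N)
  echo "synthetic_check_success{check=\"$name\"} $ok" >> "$TMP"
  echo "synthetic_check_duration_seconds{check=\"$name\"} $(awk -v ns=$((end - start)) 'BEGIN{printf "%.3f", ns/1e9}')" >> "$TMP"
}

run_check login node /opt/synthetics/login-check.js
run_check api /opt/synthetics/api-check.sh
run_check cdn /opt/synthetics/cdn-check.sh

mv "$TMP" "$TEXTFILE_DIR/synthetics.prom"   # atomic rename so the collector never reads a partial file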

Containerized: Docker + Docker Compose

Quick to iterate and ideal for local testing. Build one image containing your test runner (Playwright for full browser flows, or k6 for API loads). Use a schedule in the container (cron) or run as a sidecar scheduler.

version: '3.8'
services:
  synthetics:
    image: registry.example.com/synthetics:latest
    volumes:
      - ./scripts:/opt/scripts
    environment:
      - PROM_PUSHGATEWAY=http://pushgateway:9091
    restart: always
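If you prefer a plain loop over cron inside the container, the entrypoint can simply run the checks and sleep; a minimal sketch, where the script path and interval variable are placeholders:

#!/usr/bin/env bash
# entrypoint.sh -- sketch of a simple in-container scheduler: run checks, sleep, repeat.
INTERVAL="${CHECK_INTERVAL_SECONDS:-60}"
while true; do
  /opt/scripts/run-all-checks.sh || echo "synthetic run failed" >&2
  sleep "$INTERVAL"
done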

Cloud/native: Kubernetes CronJob + ServiceMonitor

Run synthetics as a Kubernetes CronJob or Deployment that continuously probes and exposes /metrics for Prometheus. Keep the check frequency aligned with your SLOs — 30–60s for critical checks (a CronJob's schedule granularity is one minute, so use a long-running Deployment for sub-minute intervals) and 5m for less critical ones.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: synthetic-login
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: synthetics
            image: ghcr.io/yourorg/synthetics:latest
            args: ["/opt/run-login-check.sh"]
            env:
            - name: PROMETHEUS_PUSHGATEWAY
              value: "http://pushgateway.monitoring.svc:9091"
          restartPolicy: OnFailure

Include a ServiceMonitor (Prometheus Operator) or pushgateway so Prometheus records results and you can create AlertingRules.
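If you go the long-running Deployment route, a minimal ServiceMonitor sketch might look like the following; the name, namespace, labels, and port name are assumptions and must match the Service that fronts your synthetics pods.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: synthetics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: synthetics          # label on the Service exposing the synthetics /metrics endpoint
  endpoints:
  - port: metrics              # named port on that Service
    interval: 30s
    path: /metrics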

Test examples

Below are concise, copyable examples of the three core checks: login, API call, and CDN asset fetch.

Login flow (Playwright, Node)

// login-check.js (Playwright)
const { chromium } = require('playwright');
(async () => {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  page.setDefaultTimeout(10000); // newPage() takes context options, so set the default timeout here
  let exitCode = 0;
  try {
    await page.goto('https://www.example.com/login');
    await page.fill('input[name=email]', process.env.SYNTH_EMAIL);
    await page.fill('input[name=password]', process.env.SYNTH_PWD);
    await page.click('button[type=submit]');
    await page.waitForSelector('#dashboard', { timeout: 8000 });
    console.log('LOGIN_OK');
  } catch (e) {
    console.error('LOGIN_FAIL', e.message);
    exitCode = 2;
  } finally {
    await browser.close(); // always close the browser so the runner doesn't leak processes
  }
  process.exit(exitCode);
})();

API call (k6 or curl)

# simple curl token call
curl -sS -X GET "https://api.example.com/v1/user/dashboard" \
  -H "Authorization: Bearer $SYNTH_TOKEN" \
  -o /tmp/synth_body.json || exit 2
jq -e '.dashboard' /tmp/synth_body.json >/dev/null || exit 3

// k6 snippet
import http from 'k6/http';
import { check } from 'k6';
export default function () {
  const res = http.get('https://api.example.com/v1/user/dashboard', {
    headers: { Authorization: `Bearer ${__ENV.SYNTH_TOKEN}` }
  });
  check(res, { 'status 200': r => r.status === 200 });
}

CDN fetch (curl with Host header; test edge vs origin)

Two checks: one via the CDN hostname (normal user path) and one direct to the origin IP (bypass provider). This isolates whether the CDN/edge is broken.

# via CDN (User path)
curl -sSI -m 8 -H "Host: www.example.com" https://cdn.example.com/path/to/critical.js | head -n 1

# direct origin (bypass Cloudflare)
curl -sSI -m 8 --resolve www.example.com:443:10.0.0.5 https://www.example.com/path/to/critical.js | head -n 1

Interpretation: if CDN fetch fails but origin succeeds, the provider layer or CDN config is likely the cause.
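A small wrapper can encode that interpretation directly, so the check itself reports which layer is the likely culprit. A sketch, with the origin IP and URLs as placeholders:

#!/usr/bin/env bash
# cdn-vs-origin.sh -- sketch: probe both paths and classify the likely failure domain.
CDN_URL="https://cdn.example.com/path/to/critical.js"
ORIGIN_IP="10.0.0.5"
ORIGIN_URL="https://www.example.com/path/to/critical.js"

cdn_ok=0; origin_ok=0
curl -fsSI -m 8 -H "Host: www.example.com" "$CDN_URL" >/dev/null && cdn_ok=1
curl -fsSI -m 8 --resolve "www.example.com:443:$ORIGIN_IP" "$ORIGIN_URL" >/dev/null && origin_ok=1

if [ "$cdn_ok" -eq 1 ] && [ "$origin_ok" -eq 1 ]; then
  echo "OK: both paths healthy"; exit 0
elif [ "$cdn_ok" -eq 0 ] && [ "$origin_ok" -eq 1 ]; then
  echo "PROVIDER_SUSPECT: CDN path failing, origin healthy"; exit 2   # candidate for edge failover runbook
elif [ "$cdn_ok" -eq 1 ] && [ "$origin_ok" -eq 0 ]; then
  echo "ORIGIN_SUSPECT: origin failing, edge still serving (possibly from cache)"; exit 3
else
  echo "BOTH_FAILING: CDN and origin checks failed"; exit 4
fi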

Exposing observability: metrics, logs, and traces

Every synthetic run should emit at least these metrics in Prometheus format:

  • synthetic_check_duration_seconds{check="login"} — histogram of durations
  • synthetic_check_success_total{check="login"} — counter of successes
  • synthetic_check_failure_total{check="login", reason="timeout|http5xx|tls"}

# /metrics example
synthetic_check_duration_seconds_bucket{check="login",le="0.1"} 0
synthetic_check_duration_seconds_bucket{check="login",le="1"} 1
synthetic_check_duration_seconds_bucket{check="login",le="+Inf"} 1
synthetic_check_duration_seconds_sum{check="login"} 0.36
synthetic_check_duration_seconds_count{check="login"} 1
synthetic_check_success_total{check="login"} 372
synthetic_check_failure_total{check="login",reason="timeout"} 3
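If you push rather than scrape, each run can send its results to the Pushgateway referenced in the Compose and CronJob examples above. A sketch using gauge-style last-run values; the job and instance grouping labels are up to you:

# push-result.sh -- sketch: push one run's results to a Prometheus Pushgateway.
PUSHGATEWAY="${PROM_PUSHGATEWAY:-http://pushgateway:9091}"
cat <<EOF | curl -sS --data-binary @- "$PUSHGATEWAY/metrics/job/synthetics/instance/$(hostname)"
synthetic_check_success{check="login"} 1
synthetic_check_duration_seconds{check="login"} 0.36
EOF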

Send traces (OpenTelemetry) for full journey visibility and capture a HAR for failed browser flows to aid debugging. If you run models or decision logic as part of automation, consider secure deployments described in compliant LLM infra.

Alerting rules and SLOs

Define small, meaningful SLOs for synthetic tests. Example: 99.9% login success over 30 days, with an alert when two or more failures occur within a 3-minute window (roughly three consecutive checks at a 1-minute cadence). Keep noise low.

groups:
- name: synthetics.rules
  rules:
  - alert: SyntheticLoginFailure
    expr: increase(synthetic_check_failure_total{check="login"}[3m]) >= 2
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Synthetic login failures detected"
      runbook: "https://git.example.com/runbooks/synthetic-login"
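To track the 30-day target itself rather than only the short-window alert, you can add a recording rule over the same counters. A sketch: the rule name is arbitrary, and a 30d range query over counters can be heavy, so some teams compute this in Grafana instead.

groups:
- name: synthetics.slo
  rules:
  - record: synthetic:login_success_ratio_30d
    expr: |
      sum(increase(synthetic_check_success_total{check="login"}[30d]))
      /
      (sum(increase(synthetic_check_success_total{check="login"}[30d]))
       + sum(increase(synthetic_check_failure_total{check="login"}[30d])))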

Runbooks and automated mitigation

Every alert must have a clear runbook. Keep runbooks in the repo so they are versioned. A runbook should describe:

  • What the alert means
  • Quick checks to confirm (logs, /metrics, direct origin tests)
  • Automated mitigations you can safely run
  • Escalation steps and rollback instructions

Example runbook: CDN edge failure detected

  1. Confirm: check synthetic results — CDN fetch failures while direct origin checks succeed. Run:
# check via CDN and origin
curl -sSI -m 8 -H "Host: www.example.com" https://cdn.example.com/path | head -n 1
curl -sSI -m 8 --resolve www.example.com:443:10.0.0.5 https://www.example.com/path | head -n 1
  2. Failover option A — unproxy DNS (Cloudflare): switch proxied=false so DNS points to the origin directly. Use the script below.
  3. Failover option B — if you have a backup origin or a different CDN, switch weighted DNS or update the ALIAS to point to the backup.
  4. Verify by re-running synthetics and confirming success, then notify stakeholders and monitor for regressions before switching back.

Automated Cloudflare toggle (example)

Use the Cloudflare API to set a DNS record’s proxied flag to false. This removes the Cloudflare edge and points traffic directly to the origin IP (good for triage). Keep API tokens in a secret store.

# unproxy example (bash)
CF_ZONE_ID=xxxx
CF_RECORD_ID=yyyy
CF_API_TOKEN=${CF_TOKEN}

curl -s -X PATCH "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records/$CF_RECORD_ID" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"proxied":false}'

Prefer scripted, auditable changes with a safety mechanism (require manual approval if above a threshold). For fully automated failover, gate the action behind policy and rate limits to avoid flip‑flopping in unstable networks — consider patterns from the autonomous agents literature when you design gating.
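One simple way to enforce such a cooldown is a timestamp file checked at the top of the failover script; a minimal sketch, where the state path and window are placeholders:

# cooldown-gate.sh -- sketch: refuse to run an automated failover within the cooldown window.
STATE_FILE=/var/run/synthetics/last_failover_epoch
COOLDOWN_SECONDS=900   # 15 minutes between automated failovers

now=$(date +%s)
last=$(cat "$STATE_FILE" 2>/dev/null || echo 0)
if [ $((now - last)) -lt "$COOLDOWN_SECONDS" ]; then
  echo "Cooldown active; refusing automated failover (last at epoch $last)" >&2
  exit 1
fi
mkdir -p "$(dirname "$STATE_FILE")"
echo "$now" > "$STATE_FILE"
# ...proceed with the Cloudflare/Route53 change here...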

AWS Route53 weighted failover example

# shift traffic to the backup record (AWS CLI)
# Note: with weighted routing, also set the primary record's weight to 0
# (or remove it) to move 100% of traffic to the backup.
aws route53 change-resource-record-sets --hosted-zone-id ZONEID --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "api.example.com",
      "Type": "A",
      "SetIdentifier": "backup",
      "Weight": 100,
      "TTL": 60,
      "ResourceRecords": [{"Value": "10.0.1.5"}]
    }
  }]
}'
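change-resource-record-sets returns a change ID you can poll until the record set reaches INSYNC before re-running synthetics. A sketch, assuming the change batch has been saved to a local file (the filename is a placeholder):

# apply the change, then wait for Route53 to report it as INSYNC
CHANGE_ID=$(aws route53 change-resource-record-sets \
  --hosted-zone-id ZONEID \
  --change-batch file://failover-change.json \
  --query 'ChangeInfo.Id' --output text)

until [ "$(aws route53 get-change --id "$CHANGE_ID" --query 'ChangeInfo.Status' --output text)" = "INSYNC" ]; do
  echo "waiting for Route53 change $CHANGE_ID to propagate..."
  sleep 10
done
echo "Route53 change INSYNC; re-run synthetics to verify the failover"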

Safe automation patterns

Automate, but safely. Use these controls:

  • Approval gates: require human approval for cross-provider DNS changes in production unless pre‑approved by policy. If you use any automated actors, follow the guidance in autonomous agent runbooks to avoid unwanted actions.
  • Rate limits and cooldowns: prevent flapping by enforcing cooldowns between automated failovers.
  • Canarying: route a small percentage first and monitor metrics before full switch.
  • Audit logs: store actions and reasons in Git or an immutable log for post-incident review. Keep sensitive tokens in a secret manager and rotate them; if you integrate with auth services, see solutions like NebulaAuth for patterns in auth delegation.

Case study (condensed): Detecting a partial Cloudflare outage

Imagine: your synthetic system starts reporting CDN 520s while direct origin checks succeed. Alert triggers a runbook that automatically unproxies the DNS record for the affected subdomain during business hours after a 2‑minute confirmation and a manual one‑click approval from the on‑call. Result: user traffic bypasses the failing edge, login and API flows resume quickly, engineering investigates the Cloudflare incident without customer‑facing downtime.

This model was put into practice across teams in 2025–2026 when spikes of edge/TLS failures caused partial breakages. Teams that had journey-based synthetics and automated but auditable failover were able to reduce user-impact MTTM from tens of minutes to under five.

Security and compliance considerations

  • Store synthetic credentials in Vault or cloud secret managers and rotate regularly.
  • Scope synthetic tokens: give the minimum permission necessary for checks.
  • Encrypt communications for synthetic agents outside your network; use mTLS where possible.
  • Keep HAR and trace artifacts containing PII in secure storage and purge on retention policy.

Operational checklist to get started (prioritized)

  1. Identify top 3 user journeys and map their provider dependencies (CDN, WAF, cloud region).
  2. Implement synthetics for those journeys in code and store in Git.
  3. Run tests from at least 2 geographically separated locations (on‑prem + cloud or two cloud regions).
  4. Expose /metrics and integrate with Prometheus + Grafana + Alertmanager.
  5. Create runbooks with automated mitigations and safety controls (approval, cooldowns).
  6. Practice the runbook in a game day exercise and iterate.

Advanced strategies and future predictions (2026+)

As of 2026, observability is moving toward synthetic + RUM fusion: combining synthetic tests with RUM sampling to validate that your synthetic tests match real user experience. Expect to see more teams adopt:

  • Synthetic as Code + GitOps — tests in PRs, CI gating, and versioned runbooks.
  • Edge-aware testing — synthetic agents deployed in multiple CDN POPs or lightweight edge VMs to reproduce provider-dependent failures. See reviews of affordable edge bundles for small-footprint agent options.
  • Adaptive failover — ML-assisted detection that suggests the minimal failover change (e.g., unproxy a single subdomain) based on past incidents and risk scores (autonomous agents and decision systems are emerging here).
  • eBPF-based observability — deeper network layer telemetry complements synthetics to detect TLS/traceroute problems faster.

Final takeaways

  • Design synthetic checks to mirror real user journeys — login, API, and CDN fetches reveal different failure surfaces.
  • Run dual-path checks (via provider + direct origin) to quickly isolate provider versus origin problems.
  • Integrate synthetics with Prometheus/Grafana, store tests in Git, and codify runbooks and safe automation for failover.
  • Practice. Game days and tabletop exercises expose gaps in automation and runbooks before real incidents.
“Detect fast, switch safely, and verify quickly.”

A useful starter repo for this setup includes: Playwright login checks, k6 API scripts, a Kubernetes CronJob example, a Prometheus metrics exporter, and a Cloudflare failover script (a template with Vault integration), tailored to Docker, Kubernetes, or Proxmox.

Call to action

Start today: pick one critical user journey, write a synthetic test that performs the full path (auth -> API -> CDN), run it from two locations, and hook it into Prometheus. If you want the starter repo or a runbook template for your topology (Docker, Kubernetes, or Proxmox), click through to download or request a tailored example — equip your team to detect Cloudflare/AWS outages before your users do.
