Horizon

Shen Nan4 min read
monitoringdevopsdashboardalerting
<!-- TODO: Add screenshot/video of Horizon — show the health dashboard with service status cards, a Slack alert firing, and the historical uptime chart -->

Why I built this

The operational stack spanned DNS, SSL certificates, email deliverability, Airtable automations, Pipedream workflows, ActiveCampaign, and Google Search Console. Failures in any layer were invisible until they caused outages — an expired SSL cert, a broken Pipedream flow, an email deliverability drop. There was no unified health view and no proactive alerting.

What it does

Horizon runs scheduled health checks against every service endpoint and API, normalises results into a common status schema, and aggregates them into a single dashboard. When any check breaches its threshold, a Slack alert fires immediately — surfacing failures before they become incidents.

Health checks

Check types

Each service has a tailored check:

ServiceCheck methodThreshold
DNSdig query against authoritative nameserversRecord mismatch or SERVFAIL
SSLCertificate expiry check via TLS handshake< 14 days to expiry
Email (SPF/DKIM/DMARC)DNS TXT record validationMissing or misconfigured record
Email deliverabilityPostmark reputation APIScore < 95
AirtableAPI ping + schema validation> 2s response or schema drift
PipedreamWorkflow execution status via APIAny failed execution in last 24h
ActiveCampaignAPI health + automation statusAny paused automation
Google Search ConsoleCoverage report via APIIndexed pages dropping > 10%

Scheduling

Checks run on staggered cron schedules based on criticality:

  • Every 5 minutes: SSL, DNS (outages are immediately visible to users)
  • Every hour: email deliverability, Airtable, Pipedream
  • Every 6 hours: ActiveCampaign, Google Search Console (slower-moving metrics)

Common status schema

All check results normalize to:

typescript
{
  service: string;
  status: 'healthy' | 'degraded' | 'down';
  latency_ms: number;
  message: string;
  checked_at: string;
  metadata: Record<string, unknown>;
}

This means the dashboard and alerting layer don't need to know anything about individual check implementations. Adding a new service is writing one check function that returns this schema.

Alerting

Slack integration

When any check transitions from healthy to degraded or down, a Slack webhook fires with:

  • Service name and current status
  • What specifically failed (e.g., "SSL cert for domain X expires in 3 days")
  • Link to the dashboard for details
  • Suggested remediation (e.g., "Run certbot renew" or "Check Pipedream workflow X")

Alerts are debounced — a flapping service (healthy → degraded → healthy within 10 minutes) sends one alert, not three. Recovery notifications fire when a previously-down service returns to healthy.

Incident tracking

The dashboard stores check history with 90-day retention. Each status transition is logged, enabling:

  • Uptime calculation — % of time each service was in healthy state
  • Incident timeline — when did it break, how long was it down, when was it fixed
  • Trend detection — is a service's latency gradually increasing before it fails?

Technical decisions

  • Synthetic monitoring over log-based — the fund doesn't run its own servers (everything is SaaS or serverless). Log aggregation doesn't apply. Synthetic checks from the outside are the right model.
  • Vercel cron over dedicated scheduler — Vercel's cron jobs (vercel.json cron expressions) handle the scheduling. No need for a separate cron service or always-on server.
  • Slack over PagerDuty — the team lives in Slack. Adding PagerDuty would mean another tool to check. Slack webhooks with channel-based routing (critical → #ops-alerts, informational → #ops-health) covers the use case.

Stack

Next.js 15, Vercel Cron, Slack Webhooks, Postmark API, Google Search Console API, Airtable API, Pipedream API, ActiveCampaign API, Tailwind CSS, Vercel

Shen Nan Wong

Sole engineer at Iterative (early-stage VC fund, SEA & South Asia), building data infrastructure and AI systems for investment operations. Founder of Fracxional — AI and data engineering education for enterprises and universities across Asia. Based in Ho Chi Minh City, Southeast Asia.