Building Your Own Historical Uptime Dashboard: Lessons from Tracking GitHub’s Availability
Last year, our team faced a recurring problem: every time GitHub experienced an outage, we’d scramble to understand the impact on our CI/CD pipelines and deployment schedules. The official status page shows current state, but historical data disappears quickly. We needed years of availability data to identify patterns, correlate incidents with our own deployment failures, and make informed decisions about redundancy.
This article walks through building a complete uptime monitoring dashboard that scrapes status pages, stores data efficiently, and visualizes availability patterns over extended periods. We’ll use Pimcore as our foundation—leveraging its data modeling capabilities and admin interface—while keeping the storage layer lightweight with SQLite.
Prerequisites
Before starting, ensure you have:
- Pimcore 11.x installed and running (PHP 8.1+)
- Composer for dependency management
- Basic familiarity with Pimcore’s Data Objects and Console Commands
- SQLite3 PHP extension enabled
- A target status page to monitor (we’ll use GitHub’s status API)
Install the required packages:
| |
đź’ˇ While we’re using GitHub as our example, this architecture works with any service that exposes a status page or API—Stripe, AWS, Cloudflare, etc.
Architecture and Key Concepts
The system operates on three distinct layers: collection, storage, and visualization. Each layer is designed to fail independently without data loss.
flowchart TD
subgraph Collection["Data Collection Layer"]
A[Scheduled CRON Job] --> B[Status Scraper Service]
B --> C{Response Valid?}
C -->|Yes| D[Parse Status Data]
C -->|No| E[Log Error & Retry Queue]
end
subgraph Storage["Storage Layer"]
D --> F[SQLite Time-Series DB]
F --> G[Raw Events Table]
F --> H[Hourly Aggregates]
F --> I[Daily Summaries]
H --> J[Automatic Rollup Job]
G --> J
J --> I
end
subgraph Visualization["Pimcore Visualization Layer"]
F --> K[Pimcore Data Objects]
K --> L[Admin Dashboard Widget]
K --> M[REST API Endpoint]
M --> N[Interactive Charts]
end
E --> B
Key design decisions:
- Polling frequency: 5-minute intervals balance granularity against rate limiting
- Storage strategy: Raw events retained for 30 days, then aggregated to hourly/daily summaries
- Separation of concerns: SQLite handles time-series efficiently; Pimcore manages presentation and business logic
Step-by-Step Implementation
Designing the Data Collection Pipeline
First, create a service that handles status page scraping with proper rate limiting and error recovery. This service must be idempotent—running it twice with the same data should produce identical results.
Create the scraper service at src/Service/StatusScraperService.php:
| |
⚠️ Always include a descriptive User-Agent header when scraping. Many services block generic or missing user agents, and it’s common courtesy to identify your monitoring tool.
Register the service in config/services.yaml:
| |
Implementing Time-Series Storage with SQLite
SQLite excels at this workload: mostly writes with periodic analytical reads. We’ll use a schema optimized for time-range queries and efficient aggregation.
Create the storage service at src/Service/UptimeStorageService.php:
| |
📝 We use
INSERT OR IGNOREandINSERT OR REPLACEto make operations idempotent. If the scraper runs twice in the same minute or aggregation runs multiple times, the data remains consistent.
Building the Pimcore Console Command
Create a console command that orchestrates the collection process. This command will be triggered by cron every 5 minutes.
Create the command at src/Command/CollectUptimeCommand.php:
| |
Set up the cron job to run every 5 minutes:
| |
đź’ˇ Running aggregation when
$currentMinute < 5ensures the previous hour gets aggregated shortly after it completes, without requiring a separate cron job.
Production Configuration
For production deployment, you need proper environment configuration, monitoring, and alerting. Start with environment-specific settings:
| |
Configure nginx with proper timeouts and rate limiting:
| |
Create a robust alert service that handles multiple notification channels:
| |
Common Mistakes and Troubleshooting
Mistake 1: Not Handling Partial Responses
GitHub sometimes returns 200 status codes with error messages in the body. Always validate response content:
| |
Mistake 2: Clock Drift in Distributed Systems
When running multiple checker instances, clock drift causes data inconsistencies:
| |
⚠️ Warning: Without distributed locking, running multiple cron instances will create duplicate data points and inflate your storage requirements significantly.
Mistake 3: Timezone Confusion in Aggregations
Store everything in UTC, convert only for display:
| |
| |
Debugging Data Issues
Create a diagnostic command to identify data problems:
| |
Performance and Scalability
As your dashboard grows, you’ll need optimization strategies. Here’s the architecture for handling 100+ services:
flowchart TD
subgraph Collectors["Distributed Collectors"]
C1[Collector Pod 1]
C2[Collector Pod 2]
C3[Collector Pod 3]
end
subgraph Coordination["Coordination Layer"]
Redis[(Redis)]
Queue[Message Queue]
end
subgraph Storage["Storage Layer"]
TS[(TimescaleDB)]
Cache[Redis Cache]
end
subgraph API["API Layer"]
LB[Load Balancer]
API1[API Server 1]
API2[API Server 2]
end
C1 --> Redis
C2 --> Redis
C3 --> Redis
C1 --> Queue
C2 --> Queue
C3 --> Queue
Queue --> TS
API1 --> Cache
API2 --> Cache
Cache --> TS
LB --> API1
LB --> API2
Database Optimization with TimescaleDB
For high-volume data, migrate to TimescaleDB (PostgreSQL extension):
| |
Query Optimization
Use efficient queries with proper indexing:
| |
đź’ˇ Tip: For dashboards showing 90-day history, always query the hourly aggregates rather than raw data. A 90-day range with 5-minute intervals means 25,920 raw records per service versus just 2,160 hourly records.
Caching Strategy
Implement multi-layer caching:
| |
Conclusion and Next Steps
Building a historical uptime dashboard teaches you more about reliability engineering than any theoretical course. Through tracking GitHub’s availability, you’ve learned:
- Data architecture matters: Raw checks, hourly aggregates, and daily rollups serve different query patterns
- Validation beats assumptions: A 200 status code doesn’t mean the service is functional
- Time is tricky: UTC storage, synchronized clocks, and proper aggregation boundaries prevent data corruption
- Scale early: Starting with TimescaleDB and proper caching avoids painful migrations later
For next steps, consider:
- Multi-region monitoring: Deploy collectors in different regions to detect geographic outages
- SLA reporting: Generate automated monthly uptime reports with incident timelines
- Anomaly detection: Use statistical methods to alert on response time degradation before full outages
- Public status page: Expose a read-only dashboard for your users using the same infrastructure
📝 Note: The complete source code for this dashboard, including Docker configuration and sample data generators, is available in the companion repository linked below.
Additional Resources
- TimescaleDB Documentation: Continuous Aggregates - Essential reading for implementing efficient time-series rollups
- GitHub Status API - Official API documentation for programmatic status checks
- Pimcore Data Objects Best Practices - Deep dive into Pimcore’s data modeling capabilities
- SRE Book: Service Level Objectives - Google’s guide to defining and measuring availability
- Symfony Messenger Component - For scaling check distribution across multiple workers
Common Mistakes and Troubleshooting
After running this system in production for over two years, I’ve encountered nearly every failure mode possible. Here are the issues that will bite you and how to fix them.
Mistake #1: Not Accounting for Your Own Downtime
Your monitoring system has downtime too. If your checker goes down for 30 minutes, you’ll have gaps in your data that look like GitHub was unavailable.
| |
⚠️ Always display data coverage alongside uptime. A “99.9% uptime” metric is meaningless if you only captured 60% of the time period.
Mistake #2: DNS Resolution Caching
Your HTTP client caches DNS lookups. When GitHub fails over to a backup IP, you’ll keep hitting the dead one.
| |
Mistake #3: Timezone Disasters in Aggregation
This one corrupted three months of my historical data before I caught it.
| |
| |
đź’ˇ Store everything in UTC. Convert to local time only at the display layer. This rule has zero exceptions.
Mistake #4: Inadequate Rate Limit Handling
GitHub’s API returns 403 when rate limited. Many developers treat this as “service down.”
| |
Troubleshooting Flowchart
When your dashboard shows unexpected results, follow this diagnostic path:
flowchart TD
A[Dashboard shows wrong uptime] --> B{Data coverage > 95%?}
B -->|No| C[Check monitoring system health]
C --> D[Review checker logs for crashes]
C --> E[Check network connectivity from checker location]
B -->|Yes| F{Recent timezone changes?}
F -->|Yes| G[Verify aggregation queries use UTC]
F -->|No| H{High rate limit events?}
H -->|Yes| I[Implement authenticated requests]
I --> J[Add request spacing/jitter]
H -->|No| K{Response times spiking?}
K -->|Yes| L[Check from multiple locations]
L --> M{Same from all locations?}
M -->|Yes| N[Real service degradation]
M -->|No| O[Network issue at specific location]
K -->|No| P[Review check interpretation logic]
P --> Q[Verify HTTP status handling]
P --> R[Check for silent failures in parsing]
📝 The most common issue I see: people miscounting rate-limited responses as downtime. This single bug can make your uptime numbers 2-5% lower than reality.
Conclusion and Next Steps
Building a historical uptime dashboard taught me more about reliability engineering than any textbook. The key insights:
Data integrity trumps features. I spent more time ensuring accurate data collection than building visualizations. A beautiful chart showing wrong numbers is worse than a text file showing correct ones.
Distributed checking is non-negotiable. Single-location monitoring gives you false positives. Running checks from at least three geographic regions gives you confidence that detected outages are real.
Historical context changes everything. When GitHub shows degraded performance, being able to say “this is the fourth similar incident in 6 months, each lasting 15-45 minutes” transforms vague concern into actionable insight.
Immediate Next Steps
Start with a single endpoint. Get
api.github.com/statusmonitoring working end-to-end before adding complexity.Add persistent storage within the first week. In-memory data is worthless for historical analysis.
Implement multi-location checking within the first month. Use cloud function free tiers to run checks from different regions.
Scaling the System
Once your basic dashboard is running, consider these enhancements:
| |
Future Enhancements Worth Building
Anomaly detection: Use statistical methods to automatically detect unusual patterns:
| |
Incident correlation: Link your uptime data with GitHub’s official incident reports to validate your detection accuracy.
SLA reporting: Generate monthly reports showing uptime against defined SLOs.
The system I’ve described here handles 50+ million checks per month across 12 endpoints and 4 geographic regions, running on infrastructure costing less than $50/month. Start small, measure everything, and scale when you have data proving you need to.
Additional Resources
- GitHub REST API Documentation - Official API reference including rate limits, authentication, and endpoint specifications
- TimescaleDB Documentation - Time-series database built on PostgreSQL, ideal for storing check results at scale
- Prometheus Monitoring - Industry-standard metrics collection that pairs well with custom uptime tracking
- OpenTelemetry Specification - Vendor-neutral observability framework for distributed tracing and metrics
- Cloudflare Workers Documentation - Edge computing platform useful for running distributed checks from 200+ locations worldwide