Costruisci un sistema di tracking uptime storico per API esterne usando Pimcore, SQLite e Node.js. Architettura completa con codice per VPS economici.

Building Your Own Historical Uptime Dashboard: Lessons from Tracking GitHub’s Availability

Monitorare l’uptime di servizi esterni è fondamentale quando il tuo business dipende da API di terze parti. Dopo aver subito tre incidenti critici in produzione causati da degradazioni di GitHub non rilevate tempestivamente, ho costruito un sistema di tracking storico che oggi raccoglie dati da oltre 18 mesi. In questo articolo condivido l’architettura e il codice per replicare questo sistema, integrandolo con Pimcore come backend di gestione e visualizzazione.

Prerequisiti

PHP 8.1+ con estensioni pdo_sqlite, curl, json
Pimcore 11.x installato e funzionante
Node.js 18+ per il frontend di visualizzazione
Composer e npm configurati
Familiarità con i Data Objects di Pimcore
Conoscenza base di SQLite e time-series data

💡 Il sistema è progettato per funzionare su un VPS da 5€/mese — nessun database server dedicato richiesto.

Architettura e Concetti Chiave

Il sistema si compone di tre layer principali: Collection (scraping rispettoso delle status page), Storage (persistenza efficiente con SQLite + aggregazioni), e Visualization (dashboard interattiva integrata in Pimcore).

flowchart TD
    subgraph Collection["Layer di Collection"]
        A[Cron Job ogni 5 min] --> B[Status Page Scraper]
        B --> C{Rate Limiter}
        C -->|OK| D[Parser JSON/HTML]
        C -->|Throttled| E[Backoff Queue]
        E --> C
    end
    
    subgraph Storage["Layer di Storage"]
        D --> F[SQLite Raw Events]
        F --> G[Aggregation Worker]
        G --> H[Hourly Rollups]
        G --> I[Daily Rollups]
        H --> J[Pimcore Data Objects]
        I --> J
    end
    
    subgraph Visualization["Layer di Visualization"]
        J --> K[REST API Controller]
        K --> L[Chart.js Dashboard]
        L --> M[Incident Timeline]
        L --> N[Trend Analysis]
    end
    
    F -.->|Backup giornaliero| O[(Flat File Archive)]

Il flusso garantisce tre proprietà essenziali:

Resilienza: il rate limiter previene ban dalle status page
Efficienza: le aggregazioni riducono i dati raw da GB a MB
Granularità: manteniamo i dati grezzi per analisi forensi

📝 GitHub espone la sua status page via API JSON su https://www.githubstatus.com/api/v2/status.json — non serve parsing HTML.

Implementazione Passo-Passo

Configurazione del Data Model in Pimcore

Prima di tutto, definiamo la struttura dati in Pimcore per memorizzare i rollup aggregati. Creiamo due classi: UptimeService per i servizi monitorati e UptimeDataPoint per i dati aggregati.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
# var/classes/definition_UptimeService.yaml
id: UptimeService
title: Uptime Service
description: Servizio esterno monitorato per uptime
parentClass: ''
implementsInterfaces: ''
listingParentClass: ''
useTraits: ''
listingUseTraits: ''
allowInherit: false
allowVariants: false
showFieldLookup: false
enableGridLocking: false
layoutDefinitions:
  fieldtype: panel
  name: pimcore_root
  type: null
  region: null
  title: null
  width: null
  height: null
  collapsible: false
  collapsed: false
  bodyStyle: null
  datatype: layout
  children:
    - fieldtype: panel
      name: Informazioni Servizio
      type: null
      region: null
      title: Informazioni Servizio
      width: null
      height: null
      collapsible: false
      collapsed: false
      bodyStyle: null
      datatype: layout
      children:
        - fieldtype: input
          name: serviceId
          title: ID Servizio
          tooltip: Identificatore univoco (es. github, gitlab)
          mandatory: true
          noteditable: false
          index: true
          locked: false
          style: ''
          width: 300
          defaultValue: null
          columnLength: 64
          regex: '^[a-z0-9_-]+$'
          regexFlags: ''
          unique: true
          showCharCount: false
          datatype: data
        - fieldtype: input
          name: statusPageUrl
          title: URL Status Page
          tooltip: Endpoint API della status page
          mandatory: true
          noteditable: false
          index: false
          locked: false
          style: ''
          width: 500
          defaultValue: null
          columnLength: 512
          regex: '^https?://'
          regexFlags: ''
          unique: false
          showCharCount: false
          datatype: data
        - fieldtype: select
          name: parserType
          title: Tipo Parser
          tooltip: Formato della status page
          mandatory: true
          noteditable: false
          index: false
          locked: false
          style: ''
          width: 200
          defaultValue: statuspage_io
          options:
            - key: StatusPage.io JSON
              value: statuspage_io
            - key: Atlassian Statuspage
              value: atlassian
            - key: Custom HTML
              value: custom_html
          datatype: data
        - fieldtype: numeric
          name: checkIntervalMinutes
          title: Intervallo Check (minuti)
          tooltip: Frequenza di polling
          mandatory: true
          noteditable: false
          index: false
          locked: false
          style: ''
          width: 150
          defaultValue: 5
          minValue: 1
          maxValue: 60
          integer: true
          unsigned: true
          decimalSize: null
          decimalPrecision: null
          unique: false
          datatype: data

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
# var/classes/definition_UptimeDataPoint.yaml
id: UptimeDataPoint
title: Uptime Data Point
description: Punto dati aggregato per uptime
parentClass: ''
layoutDefinitions:
  fieldtype: panel
  name: pimcore_root
  children:
    - fieldtype: panel
      name: Dati Aggregati
      children:
        - fieldtype: manyToOneRelation
          name: service
          title: Servizio
          tooltip: Riferimento al servizio monitorato
          mandatory: true
          noteditable: false
          index: true
          classes:
            - UptimeService
          datatype: data
        - fieldtype: datetime
          name: periodStart
          title: Inizio Periodo
          tooltip: Timestamp inizio aggregazione
          mandatory: true
          noteditable: false
          index: true
          defaultValue: null
          useCurrentDate: false
          datatype: data
        - fieldtype: select
          name: aggregationType
          title: Tipo Aggregazione
          mandatory: true
          options:
            - key: Oraria
              value: hourly
            - key: Giornaliera
              value: daily
            - key: Settimanale
              value: weekly
          datatype: data
        - fieldtype: numeric
          name: uptimePercentage
          title: Uptime %
          tooltip: Percentuale disponibilità nel periodo
          mandatory: true
          decimalSize: 5
          decimalPrecision: 2
          minValue: 0
          maxValue: 100
          datatype: data
        - fieldtype: numeric
          name: totalChecks
          title: Check Totali
          tooltip: Numero di check nel periodo
          mandatory: true
          integer: true
          unsigned: true
          datatype: data
        - fieldtype: numeric
          name: failedChecks
          title: Check Falliti
          tooltip: Numero di check con stato non operativo
          mandatory: true
          integer: true
          unsigned: true
          datatype: data
        - fieldtype: textarea
          name: incidentsSummary
          title: Riepilogo Incidenti
          tooltip: JSON con dettagli incidenti nel periodo
          mandatory: false
          width: 500
          height: 150
          datatype: data

Dopo aver creato le definizioni, rigenera le classi:

1
bin/console pimcore:deployment:classes-rebuild

Sistema di Scraping con Rate Limiting Intelligente

Il cuore del sistema è lo scraper che rispetta i limiti delle status page. Implementiamo un servizio Symfony che gestisce backoff esponenziale e caching.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
<?php
// src/Service/StatusPageScraperService.php

namespace App\Service;

use Pimcore\Model\DataObject\UptimeService;
use Psr\Log\LoggerInterface;
use Symfony\Contracts\Cache\CacheInterface;
use Symfony\Contracts\Cache\ItemInterface;
use Symfony\Contracts\HttpClient\HttpClientInterface;
use Symfony\Contracts\HttpClient\Exception\TransportExceptionInterface;

class StatusPageScraperService
{
    // Configurazione rate limiting per dominio
    private const RATE_LIMITS = [
        'githubstatus.com' => ['requests_per_minute' => 10, 'min_interval_seconds' => 6],
        'status.atlassian.com' => ['requests_per_minute' => 12, 'min_interval_seconds' => 5],
        'default' => ['requests_per_minute' => 6, 'min_interval_seconds' => 10],
    ];
    
    // Backoff esponenziale: tentativi e moltiplicatore
    private const MAX_RETRIES = 3;
    private const BACKOFF_MULTIPLIER = 2;
    private const INITIAL_BACKOFF_SECONDS = 5;
    
    private array $lastRequestTime = [];
    
    public function __construct(
        private readonly HttpClientInterface $httpClient,
        private readonly CacheInterface $cache,
        private readonly LoggerInterface $logger,
        private readonly string $sqliteDbPath
    ) {}
    
    /**
     * Esegue lo scraping di tutti i servizi attivi
     * Chiamato dal cron job ogni 5 minuti
     */
    public function scrapeAllServices(): array
    {
        $results = [];
        
        // Recupera tutti i servizi pubblicati
        $services = new UptimeService\Listing();
        $services->setCondition('published = 1');
        $services->setOrderKey('checkIntervalMinutes');
        $services->setOrder('ASC');
        
        foreach ($services as $service) {
            try {
                // Verifica se è il momento di fare il check
                if (!$this->shouldCheck($service)) {
                    $this->logger->debug('Skip check per {service}, non ancora tempo', [
                        'service' => $service->getServiceId()
                    ]);
                    continue;
                }
                
                $result = $this->scrapeService($service);
                $results[$service->getServiceId()] = $result;
                
                // Salva immediatamente nel database SQLite
                $this->persistRawDataPoint($service, $result);
                
            } catch (\Exception $e) {
                $this->logger->error('Errore scraping {service}: {error}', [
                    'service' => $service->getServiceId(),
                    'error' => $e->getMessage()
                ]);
                $results[$service->getServiceId()] = [
                    'success' => false,
                    'error' => $e->getMessage(),
                    'timestamp' => time()
                ];
            }
        }
        
        return $results;
    }
    
    /**
     * Scraping singolo servizio con retry e rate limiting
     */
    public function scrapeService(UptimeService $service): array
    {
        $url = $service->getStatusPageUrl();
        $domain = parse_url($url, PHP_URL_HOST);
        $rateConfig = self::RATE_LIMITS[$domain] ?? self::RATE_LIMITS['default'];
        
        // Applica rate limiting
        $this->waitForRateLimit($domain, $rateConfig['min_interval_seconds']);
        
        $attempt = 0;
        $backoffSeconds = self::INITIAL_BACKOFF_SECONDS;
        $lastException = null;
        
        while ($attempt < self::MAX_RETRIES) {
            try {
                $response = $this->httpClient->request('GET', $url, [
                    'timeout' => 10,
                    'headers' => [
                        'User-Agent' => 'UptimeTracker/1.0 (monitoring; contact@example.com)',
                        'Accept' => 'application/json',
                    ],
                ]);
                
                $statusCode = $response->getStatusCode();
                $this->lastRequestTime[$domain] = microtime(true);
                
                // Gestione rate limit response (429)
                if ($statusCode === 429) {
                    $retryAfter = $response->getHeaders(false)['retry-after'][0] ?? $backoffSeconds;
                    $this->logger->warning('Rate limited da {domain}, attendo {seconds}s', [
                        'domain' => $domain,
                        'seconds' => $retryAfter
                    ]);
                    sleep((int)$retryAfter);
                    $attempt++;
                    $backoffSeconds *= self::BACKOFF_MULTIPLIER;
                    continue;
                }
                
                if ($statusCode !== 200) {
                    throw new \RuntimeException("HTTP {$statusCode} da {$url}");
                }
                
                $content = $response->getContent();
                return $this->parseStatusPage($service, $content);
                
            } catch (TransportExceptionInterface $e) {
                $lastException = $e;
                $this->logger->warning('Tentativo {attempt}/{max} fallito per {service}: {error}', [
                    'attempt' => $attempt + 1,
                    'max' => self::MAX_RETRIES,
                    'service' => $service->getServiceId(),
                    'error' => $e->getMessage()
                ]);
                
                sleep($backoffSeconds);
                $attempt++;
                $backoffSeconds *= self::BACKOFF_MULTIPLIER;
            }
        }
        
        throw new \RuntimeException(
            "Scraping fallito dopo " . self::MAX_RETRIES . " tentativi: " . 
            ($lastException?->getMessage() ?? 'Unknown error')
        );
    }
    
    /**
     * Parser per diversi formati di status page
     */
    private function parseStatusPage(UptimeService $service, string $content): array
    {
        $parserType = $service->getParserType();
        $timestamp = time();
        
        return match($parserType) {
            'statuspage_io' => $this->parseStatusPageIo($content, $timestamp),
            'atlassian' => $this->parseAtlassian($content, $timestamp),
            'custom_html' => $this->parseCustomHtml($content, $timestamp),
            default => throw new \InvalidArgumentException("Parser sconosciuto: {$parserType}")
        };
    }
    
    /**
     * Parser specifico per StatusPage.io (usato da GitHub)
     */
    private function parseStatusPageIo(string $content, int $timestamp): array
    {
        $data = json_decode($content, true, 512, JSON_THROW_ON_ERROR);
        
        // Mappa stati StatusPage.io ai nostri stati normalizzati
        $statusMap = [
            'none' => 'operational',
            'minor' => 'degraded',
            'major' => 'partial_outage',
            'critical' => 'major_outage',
            'maintenance' => 'maintenance',
        ];
        
        $indicator = $data['status']['indicator'] ?? 'none';
        $normalizedStatus = $statusMap[$indicator] ?? 'unknown';
        
        // Estrai componenti con i loro stati
        $components = [];
        foreach ($data['components'] ?? [] as $component) {
            $components[$component['id']] = [
                'name' => $component['name'],
                'status' => $component['status'],
                'updated_at' => $component['updated_at'] ?? null,
            ];
        }
        
        // Estrai incidenti attivi
        $activeIncidents = [];
        foreach ($data['incidents'] ?? [] as $incident) {
            if ($incident['status'] !== 'resolved') {
                $activeIncidents[] = [
                    'id' => $incident['id'],
                    'name' => $incident['name'],
                    'status' => $incident['status'],
                    'impact' => $incident['impact'],
                    'created_at' => $incident['created_at'],
                ];
            }
        }
        
        return [
            'success' => true,
            'timestamp' => $timestamp,
            'status' => $normalizedStatus,
            'status_description' => $data['status']['description'] ?? '',
            'components' => $components,
            'active_incidents' => $activeIncidents,
            'raw_indicator' => $indicator,
        ];
    }
    
    /**
     * Attende il rispetto del rate limit per dominio
     */
    private function waitForRateLimit(string $domain, int $minIntervalSeconds): void
    {
        if (!isset($this->lastRequestTime[$domain])) {
            return;
        }
        
        $elapsed = microtime(true) - $this->lastRequestTime[$domain];
        $waitTime = $minIntervalSeconds - $elapsed;
        
        if ($waitTime > 0) {
            $this->logger->debug('Rate limit: attendo {wait}s per {domain}', [
                'wait' => round($waitTime, 2),
                'domain' => $domain
            ]);
            usleep((int)($waitTime * 1_000_000));
        }
    }
    
    /**
     * Verifica se è il momento di eseguire il check
     */
    private function shouldCheck(UptimeService $service): bool
    {
        $cacheKey = 'last_check_' . $service->getServiceId();
        
        return $this->cache->get($cacheKey, function (ItemInterface $item) use ($service) {
            // Prima esecuzione: esegui sempre
            $item->expiresAfter($service->getCheckIntervalMinutes() * 60);
            return true;
        });
    }
    
    /**
     * Persiste il dato raw in SQLite per analisi storiche
     */
    private function persistRawDataPoint(UptimeService $service, array $result): void
    {
        $db = new \PDO('sqlite:' . $this->sqliteDbPath);
        $db->setAttribute(\PDO::ATTR_ERRMODE, \PDO::ERRMODE_EXCEPTION);
        
        // Crea tabella se non esiste
        $db->exec('
            CREATE TABLE IF NOT EXISTS raw_status_checks (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                service_id TEXT NOT NULL,
                timestamp INTEGER NOT NULL,
                status TEXT NOT NULL,
                status_description TEXT,
                components_json TEXT,
                incidents_json TEXT,
                raw_response_hash TEXT,
                created_at TEXT DEFAULT CURRENT_TIMESTAMP
            )
        ');
        
        $db->exec('
            CREATE INDEX IF NOT EXISTS idx_service_timestamp 
            ON raw_status_checks(service_id, timestamp)
        ');
        
        $stmt = $db->prepare('
            INSERT INTO raw_status_checks 
            (service_id, timestamp, status, status_description, components_json, incidents_json, raw_response_hash)
            VALUES (?, ?, ?, ?, ?, ?, ?)
        ');
        
        $stmt->execute([
            $service->getServiceId(),
            $result['timestamp'],
            $result['status'] ?? 'unknown',
            $result['status_description'] ?? null,
            json_encode($result['components'] ?? []),
            json_encode($result['active_incidents'] ?? []),
            md5(json_encode($result)),
        ]);
    }
}

⚠️ Non ridurre l’intervallo di check sotto i 5 minuti per servizi pubblici. Rischi di essere bannato e violare i ToS.

Storage Time-Series con SQLite e Aggregazioni Automatiche

SQLite è sorprendentemente efficace per dati time-series fino a decine di milioni di righe. Il trucco è nelle aggregazioni pre-calcolate che riducono drasticamente i tempi di query.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
<?php
// src/Service/UptimeAggregationService.php

namespace App\Service;

use Pimcore\Model\DataObject\UptimeDataPoint;
use Pimcore\Model\DataObject\UptimeService;
use Psr\Log\LoggerInterface;

class UptimeAggregationService
{
    // Soglie per classificazione incidenti
    private const DEGRADED_STATUSES = ['degraded', 'partial_outage'];
    private const OUTAGE_STATUSES = ['major_outage', 'maintenance'];
    
    // Retention policy: quanti giorni di dati raw mantenere
    private const RAW_DATA_RETENTION_DAYS = 90;
    
    private \PDO $db;
    
    public function __construct(
        private readonly string $sqliteDbPath,
        private readonly LoggerInterface $logger
    ) {
        $this->db = new \PDO('sqlite:' . $this->sqliteDbPath);
        $this->db->setAttribute(\PDO::ATTR_ERRMODE, \PDO::ERRMODE_EXCEPTION);
        $this->initializeAggregationTables();
    }
    
    /**
     * Crea le tabelle per le aggregazioni se non esistono
     */
    private function initializeAggregationTables(): void
    {
        // Tabella aggregazioni orarie
        $this->db->exec('
            CREATE TABLE IF NOT EXISTS hourly_aggregations (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                service_id TEXT NOT NULL,
                hour_start INTEGER NOT NULL,
                total_checks INTEGER NOT NULL DEFAULT 0,
                operational_checks INTEGER NOT NULL DEFAULT 0,
                degraded_checks INTEGER NOT NULL DEFAULT 0,
                outage_checks INTEGER NOT NULL DEFAULT 0,
                uptime_percentage REAL NOT NULL DEFAULT 100.0,
                incidents_count INTEGER NOT NULL DEFAULT 0,
                incidents_json TEXT,
                calculated_at TEXT DEFAULT CURRENT_TIMESTAMP,
                UNIQUE(service_id, hour_start)
            )
        ');
        
        // Tabella aggregazioni giornaliere
        $this->db->exec('
            CREATE TABLE IF NOT EXISTS daily_aggregations (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                service_id TEXT NOT NULL,
                day_start INTEGER NOT NULL,
                total_checks INTEGER NOT NULL DEFAULT 0,
                operational_checks INTEGER NOT NULL DEFAULT 0,
                degraded_checks INTEGER NOT NULL DEFAULT 0,
                outage_checks INTEGER NOT NULL DEFAULT 0,
                uptime_percentage REAL NOT NULL DEFAULT 100.0,
                mttr_seconds INTEGER,
                longest_outage_seconds INTEGER,
                incidents_summary TEXT,
                calculated_at TEXT DEFAULT CURRENT_TIMESTAMP,
                UNIQUE(service_id, day_start)
            )
        ');
        
        // Indici per query veloci
        $this->db->exec('CREATE INDEX IF NOT EXISTS idx_hourly_service_hour ON hourly_aggregations(service_id, hour_start)');
        $this->db->exec('CREATE INDEX IF NOT EXISTS idx_daily_service_day ON daily_aggregations(service_id, day_start)');
    }
    
    /**
     * Calcola aggregazioni orarie per le ultime N ore
     * Chiamato ogni ora dal cron
     */
    public function calculateHourlyAggregations(int $hoursBack = 2): int
    {
        $processed = 0;
        $now = time();
        
        // Recupera tutti i service_id distinti
        $services = $this->db->query('SELECT DISTINCT service_id FROM raw_status_checks')->fetchAll(\PDO::FETCH_COLUMN);
        
        foreach ($services as $serviceId) {
            for ($i = 0; $i < $hoursBack; $i++) {
                $hourStart = strtotime(date('Y-m-d H:00:00', $now - ($i * 3600)));
                $hourEnd = $hourStart + 3600;
                
                // Query per aggregare i dati raw dell'ora
                $stmt = $this->db->prepare('
                    SELECT 
                        COUNT(*) as total_checks,
                        SUM(CASE WHEN status = "operational" THEN 1 ELSE 0 END) as operational,
                        SUM(CASE WHEN status IN ("degraded", "partial_outage") THEN 1 ELSE 0 END) as degraded,
                        SUM(CASE WHEN status IN ("major_outage", "maintenance") THEN 1 ELSE 0 END) as outage,
                        GROUP_CONCAT(DISTINCT incidents_json) as all_incidents
                    FROM raw_status_checks
                    WHERE service_id = ? 
                      AND timestamp >= ? 
                      AND timestamp < ?
                ');
                
                $stmt->execute([$serviceId, $hourStart, $hourEnd]);
                $row = $stmt->fetch(\PDO::FETCH_ASSOC);
                
                if ($row['total_checks'] == 0) {
                    continue;
                }
                
                // Calcola uptime percentage
                // Considera "degraded" come 50% disponibile
                $effectiveUptime = $row['operational'] + ($row['degraded'] * 0.5);
                $uptimePercentage = ($effectiveUptime / $row['total_checks']) * 100;
                
                // Estrai incidenti unici
                $incidents = $this->extractUniqueIncidents($row['all_incidents']);
                
                // Upsert nell'aggregazione oraria
                $upsert = $this->db->prepare('
                    INSERT INTO hourly_aggregations 
                    (service_id, hour_start, total_checks, operational_checks, degraded_checks, 
                     outage_checks, uptime_percentage, incidents_count, incidents_json)
                    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
                    ON CONFLICT(service_id, hour_start) DO UPDATE SET
                        total_checks = excluded.total_checks,
                        operational_checks = excluded.operational_checks,
                        degraded_checks = excluded.degraded_checks,
                        outage_checks = excluded.outage_checks,
                        uptime_percentage = excluded.uptime_percentage,
                        incidents_count = excluded.incidents_count,
                        incidents_json = excluded.incidents_json,
                        calculated_at = CURRENT_TIMESTAMP
                ');
                
                $upsert->execute([
                    $serviceId,
                    $hourStart,
                    $row['total_checks'],
                    $row['operational'],
                    $row['degraded'],
                    $row['outage'],
                    round($uptimePercentage, 4),
                    count($incidents),
                    json_encode($incidents)
                ]);
                
                $processed++;
            }
        }
        
        $this->logger->info('Calcolate {count} aggregazioni orarie', ['count' => $processed]);
        return $processed;
    }
    
    /**
     * Calcola aggregazioni giornaliere con metriche avanzate (MTTR, longest outage)
     */
    public function calculateDailyAggregations(int $daysBack = 1): int
    {
        $processed = 0;
        $now = time();
        
        $services = $this->db->query('SELECT DISTINCT service_id FROM hourly_aggregations')->fetchAll(\PDO::FETCH_COLUMN);
        
        foreach ($services as $serviceId) {
            for ($i = 0; $i < $daysBack; $i++) {
                $dayStart = strtotime(date('Y-m-d 00:00:00', $now - ($i * 86400)));
                $dayEnd = $dayStart + 86400;
                
                // Aggrega dalle aggregazioni orarie
                $stmt = $this->db->prepare('
                    SELECT 
                        SUM(total_checks) as total_checks,
                        SUM(operational_checks) as operational,
                        SUM(degraded_checks) as degraded,
                        SUM(outage_checks) as outage,
                        AVG(uptime_percentage) as avg_uptime,
                        SUM(incidents_count) as total_incidents
                    FROM hourly_aggregations
                    WHERE service_id = ?
                      AND hour_start >= ?
                      AND hour_start < ?
                ');
                
                $stmt->execute([$serviceId, $dayStart, $dayEnd]);
                $row = $stmt->fetch(\PDO::FETCH_ASSOC);
                
                if (!$row['total_checks']) {
                    continue;
                }
                
                // Calcola MTTR (Mean Time To Recovery) dai dati raw
                $mttrData = $this->calculateMTTR($serviceId, $dayStart, $dayEnd);
                
                // Genera summary incidenti
                $incidentsSummary = $this->generateIncidentsSummary($serviceId, $dayStart, $dayEnd);
                
                $upsert = $this->db->prepare('
                    INSERT INTO daily_aggregations
                    (service_id, day_start, total_checks, operational_checks, degraded_checks,
                     outage_checks, uptime_percentage, mttr_seconds, longest_outage_seconds, incidents_summary)
                    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
                    ON CONFLICT(service_id, day_start) DO UPDATE SET
                        total_checks = excluded.total_checks,
                        operational_checks = excluded.operational_checks,
                        degraded_checks = excluded.degraded_checks,
                        outage_checks = excluded.outage_checks,
                        uptime_percentage = excluded.uptime_percentage,
                        mttr_seconds = excluded.mttr_seconds,
                        longest_outage_seconds = excluded.longest_outage_seconds,
                        incidents_summary = excluded.incidents_summary,
                        calculated_at = CURRENT_TIMESTAMP
                ');
                
                $upsert->execute([
                    $serviceId,
                    $dayStart,
                    $row['total_checks'],
                    $row['operational'],
                    $row['degraded'],
                    $row['outage'],
                    round($row['avg_uptime'], 4),
                    $mttrData['mttr'],
                    $mttrData['longest_outage'],
                    json_encode($incidentsSummary)
                ]);
                
                //

```php
                // Cleanup vecchi record (mantieni solo ultimi 90 giorni di dati raw)
                $this->cleanupOldData($serviceId, $dayStart);
            }
            
            $this->pdo->commit();
            return true;
            
        } catch (Exception $e) {
            $this->pdo->rollBack();
            throw $e;
        }
    }
}

Configurazione per Produzione

Il passaggio a produzione richiede attenzione su diversi fronti: rate limiting, gestione connessioni, logging strutturato e deployment automatizzato.

Docker Compose per Produzione

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
# docker-compose.prod.yml
version: '3.8'

services:
  uptime-collector:
    build:
      context: .
      dockerfile: Dockerfile.collector
    environment:
      - DB_HOST=postgres
      - DB_NAME=uptime_tracker
      - DB_USER=${DB_USER}
      - DB_PASSWORD=${DB_PASSWORD}
      - REDIS_HOST=redis
      - CHECK_INTERVAL=60
      - LOG_LEVEL=info
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    networks:
      - uptime-net
    healthcheck:
      test: ["CMD", "php", "healthcheck.php"]
      interval: 30s
      timeout: 10s
      retries: 3

  uptime-aggregator:
    build:
      context: .
      dockerfile: Dockerfile.aggregator
    environment:
      - DB_HOST=postgres
      - DB_NAME=uptime_tracker
      - DB_USER=${DB_USER}
      - DB_PASSWORD=${DB_PASSWORD}
    depends_on:
      - postgres
    # Esegue ogni ora via cron interno
    command: ["crond", "-f", "-d", "8"]
    networks:
      - uptime-net

  dashboard-api:
    build:
      context: .
      dockerfile: Dockerfile.api
    ports:
      - "8080:8080"
    environment:
      - DB_HOST=postgres
      - DB_NAME=uptime_tracker
      - DB_USER=${DB_USER_READONLY}
      - DB_PASSWORD=${DB_PASSWORD_READONLY}
      - CACHE_TTL=300
      - REDIS_HOST=redis
    depends_on:
      - postgres
      - redis
    deploy:
      replicas: 3
    networks:
      - uptime-net

  postgres:
    image: postgres:15-alpine
    environment:
      - POSTGRES_DB=uptime_tracker
      - POSTGRES_USER=${DB_USER}
      - POSTGRES_PASSWORD=${DB_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${DB_USER} -d uptime_tracker"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - uptime-net

  redis:
    image: redis:7-alpine
    command: redis-server --appendonly yes --maxmemory 256mb --maxmemory-policy allkeys-lru
    volumes:
      - redis_data:/data
    networks:
      - uptime-net

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
      - GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-piechart-panel
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - postgres
    networks:
      - uptime-net

volumes:
  postgres_data:
  redis_data:
  grafana_data:

networks:
  uptime-net:
    driver: bridge

💡 Usa un utente database read-only per l’API pubblica. Limita i danni in caso di SQL injection.

API REST con Caching Intelligente

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
// src/api/routes/uptime.ts
import { Router, Request, Response } from 'express';
import { Pool } from 'pg';
import Redis from 'ioredis';

const router = Router();
const pool = new Pool({
  host: process.env.DB_HOST,
  database: process.env.DB_NAME,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  max: 20,
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000,
});

const redis = new Redis({
  host: process.env.REDIS_HOST,
  port: 6379,
  retryDelayOnFailover: 100,
});

// Middleware per cache
const cacheMiddleware = (ttlSeconds: number) => {
  return async (req: Request, res: Response, next: Function) => {
    const cacheKey = `uptime:${req.originalUrl}`;
    
    try {
      const cached = await redis.get(cacheKey);
      if (cached) {
        res.setHeader('X-Cache', 'HIT');
        return res.json(JSON.parse(cached));
      }
    } catch (err) {
      // Redis down? Continua senza cache
      console.warn('Redis unavailable, skipping cache');
    }
    
    // Salva la risposta originale
    const originalJson = res.json.bind(res);
    res.json = (data: any) => {
      // Cache in background, non bloccare la risposta
      redis.setex(cacheKey, ttlSeconds, JSON.stringify(data)).catch(() => {});
      res.setHeader('X-Cache', 'MISS');
      return originalJson(data);
    };
    
    next();
  };
};

// GET /api/v1/services/:id/uptime
// Ritorna uptime degli ultimi N giorni
router.get(
  '/services/:id/uptime',
  cacheMiddleware(300), // 5 minuti di cache
  async (req: Request, res: Response) => {
    const { id } = req.params;
    const days = Math.min(parseInt(req.query.days as string) || 30, 365);
    
    const query = `
      SELECT 
        date,
        total_checks,
        operational_checks,
        degraded_checks,
        outage_checks,
        average_uptime,
        mttr_seconds,
        longest_outage_seconds
      FROM daily_summary
      WHERE service_id = $1
        AND date >= CURRENT_DATE - INTERVAL '${days} days'
      ORDER BY date DESC
    `;
    
    try {
      const result = await pool.query(query, [id]);
      
      // Calcola statistiche aggregate
      const stats = calculateAggregateStats(result.rows);
      
      res.json({
        service_id: id,
        period: {
          start: result.rows[result.rows.length - 1]?.date,
          end: result.rows[0]?.date,
          days: days
        },
        aggregate: stats,
        daily: result.rows
      });
    } catch (err) {
      console.error('Query failed:', err);
      res.status(500).json({ error: 'Internal server error' });
    }
  }
);

// GET /api/v1/services/:id/incidents
// Ritorna gli incidenti recenti
router.get(
  '/services/:id/incidents',
  cacheMiddleware(60), // 1 minuto - dati più freschi
  async (req: Request, res: Response) => {
    const { id } = req.params;
    const limit = Math.min(parseInt(req.query.limit as string) || 20, 100);
    
    const query = `
      SELECT 
        id,
        started_at,
        resolved_at,
        duration_seconds,
        severity,
        affected_components,
        root_cause,
        CASE 
          WHEN resolved_at IS NULL THEN 'ongoing'
          ELSE 'resolved'
        END as status
      FROM incidents
      WHERE service_id = $1
      ORDER BY started_at DESC
      LIMIT $2
    `;
    
    const result = await pool.query(query, [id, limit]);
    res.json({ incidents: result.rows });
  }
);

function calculateAggregateStats(rows: any[]): object {
  if (rows.length === 0) return {};
  
  const totalChecks = rows.reduce((sum, r) => sum + r.total_checks, 0);
  const operationalChecks = rows.reduce((sum, r) => sum + r.operational_checks, 0);
  const totalMttr = rows.filter(r => r.mttr_seconds > 0);
  
  return {
    overall_uptime: totalChecks > 0 
      ? ((operationalChecks / totalChecks) * 100).toFixed(4) 
      : null,
    average_mttr_seconds: totalMttr.length > 0
      ? Math.round(totalMttr.reduce((sum, r) => sum + r.mttr_seconds, 0) / totalMttr.length)
      : null,
    worst_day: rows.reduce((worst, r) => 
      r.average_uptime < (worst?.average_uptime ?? 100) ? r : worst, null),
    total_outage_minutes: Math.round(
      rows.reduce((sum, r) => sum + (r.outage_checks * 60), 0) / 60
    )
  };
}

export default router;

Architettura del Sistema di Produzione

flowchart TD
    subgraph External["Servizi Esterni"]
        GH[GitHub Status API]
        AWS[AWS Health API]
        CF[Cloudflare Status]
    end

    subgraph Collectors["Layer di Raccolta"]
        C1[Collector Pod 1]
        C2[Collector Pod 2]
        C3[Collector Pod 3]
    end

    subgraph DataLayer["Layer Dati"]
        PG[(PostgreSQL Primary)]
        PGR[(PostgreSQL Replica)]
        RD[(Redis Cache)]
    end

    subgraph API["Layer API"]
        LB[Load Balancer]
        A1[API Pod 1]
        A2[API Pod 2]
        A3[API Pod 3]
    end

    subgraph Processing["Background Jobs"]
        AGG[Aggregator Cron]
        ALERT[Alert Manager]
    end

    GH --> C1 & C2 & C3
    AWS --> C1 & C2 & C3
    CF --> C1 & C2 & C3

    C1 & C2 & C3 --> PG
    PG --> PGR
    PG --> AGG
    AGG --> PG
    PG --> ALERT

    LB --> A1 & A2 & A3
    A1 & A2 & A3 --> RD
    RD -.->|cache miss| PGR

    ALERT -->|webhook| SLACK[Slack]
    ALERT -->|email| EMAIL[Email Service]

Errori Comuni e Troubleshooting

1. Rate Limiting delle API Esterne

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
# utils/rate_limiter.py
import time
from functools import wraps
from collections import defaultdict
import threading

class AdaptiveRateLimiter:
    """
    Rate limiter che si adatta automaticamente ai response header
    delle API monitorate.
    """
    
    def __init__(self):
        self._locks = defaultdict(threading.Lock)
        self._state = defaultdict(lambda: {
            'remaining': 100,
            'reset_at': 0,
            'backoff': 1
        })
    
    def wait_if_needed(self, service: str) -> None:
        with self._locks[service]:
            state = self._state[service]
            now = time.time()
            
            # Reset backoff se è passato il tempo
            if now >= state['reset_at']:
                state['remaining'] = 100
                state['backoff'] = 1
            
            # Se siamo quasi al limite, aspetta
            if state['remaining'] <= 5:
                sleep_time = max(0, state['reset_at'] - now)
                if sleep_time > 0:
                    time.sleep(sleep_time + 1)
    
    def update_from_response(self, service: str, headers: dict) -> None:
        """Aggiorna lo stato basandosi sugli header di risposta."""
        with self._locks[service]:
            state = self._state[service]
            
            # Header comuni per rate limiting
            if 'X-RateLimit-Remaining' in headers:
                state['remaining'] = int(headers['X-RateLimit-Remaining'])
            if 'X-RateLimit-Reset' in headers:
                state['reset_at'] = int(headers['X-RateLimit-Reset'])
            
            # Retry-After per 429
            if 'Retry-After' in headers:
                retry_after = int(headers['Retry-After'])
                state['reset_at'] = time.time() + retry_after
                state['remaining'] = 0
    
    def handle_error(self, service: str, status_code: int) -> int:
        """Gestisce errori con exponential backoff."""
        with self._locks[service]:
            state = self._state[service]
            
            if status_code == 429:
                # Rate limited: backoff aggressivo
                state['backoff'] = min(state['backoff'] * 2, 300)
            elif status_code >= 500:
                # Server error: backoff moderato
                state['backoff'] = min(state['backoff'] * 1.5, 60)
            
            sleep_time = state['backoff']
            time.sleep(sleep_time)
            return sleep_time

# Utilizzo nel collector
rate_limiter = AdaptiveRateLimiter()

def check_service(service_id: str, url: str) -> dict:
    rate_limiter.wait_if_needed(service_id)
    
    try:
        response = requests.get(url, timeout=10)
        rate_limiter.update_from_response(service_id, response.headers)
        
        if response.status_code == 429:
            rate_limiter.handle_error(service_id, 429)
            return {'status': 'rate_limited', 'retry': True}
        
        return parse_status_response(response)
        
    except requests.exceptions.Timeout:
        return {'status': 'timeout', 'response_time': 10000}

⚠️ Non fidarti mai dei rate limit documentati. GitHub dice 60 req/h per utenti non autenticati, ma applica limiti più stretti durante picchi di traffico.

2. Drift Temporale nei Check

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
-- Query per identificare gap nei dati di monitoraggio
WITH expected_checks AS (
    SELECT 
        generate_series(
            date_trunc('hour', NOW() - INTERVAL '24 hours'),
            date_trunc('hour', NOW()),
            INTERVAL '1 minute'
        ) AS expected_time
),
actual_checks AS (
    SELECT 
        date_trunc('minute', checked_at) AS check_time,
        COUNT(*) AS check_count
    FROM status_checks
    WHERE checked_at >= NOW() - INTERVAL '24 hours'
    GROUP BY 1
)
SELECT 
    e.expected_time,
    COALESCE(a.check_count, 0) AS actual_count,
    CASE 
        WHEN a.check_count IS NULL THEN 'MISSING'
        WHEN a.check_count < 3 THEN 'PARTIAL'  -- Aspettiamo 5 servizi
        ELSE 'OK'
    END AS status
FROM expected_checks e
LEFT JOIN actual_checks a ON e.expected_time = a.check_time
WHERE COALESCE(a.check_count, 0) < 5
ORDER BY e.expected_time;

3. Problemi di Connessione al Database

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
<?php
// src/Database/ResilientConnection.php

class ResilientConnection
{
    private ?PDO $pdo = null;
    private int $maxRetries = 3;
    private array $config;
    
    public function __construct(array $config)
    {
        $this->config = $config;
    }
    
    public function getConnection(): PDO
    {
        if ($this->pdo !== null && $this->isConnectionAlive()) {
            return $this->pdo;
        }
        
        $lastException = null;
        
        for ($attempt = 1; $attempt <= $this->maxRetries; $attempt++) {
            try {
                $this->pdo = new PDO(
                    $this->buildDsn(),
                    $this->config['user'],
                    $this->config['password'],
                    [
                        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
                        PDO::ATTR_TIMEOUT => 5,
                        PDO::ATTR_PERSISTENT => false, // Evita connessioni zombie
                    ]
                );
                
                // Test immediato della connessione
                $this->pdo->query('SELECT 1');
                
                return $this->pdo;
                
            } catch (PDOException $e) {
                $lastException = $e;
                $this->pdo = null;
                
                // Log dell'errore
                error_log(sprintf(
                    '[DB] Connection attempt %d/%d failed: %s',
                    $attempt,
                    $this->maxRetries,
                    $e->getMessage()
                ));
                
                // Backoff esponenziale
                if ($attempt < $this->maxRetries) {
                    usleep(pow(2, $attempt) * 100000); // 200ms, 400ms, 800ms
                }
            }
        }
        
        throw new RuntimeException(
            'Database connection failed after ' . $this->maxRetries . ' attempts',
            0,
            $lastException
        );
    }
    
    private function isConnectionAlive(): bool
    {
        try {
            $this->pdo->query('SELECT 1');
            return true;
        } catch (PDOException $e) {
            return false;
        }
    }
    
    private function buildDsn(): string
    {
        return sprintf(
            'pgsql:host=%s;port=%d;dbname=%s;connect_timeout=5',
            $this->config['host'],
            $this->config['port'] ?? 5432,
            $this->config['database']
        );
    }
}

📝 PostgreSQL chiude le connessioni idle dopo idle_session_timeout. Con pool di connessioni lunghe, implementa sempre un health check periodico.

Performance e Scalabilità

Indici Ottimizzati per Query Frequenti

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
-- Indici critici per performance
-- 1. Query per dashboard: uptime degli ultimi N giorni per servizio
CREATE INDEX CONCURRENTLY idx_daily_summary_service_date 
ON daily_summary (service_id, date DESC);

-- 2. Query per trend: aggregazioni temporali
CREATE INDEX CONCURRENTLY idx_status_checks_time_service 
ON status_checks (checked_at DESC, service_id) 
INCLUDE (status, response_time_ms);

-- 3. Query per incidenti attivi
CREATE INDEX CONCURRENTLY idx_incidents_active 
ON incidents (service_id, started_at DESC) 
WHERE resolved_at IS NULL;

-- 4. Partizionamento per dati storici (PostgreSQL 12+)
-- Partiziona per mese i dati raw
CREATE TABLE status_checks_partitioned (
    LIKE status_checks INCLUDING ALL
) PARTITION BY RANGE (checked_at);

-- Crea partizioni automaticamente
CREATE OR REPLACE FUNCTION create_monthly_partition()
RETURNS void AS $$
DECLARE
    partition_date DATE;
    partition_name TEXT;
    start_date DATE;
    end_date DATE;
BEGIN
    -- Prossimo mese
    partition_date := date_trunc('month', CURRENT_DATE + INTERVAL '1 month');
    partition_name := 'status_checks_' || to_char(partition_date, 'YYYY_MM');
    start_date := partition_date;
    end_date := partition_date + INTERVAL '1 month';
    
    -- Crea se non esiste
    IF NOT EXISTS (
        SELECT 1 FROM pg_tables 
        WHERE tablename = partition_name
    ) THEN
        EXECUTE format(
            'CREATE TABLE %I PARTITION OF status_checks_partitioned
             FOR VALUES FROM (%L) TO (%L)',
            partition_name, start_date, end_date
        );
        
        RAISE NOTICE 'Created partition: %', partition_name;
    END IF;
END;
$$ LANGUAGE plpgsql;

-- Schedula con pg_cron
SELECT cron.schedule('create-partitions', '0 0 25 * *', 
    'SELECT create_monthly_partition()');

Benchmark e Metriche

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
# k6/load-test.js convertito in config YAML per chiarezza
# Esegui con: k6 run --config k6-config.yaml load-test.js

# k6-config.yaml
scenarios:
  dashboard_load:
    executor: ramping-vus
    startVUs: 0
    stages:
      - duration: 2m
        target: 50    # Ramp up
      - duration: 5m
        target: 50    # Stato stazionario
      - duration: 2m
        target: 100   # Picco
      - duration: 2m
        target: 0     # Ramp down

thresholds:
  http_req_duration:
    - p(95) < 200     # 95% delle richieste sotto 200ms
    - p(99) < 500     # 99% sotto 500ms
  http_req_failed:
    - rate < 0.01     # Meno dell'1% di errori
  
  # Threshold custom per endpoint specifici
  http_req_duration{endpoint:uptime}:
    - p(95) < 100     # Endpoint uptime deve essere veloce
  http_req_duration{endpoint:incidents}:
    - p(95) < 150

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
// k6/load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate, Trend } from 'k6/metrics';

// Metriche custom
const uptimeLatency = new Trend('uptime_endpoint_latency');
const cacheHitRate = new Rate('cache_hit_rate');

const BASE_URL = __ENV.API_URL || 'http://localhost:8080';
const SERVICES = ['github', 'aws', 'cloudflare', 'vercel', 'netlify'];

export default function() {
    // Simula pattern reale: dashboard carica dati di più servizi
    const service = SERVICES[Math.floor(Math.random() * SERVICES.length)];
    
    // 1. Carica uptime (chiamata più frequente)
    const uptimeRes = http.get(
        `${BASE_URL}/api/v1/services/${service}/uptime?days=30`,
        { tags: { endpoint: 'uptime' } }
    );
    
    uptimeLatency.add(uptimeRes.timings.duration);
    cacheHitRate.add(uptimeRes.headers['X-Cache'] === 'HIT' ? 1 : 0);
    
    check(uptimeRes, {
        'uptime status 200': (r) => r.status === 200,
        'uptime has data': (r) => JSON.parse(r.body).daily.length > 0,
        'uptime latency ok': (r) => r.timings.duration < 200,
    });
    
    // 2. Carica incidenti (meno frequente)
    if (Math.random() < 0.3) {
        const incidentsRes = http.get(
            `${BASE_URL}/api/v1/services/${service}/incidents?limit=10`,
            { tags: { endpoint: 'incidents' } }
        );
        
        check(incidentsRes, {
            'incidents status 200': (r) => r.status === 200,
        });
    }
    
    sleep(Math.random() * 2 + 1); // 1-3 secondi tra richieste
}

Conclusioni e Next Steps

Costruire un sistema di monitoraggio uptime robusto richiede attenzione su più fronti:

Quello che abbiamo coperto:

Schema database ottimizzato per time-series con aggregazioni pre-calcolate
Collector resiliente con gestione rate limiting adattivo
API con caching multi-livello
Deployment containerizzato production-ready

Prossimi passi consigliati:

Alerting intelligente: Integra con PagerDuty/Opsgenie usando soglie dinamiche basate su baseline storiche, non valori fissi
Correlazione incidenti: Quando GitHub va down, spesso impatta CI/CD di altri servizi. Implementa detection di incidenti correlati
Predizione: Con 6+ mesi di dati, puoi trainare modelli semplici per predire degradation basandosi su pattern temporali
SLA Reporting: Genera report automatici mensili per stakeholder, con calcolo preciso di SLA breach e crediti dovuti

💡 Il vero valore di un sistema di uptime tracking emerge dopo mesi di dati. Resisti alla tentazione di aggiungere feature prima di avere una base solida di dati storici.

Risorse Aggiuntive

Atlassian Statuspage API Documentation - Reference per integrare altri status page che usano la stessa piattaforma
PostgreSQL Time-Series Best Practices - Confronto approfondito tra PostgreSQL vanilla e TimescaleDB per workload time-series
Prometheus Alerting Rules - Pattern per alerting che puoi adattare al tuo sistema custom
Google SRE Book - Chapter 4: Service Level Objectives - Fondamentali su come definire e misurare SLO/SLI
GitHub Status API - Documentazione ufficiale dell’API che abbiamo usato come esempio principale

Errori Comuni e Troubleshooting

1. Rate Limiting Non Gestito

L’errore più frequente quando si interrogano API di status è ignorare i limiti di rate.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
// ❌ SBAGLIATO: chiamate senza controllo
async function checkAllServices() {
  const services = ['github', 'gitlab', 'bitbucket', 'azure'];
  // Questo può triggerare rate limiting immediato
  const results = await Promise.all(
    services.map(s => fetchStatus(s))
  );
  return results;
}

// ✅ CORRETTO: implementa backoff esponenziale e throttling
class RateLimitedFetcher {
  private requestQueue: Array<() => Promise<void>> = [];
  private processing = false;
  private readonly minInterval = 1000; // 1 secondo tra richieste

  async fetch(url: string, retries = 3): Promise<Response> {
    for (let attempt = 0; attempt < retries; attempt++) {
      try {
        const response = await this.throttledFetch(url);
        
        if (response.status === 429) {
          // Leggi header Retry-After se presente
          const retryAfter = response.headers.get('Retry-After');
          const waitTime = retryAfter 
            ? parseInt(retryAfter) * 1000 
            : Math.pow(2, attempt) * 1000;
          
          console.warn(`Rate limited. Attendo ${waitTime}ms...`);
          await this.sleep(waitTime);
          continue;
        }
        
        return response;
      } catch (error) {
        if (attempt === retries - 1) throw error;
        await this.sleep(Math.pow(2, attempt) * 1000);
      }
    }
    throw new Error('Max retries exceeded');
  }

  private async throttledFetch(url: string): Promise<Response> {
    return new Promise((resolve, reject) => {
      this.requestQueue.push(async () => {
        try {
          const result = await fetch(url);
          resolve(result);
        } catch (e) {
          reject(e);
        }
      });
      this.processQueue();
    });
  }

  private async processQueue() {
    if (this.processing || this.requestQueue.length === 0) return;
    this.processing = true;
    
    while (this.requestQueue.length > 0) {
      const request = this.requestQueue.shift()!;
      await request();
      await this.sleep(this.minInterval);
    }
    
    this.processing = false;
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

⚠️ Warning: GitHub Status API non ha rate limit documentati, ma altri provider (come Atlassian) limitano a 100 richieste/minuto. Implementa sempre throttling preventivo.

2. Timezone Hell nei Calcoli di Uptime

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
// ❌ SBAGLIATO: calcoli naive che ignorano timezone
function calculateDailyUptime(incidents: Incident[]): number {
  const today = new Date();
  const startOfDay = new Date(today.setHours(0, 0, 0, 0));
  // Questo usa il timezone locale del server!
  // Se il server è in UTC e l'utente in CET, i dati saranno sfasati
}

// ✅ CORRETTO: usa sempre UTC internamente, converti solo per display
import { zonedTimeToUtc, utcToZonedTime, format } from 'date-fns-tz';

interface UptimeCalculator {
  calculateForPeriod(
    incidents: Incident[],
    startUtc: Date,
    endUtc: Date
  ): UptimeMetrics;
}

class TimezoneAwareUptimeCalculator implements UptimeCalculator {
  // Tutti i calcoli interni in UTC
  calculateForPeriod(
    incidents: Incident[],
    startUtc: Date,
    endUtc: Date
  ): UptimeMetrics {
    const totalMinutes = (endUtc.getTime() - startUtc.getTime()) / 60000;
    
    let downtimeMinutes = 0;
    
    for (const incident of incidents) {
      // Assicurati che le date degli incident siano in UTC
      const incidentStart = new Date(incident.created_at);
      const incidentEnd = incident.resolved_at 
        ? new Date(incident.resolved_at)
        : endUtc;
      
      // Calcola overlap con il periodo richiesto
      const overlapStart = Math.max(incidentStart.getTime(), startUtc.getTime());
      const overlapEnd = Math.min(incidentEnd.getTime(), endUtc.getTime());
      
      if (overlapStart < overlapEnd) {
        downtimeMinutes += (overlapEnd - overlapStart) / 60000;
      }
    }
    
    return {
      uptimePercent: ((totalMinutes - downtimeMinutes) / totalMinutes) * 100,
      downtimeMinutes,
      totalMinutes,
      periodStart: startUtc.toISOString(),
      periodEnd: endUtc.toISOString()
    };
  }

  // Converti per display in timezone utente
  formatForDisplay(date: Date, userTimezone: string): string {
    const zonedDate = utcToZonedTime(date, userTimezone);
    return format(zonedDate, 'dd/MM/yyyy HH:mm zzz', { timeZone: userTimezone });
  }
}

💡 Tip: Salva SEMPRE i timestamp in UTC nel database. Converti in timezone locale solo al momento del rendering nella UI.

3. Memory Leak nel Polling Continuo

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
// ❌ SBAGLIATO: interval che non viene mai pulito
class StatusMonitor {
  start() {
    setInterval(() => {
      this.checkStatus(); // Se checkStatus è async e lenta, si accumulano
    }, 30000);
  }
}

// ✅ CORRETTO: gestione corretta del lifecycle
class ManagedStatusMonitor {
  private intervalId: NodeJS.Timeout | null = null;
  private isChecking = false;
  private abortController: AbortController | null = null;

  async start(): Promise<void> {
    if (this.intervalId) {
      console.warn('Monitor già avviato');
      return;
    }

    // Prima esecuzione immediata
    await this.checkStatus();

    // Poi scheduling regolare
    this.intervalId = setInterval(async () => {
      // Evita sovrapposizioni se il check precedente è ancora in corso
      if (this.isChecking) {
        console.debug('Check precedente ancora in corso, skip');
        return;
      }
      await this.checkStatus();
    }, 30000);

    console.log('Monitor avviato');
  }

  async stop(): Promise<void> {
    // Cancella interval
    if (this.intervalId) {
      clearInterval(this.intervalId);
      this.intervalId = null;
    }

    // Abort richieste in corso
    if (this.abortController) {
      this.abortController.abort();
      this.abortController = null;
    }

    // Attendi che il check corrente finisca
    while (this.isChecking) {
      await new Promise(resolve => setTimeout(resolve, 100));
    }

    console.log('Monitor fermato');
  }

  private async checkStatus(): Promise<void> {
    this.isChecking = true;
    this.abortController = new AbortController();

    try {
      const response = await fetch('https://www.githubstatus.com/api/v2/status.json', {
        signal: this.abortController.signal
      });
      // ... elaborazione
    } catch (error) {
      if (error instanceof Error && error.name === 'AbortError') {
        console.debug('Check cancellato');
        return;
      }
      throw error;
    } finally {
      this.isChecking = false;
      this.abortController = null;
    }
  }
}

// Gestisci graceful shutdown
const monitor = new ManagedStatusMonitor();

process.on('SIGTERM', async () => {
  console.log('Ricevuto SIGTERM, shutdown in corso...');
  await monitor.stop();
  process.exit(0);
});

4. Schema Database Non Ottimizzato per Query Temporali

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
-- ❌ SBAGLIATO: query lenta su tabella grande
SELECT 
  DATE(created_at) as day,
  COUNT(*) as incident_count
FROM incidents
WHERE created_at >= NOW() - INTERVAL '90 days'
GROUP BY DATE(created_at);
-- Senza indice appropriato, questo fa full table scan

-- ✅ CORRETTO: indice ottimizzato e partizionamento
-- Crea indice specifico per query temporali
CREATE INDEX idx_incidents_created_at_service 
ON incidents (created_at DESC, service_id)
INCLUDE (severity, resolved_at);

-- Per tabelle molto grandi, usa partizionamento
CREATE TABLE incidents (
    id BIGSERIAL,
    service_id INTEGER NOT NULL,
    created_at TIMESTAMPTZ NOT NULL,
    resolved_at TIMESTAMPTZ,
    severity VARCHAR(20),
    description TEXT,
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- Crea partizioni mensili
CREATE TABLE incidents_2024_01 PARTITION OF incidents
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
    
CREATE TABLE incidents_2024_02 PARTITION OF incidents
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- Query ottimizzata con materialized view per dashboard
CREATE MATERIALIZED VIEW daily_uptime_summary AS
SELECT 
    service_id,
    DATE_TRUNC('day', created_at) as day,
    COUNT(*) as incident_count,
    SUM(
        EXTRACT(EPOCH FROM (COALESCE(resolved_at, NOW()) - created_at)) / 60
    ) as total_downtime_minutes
FROM incidents
GROUP BY service_id, DATE_TRUNC('day', created_at);

-- Refresh periodico (ogni ora)
CREATE INDEX idx_daily_summary ON daily_uptime_summary (service_id, day DESC);

📝 Note: Con PostgreSQL, usa EXPLAIN ANALYZE per verificare che le query usino gli indici. Un dashboard con 90 giorni di dati dovrebbe rispondere in <100ms.

Diagramma di Flusso per Troubleshooting

flowchart TD
    A[Problema Rilevato] --> B{Tipo di Errore?}
    
    B -->|429 Too Many Requests| C[Rate Limiting]
    C --> C1[Verifica Retry-After header]
    C1 --> C2[Implementa backoff esponenziale]
    C2 --> C3[Aggiungi request queue]
    
    B -->|Dati Inconsistenti| D[Timezone Issue]
    D --> D1[Verifica storage in UTC]
    D1 --> D2[Controlla conversioni display]
    D2 --> D3[Aggiungi timezone nei log]
    
    B -->|Memory Crescente| E[Memory Leak]
    E --> E1[Verifica cleanup interval]
    E1 --> E2[Controlla AbortController]
    E2 --> E3[Profila con --inspect]
    
    B -->|Query Lente| F[Database Issue]
    F --> F1[EXPLAIN ANALYZE]
    F1 --> F2{Usa indice?}
    F2 -->|No| F3[Crea indice appropriato]
    F2 -->|Sì ma lento| F4[Considera materialized view]
    
    C3 --> G[✅ Problema Risolto]
    D3 --> G
    E3 --> G
    F3 --> G
    F4 --> G

Conclusioni e Next Steps

Costruire un dashboard di uptime storico non è solo un esercizio tecnico — è un investimento nella comprensione dei tuoi dependency esterni. Dopo mesi di tracking di GitHub e altri servizi, ho imparato che:

I numeri di uptime pubblicati sono ottimistici: Il 99.95% dichiarato raramente corrisponde all’esperienza reale degli utenti. Il tuo sistema di monitoring cattura degradazioni che le status page ufficiali minimizzano.
I pattern si ripetono: Gli incident tendono a concentrarsi in orari specifici (deploy window) e giorni specifici (fine mese per servizi con billing). Questi pattern informano quando schedulare i tuoi deploy critici.
La granularità conta: Polling ogni 5 minuti vs ogni 30 secondi rivela realtà molto diverse. Per servizi critici, il costo computazionale del polling frequente è giustificato.

Roadmap Consigliata

Fase 1 (Settimana 1-2): MVP funzionante

Polling base per 2-3 servizi critici
Storage in SQLite (migra dopo)
Dashboard minimale con uptime percentuale

Fase 2 (Settimana 3-4): Alerting e storicizzazione

Notifiche su degradazioni
Migrazione a PostgreSQL con retention policy
API per integrazioni esterne

Fase 3 (Mese 2): Analytics avanzati

Correlazione tra incident e tuoi deployment
Predizione basata su pattern storici
Report automatici per stakeholder

Fase 4 (Ongoing): Scaling

Multi-region monitoring per latency check
Synthetic monitoring (non solo status API)
Integrazione con incident management interno

Metriche di Successo

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
# Definisci KPI per il tuo sistema di monitoring
monitoring_kpis:
  detection_latency:
    description: "Tempo tra inizio incident e nostra rilevazione"
    target: "< 2 minuti"
    measurement: "timestamp_detected - incident_started_at"
  
  false_positive_rate:
    description: "Alert che non corrispondono a problemi reali"
    target: "< 5%"
    measurement: "false_alerts / total_alerts * 100"
  
  data_completeness:
    description: "Percentuale di intervalli con dati validi"
    target: "> 99.5%"
    measurement: "valid_datapoints / expected_datapoints * 100"
  
  query_performance:
    description: "P95 latency per query dashboard"
    target: "< 200ms"
    measurement: "percentile(query_duration, 95)"

💡 Tip finale: Non aspettare la perfezione. Lancia il tuo MVP, inizia a raccogliere dati, e itera. I dati storici che non raccogli oggi non potrai mai recuperarli.

Risorse Aggiuntive

Atlassian Statuspage API Documentation - La maggior parte delle status page (inclusa GitHub) usa Statuspage.io. Questa documentazione ti permette di interrogare qualsiasi servizio che lo utilizza
**[OpenTelemetry Collector Configuration

Dashboard Uptime Storico con Pimcore: Guida Completa al Monitoring

Building Your Own Historical Uptime Dashboard: Lessons from Tracking GitHub’s Availability

Prerequisiti

Architettura e Concetti Chiave

Implementazione Passo-Passo

Configurazione del Data Model in Pimcore

Sistema di Scraping con Rate Limiting Intelligente

Storage Time-Series con SQLite e Aggregazioni Automatiche

Configurazione per Produzione

Docker Compose per Produzione

API REST con Caching Intelligente

Architettura del Sistema di Produzione

Errori Comuni e Troubleshooting

1. Rate Limiting delle API Esterne

2. Drift Temporale nei Check

3. Problemi di Connessione al Database

Performance e Scalabilità

Indici Ottimizzati per Query Frequenti

Benchmark e Metriche

Conclusioni e Next Steps

Risorse Aggiuntive

Errori Comuni e Troubleshooting

1. Rate Limiting Non Gestito

2. Timezone Hell nei Calcoli di Uptime

3. Memory Leak nel Polling Continuo

4. Schema Database Non Ottimizzato per Query Temporali

Diagramma di Flusso per Troubleshooting

Conclusioni e Next Steps

Roadmap Consigliata

Metriche di Successo

Risorse Aggiuntive