# ADR-0002: Event-Driven Metrics Platform Architecture

## Status
Drafted - 2025-08-07
## Context
The AgentHub platform consists of multiple distributed services generating business events that need to be captured, processed, and analyzed for operational insights and business intelligence. Currently, there is no unified approach to metrics collection, leading to:
- Inconsistent Analytics: Different services use different logging and metrics approaches
- Limited Visibility: No centralized view of business operations across domains
- Schema Coupling: Reports directly access databases, limiting schema evolution
- Cost Concerns: Need for a cost-effective solution that can scale from 1K to 100K events/day
- Multiple Data Sources: Need to combine business events with frontend analytics (Matomo) and error tracking (Sentry)
## Decision
We will implement an Event-Driven Metrics Platform using Azure cloud services with the following architecture:
### Core Architecture Decisions
- Event Hub for Ingestion: Use Azure Event Hubs with automatic capture to Azure Data Lake
- Data Lake for Storage: Use Azure Data Lake Gen2 with medallion architecture (Bronze/Silver/Gold)
- Synapse Serverless for Processing: Use Azure Synapse Serverless SQL for transformations
- Data Factory for Orchestration: Use Azure Data Factory for pipeline automation
- Power BI for Analytics: Use Power BI with Direct Lake mode for reporting
- Schema Contracts: Implement event schema contracts to decouple data consumers from producers
### Technology Stack
| Layer | Technology | Justification |
|---|---|---|
| Ingestion | Azure Event Hubs (Standard) | Managed service, auto-capture, cost-effective at scale |
| Storage | Azure Data Lake Gen2 | 10x cheaper than SQL databases, schema flexibility |
| Processing | Synapse Serverless SQL | Pay-per-query, no idle costs, SQL compatibility |
| Orchestration | Azure Data Factory | Managed orchestration, visual pipelines, cost-effective |
| Analytics | Power BI Pro/Premium | Direct Lake mode, unified reporting, existing licenses |
| Format | Delta Lake/Parquet | ACID transactions, time travel, optimized for analytics |
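
To make the storage format concrete, here is a minimal sketch of appending curated rows to a Silver-layer Delta table, assuming the `deltalake` (delta-rs) Python package; the path, column names, and sample values are illustrative only, and in the real pipeline the target would be an `abfss://` URI into the Data Lake with credentials passed via `storage_options`.

```python
import pyarrow as pa
from deltalake import write_deltalake  # delta-rs Python bindings

# Illustrative rows for a hypothetical Silver table of CRM events.
silver_rows = pa.table({
    "contact_id": ["c-123"],
    "event_type": ["crm.contact.created"],
    "occurred_at": ["2025-08-07T10:15:00Z"],
})

# Each append is a Delta transaction, which is what provides the ACID and
# time-travel properties listed in the table above.
write_deltalake("datalake/silver/crm_events", silver_rows, mode="append")
```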
## Rationale
### Why Event Hubs over Other Options?
Considered Alternatives:
- Service Bus: More expensive, over-engineered for our use case
- Kafka on VMs: Higher operational overhead, more expensive
- Storage Queue: Limited throughput, no capture functionality

Decision: Event Hubs
- Auto-capture eliminates custom ingestion code
- Built-in partitioning for parallel processing
- Cost-effective at our scale (1K-100K events/day)
- Seamless integration with Azure Data Lake
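
As a rough illustration of the producer side, the sketch below publishes a single business event with the `azure-eventhub` Python SDK; the connection string, hub name, event type, and envelope fields are hypothetical placeholders rather than an established contract. Capture to the Data Lake then happens inside Event Hubs itself, so no further ingestion code is needed in the service.

```python
import json
from datetime import datetime, timezone

from azure.eventhub import EventData, EventHubProducerClient

# Hypothetical connection details for the CRM event hub.
CONNECTION_STR = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=..."
EVENT_HUB_NAME = "crm-events"

def publish_business_event(event_type: str, payload: dict) -> None:
    """Wrap the payload in a versioned envelope and publish it to Event Hubs."""
    envelope = {
        "event_type": event_type,
        "schema_version": "1.0",
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    producer = EventHubProducerClient.from_connection_string(
        CONNECTION_STR, eventhub_name=EVENT_HUB_NAME
    )
    with producer:
        batch = producer.create_batch()
        batch.add(EventData(json.dumps(envelope)))
        producer.send_batch(batch)

publish_business_event("crm.contact.created", {"contact_id": "c-123", "source": "web"})
```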
### Why Azure Data Lake Gen2 over SQL Database?
Considered Alternatives:
- Azure SQL Database: 10x more expensive, schema rigidity
- Cosmos DB: Expensive for analytics workloads, limited SQL support
- Table Storage: No ACID transactions, limited query capabilities

Decision: Data Lake Gen2
- Cost: $0.0208/GB vs $5-15/DTU for SQL Database
- Schema Flexibility: Schema-on-read for evolving event structures
- Scale: Petabyte scale without performance degradation
- Analytics: Optimized for large-scale analytical queries
### Why Synapse Serverless over Dedicated Pools?
Considered Alternatives:
- Synapse Dedicated Pools: $1,500+/month minimum cost
- Azure SQL Database: Fixed costs, limited analytics capabilities
- Databricks: More expensive, over-engineered for our use case

Decision: Synapse Serverless
- Pay-per-Query: Only pay for data scanned ($5/TB)
- No Minimum Cost: Perfect for variable workloads
- SQL Compatibility: Familiar query language for the team
- Direct Lake: Native integration with Power BI
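
To illustrate the pay-per-query model, the sketch below runs an ad-hoc aggregation over Gold-layer Parquet files through the workspace's serverless SQL endpoint using `pyodbc`; the endpoint, authentication mode, storage path, and column names are assumptions, and in practice Power BI or Synapse Studio would issue equivalent queries.

```python
import pyodbc

# Hypothetical serverless SQL endpoint, authenticated via Azure AD.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>-ondemand.sql.azuresynapse.net;"
    "Database=master;"
    "Authentication=ActiveDirectoryInteractive;"
)

# Serverless SQL reads the Parquet files in place with OPENROWSET, so the
# cost of this query is driven only by the amount of data scanned.
query = """
SELECT event_type, COUNT(*) AS event_count
FROM OPENROWSET(
    BULK 'https://<storageaccount>.dfs.core.windows.net/datalake/gold/business_metrics/*.parquet',
    FORMAT = 'PARQUET'
) AS gold
GROUP BY event_type;
"""

for event_type, event_count in conn.execute(query):
    print(event_type, event_count)
```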
### Why Schema Contracts over Direct Database Access?
Problems with Direct Database Access:
- Tight coupling between services and reporting
- Schema changes break reports
- Limited ability to evolve data models
- Performance impact on operational databases

Benefits of Schema Contracts:
- Decoupling: Services can evolve independently
- Versioning: Support for multiple schema versions
- Performance: Analytics don't impact operational systems
- Governance: Clear data contracts and ownership
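
As an example of what such a contract could look like, here is a minimal sketch of a versioned event contract as a Python dataclass; the event name, fields, and versioning convention are hypothetical and would in practice live in a shared package owned by the producing team, with analytics consumers binding to `event_type` and `schema_version` instead of the service's database schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class ContactCreatedV1:
    """Hypothetical v1 contract for a CRM 'contact created' business event."""
    contact_id: str
    source: str
    event_type: str = "crm.contact.created"
    schema_version: str = "1.0"
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_event(self) -> dict:
        """Serialize to the envelope shape that is published to Event Hubs."""
        return {
            "event_type": self.event_type,
            "schema_version": self.schema_version,
            "occurred_at": self.occurred_at,
            "payload": {"contact_id": self.contact_id, "source": self.source},
        }
```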
### Why Batch over Real-Time Processing?
Considered Alternatives:
- Stream Analytics: $100+/month per streaming unit
- Event Hubs with Functions: Complex error handling, higher costs
- Real-time processing: Higher complexity and costs

Decision: Batch Processing
- Cost: 80% cheaper than streaming solutions
- Simplicity: Easier to develop, test, and maintain
- Business Need: 1-hour latency acceptable for business metrics
- Reliability: Better error handling and retry mechanisms
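
For a sense of what one hourly batch step involves, below is a minimal sketch of the Raw-to-Bronze conversion: reading the Avro files written by Event Hub Capture and rewriting the embedded JSON payloads as Parquet. It assumes the `fastavro` and `pyarrow` packages, local paths standing in for the Data Lake containers, and Capture's standard record layout with the payload in the `Body` field; in the actual pipeline this step would be orchestrated by Data Factory and executed in Synapse rather than as a standalone script.

```python
import json
from pathlib import Path

import fastavro                 # reads the Avro files written by Event Hub Capture
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical local stand-ins for the raw and bronze containers.
RAW_DIR = Path("datalake/raw/crm-events/2025/08/07")
BRONZE_FILE = Path("datalake/bronze/crm_events/2025-08-07.parquet")

def raw_to_bronze() -> None:
    records = []
    for avro_file in RAW_DIR.glob("**/*.avro"):
        with avro_file.open("rb") as fh:
            for capture_record in fastavro.reader(fh):
                # Capture stores the original event bytes in the 'Body' field.
                event = json.loads(capture_record["Body"])
                event["enqueued_at"] = str(capture_record.get("EnqueuedTimeUtc"))
                records.append(event)
    if records:
        BRONZE_FILE.parent.mkdir(parents=True, exist_ok=True)
        pq.write_table(pa.Table.from_pylist(records), BRONZE_FILE)

raw_to_bronze()
```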
## Architecture Diagrams
### High-Level Architecture
```mermaid
graph TB
    subgraph "Event Sources"
        S1[CRM Service]
        S2[Matching Service]
        S3[Profile Service]
        FE[Angular Frontend]
        MA[Matomo Cloud]
        SE[Sentry]
    end

    subgraph "Azure Event Ingestion"
        EH1[Event Hub: CRM]
        EH2[Event Hub: Matching]
        EH3[Event Hub: Profile]
        EHC[Event Hub Capture]
    end

    subgraph "Azure Data Lake Gen2"
        RAW[Raw Layer<br/>Avro Files]
        BRONZE[Bronze Layer<br/>Parquet Files]
        SILVER[Silver Layer<br/>Delta Tables]
        GOLD[Gold Layer<br/>Business Metrics]
    end

    subgraph "Processing"
        ADF[Data Factory<br/>Orchestration]
        SYN[Synapse Serverless<br/>SQL Processing]
    end

    subgraph "Analytics"
        PBI[Power BI<br/>Dashboards]
        API1[Matomo API]
        API2[Sentry API]
    end

    S1 --> EH1
    S2 --> EH2
    S3 --> EH3
    EH1 --> EHC
    EH2 --> EHC
    EH3 --> EHC
    EHC --> RAW
    RAW --> BRONZE
    BRONZE --> SILVER
    SILVER --> GOLD
    ADF --> SYN
    SYN --> SILVER
    SYN --> GOLD
    GOLD --> PBI
    MA --> API1
    SE --> API2
    API1 --> PBI
    API2 --> PBI
    FE --> MA
```
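
The diagram shows Power BI pulling web analytics through the Matomo API alongside the Gold layer. As a rough illustration, the sketch below fetches a daily visits summary from the Matomo Reporting API with `requests`; the instance URL, site ID, and token are placeholders, and in practice Power BI's web connector would be pointed at the same endpoint.

```python
import requests

# Hypothetical Matomo Cloud instance and credentials.
MATOMO_URL = "https://example.matomo.cloud/index.php"
params = {
    "module": "API",
    "method": "VisitsSummary.get",
    "idSite": 1,
    "period": "day",
    "date": "yesterday",
    "format": "JSON",
    "token_auth": "<token>",
}

response = requests.get(MATOMO_URL, params=params, timeout=30)
response.raise_for_status()
print(response.json())
```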
### Data Flow Architecture
```mermaid
sequenceDiagram
    participant Service
    participant EventHub
    participant Capture
    participant DataLake
    participant DataFactory
    participant Synapse
    participant PowerBI

    Service->>EventHub: Publish Business Event
    EventHub->>Capture: Auto-capture (5min)
    Capture->>DataLake: Store Raw Events
    Note over DataFactory: Hourly Schedule
    DataFactory->>DataLake: Read Raw Events
    DataFactory->>Synapse: Execute Transformations
    Synapse->>DataLake: Write Processed Data
    PowerBI->>DataLake: Query Gold Layer
    PowerBI->>PowerBI: Refresh Dashboard
```
## Consequences
### Positive Consequences
- Cost Efficiency
    - Target: <$400/month for complete solution
    - 83% cost reduction compared to real-time alternatives
    - Pay-per-use model scales with business growth
- Schema Independence
    - Services can evolve schemas without breaking reports
    - Centralized schema governance and versioning
    - Reduced coupling between operational and analytical systems
- Unified Analytics
    - Single source of truth for business metrics
    - Combined view of business events, web analytics, and errors
    - Consistent reporting across all domains
- Scalability
    - Handle 100x growth (1K to 100K events/day) without redesign
    - Auto-scaling Event Hubs and serverless processing
    - Petabyte-scale storage capability
- Developer Productivity
    - Familiar SQL interface for transformations
    - Visual pipeline development with Data Factory
    - Self-service analytics with Power BI
### Negative Consequences
- Latency
    - 1-hour latency for business metrics (acceptable for current needs)
    - Not suitable for real-time alerting or operational dashboards
    - May require a separate solution for real-time use cases
- Complexity
    - Multiple Azure services to manage and monitor
    - Learning curve for the team on new technologies
    - More complex than direct database reporting
- Vendor Lock-in
    - Heavy dependency on Azure services
    - Migration to other platforms would require significant effort
    - Risk of Azure pricing changes affecting costs
- Query Costs
    - Synapse charges per TB of data scanned
    - Poorly optimized queries can increase costs
    - Need for query optimization and monitoring
## Monitoring and Success Metrics
### Technical KPIs
- Availability: >99.5% system uptime
- Data Freshness: <2 hours from event to dashboard
- Processing Success: >99% pipeline success rate
- Query Performance: <5 seconds for dashboard refresh
### Business KPIs
- Cost Efficiency: <$400/month total infrastructure cost
- User Adoption: >80% of stakeholders using dashboards weekly
- Decision Speed: 50% reduction in time to get business insights
- Data Quality: <1% of events failing validation
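
The data-quality KPI implies a validation step in the Bronze-to-Silver transformation. A minimal sketch of such a check, assuming the hypothetical event envelope used in the earlier examples:

```python
REQUIRED_FIELDS = {"event_type", "schema_version", "occurred_at", "payload"}

def is_valid_event(event: dict) -> bool:
    """Return False for events that should be quarantined instead of promoted to Silver."""
    return REQUIRED_FIELDS.issubset(event) and isinstance(event["payload"], dict)

# Events failing this check are counted against the <1% validation KPI.
```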
### Cost Tracking
- Monthly cost monitoring with budget alerts
- Cost-per-event tracking to ensure scalability
- Identification and implementation of optimization opportunities
## Future Considerations
### When to Revisit This Decision
- Scale Changes: If event volume exceeds 1M events/day
- Latency Requirements: If real-time analytics becomes critical
- Cost Changes: If Azure pricing changes significantly
- Technology Evolution: If new cost-effective alternatives emerge
- Business Changes: If reporting requirements change dramatically
### Potential Improvements
- Real-time Layer: Add Azure Stream Analytics for critical metrics
- Machine Learning: Implement predictive analytics with Azure ML
- Data Mesh: Evolve to domain-owned data products
- Multi-cloud: Consider hybrid/multi-cloud for vendor diversification
## Related Decisions
- ADR-0001: Record Architecture Decisions
- Matomo Integration Architecture
- Business Events Documentation
Decision Made By: Architecture Team
Date: 2024-01-15
Review Date: 2024-07-15
Status: Accepted and In Implementation