ADR-0002: Event-Driven Metrics Platform Architecture

Status

Drafted - 2025-08-07

Context

The AgentHub platform consists of multiple distributed services generating business events that need to be captured, processed, and analyzed for operational insights and business intelligence. Currently, there is no unified approach to metrics collection, leading to:

  • Inconsistent Analytics: Different services use different logging and metrics approaches
  • Limited Visibility: No centralized view of business operations across domains
  • Schema Coupling: Reports directly access databases, limiting schema evolution
  • Cost Concerns: Need for a cost-effective solution that can scale from 1K to 100K events/day
  • Multiple Data Sources: Need to combine business events with frontend analytics (Matomo) and error tracking (Sentry)

Decision

We will implement an Event-Driven Metrics Platform using Azure cloud services with the following architecture:

Core Architecture Decisions

  1. Event Hub for Ingestion: Use Azure Event Hubs with automatic capture to Azure Data Lake
  2. Data Lake for Storage: Use Azure Data Lake Gen2 with medallion architecture (Bronze/Silver/Gold)
  3. Synapse Serverless for Processing: Use Azure Synapse Serverless SQL for transformations
  4. Data Factory for Orchestration: Use Azure Data Factory for pipeline automation
  5. Power BI for Analytics: Use Power BI with Direct Lake mode for reporting
  6. Schema Contracts: Implement event schema contracts to decouple data consumers from producers

Technology Stack

| Layer | Technology | Justification |
|---|---|---|
| Ingestion | Azure Event Hubs (Standard) | Managed service, auto-capture, cost-effective at scale |
| Storage | Azure Data Lake Gen2 | 10x cheaper than SQL databases, schema flexibility |
| Processing | Synapse Serverless SQL | Pay-per-query, no idle costs, SQL compatibility |
| Orchestration | Azure Data Factory | Managed orchestration, visual pipelines, cost-effective |
| Analytics | Power BI Pro/Premium | Direct Lake mode, unified reporting, existing licenses |
| Format | Delta Lake / Parquet | ACID transactions, time travel, optimized for analytics |

Rationale

Why Event Hubs over Other Options?

Considered Alternatives:

  • Service Bus: More expensive, over-engineered for our use case
  • Kafka on VMs: Higher operational overhead, more expensive
  • Storage Queue: Limited throughput, no capture functionality

Decision: Event Hubs

  • Auto-capture eliminates custom ingestion code
  • Built-in partitioning for parallel processing
  • Cost-effective at our scale (1K-100K events/day)
  • Seamless integration with Azure Data Lake
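
For illustration, a minimal publishing sketch using the Azure Event Hubs Python SDK; the environment variable, hub name, and event payload below are assumptions for this ADR, not part of the actual service code:

```python
# Minimal producer sketch. Connection string variable, hub name, and payload
# are illustrative placeholders, not the platform's real configuration.
import json
import os
from datetime import datetime, timezone

from azure.eventhub import EventData, EventHubProducerClient


def publish_event(event_type: str, payload: dict) -> None:
    producer = EventHubProducerClient.from_connection_string(
        os.environ["EVENTHUB_CONNECTION_STRING"],  # assumed setting name
        eventhub_name="crm-events",                # assumed hub name
    )
    envelope = {
        "eventType": event_type,
        "schemaVersion": 1,
        "occurredAt": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    with producer:
        batch = producer.create_batch()
        batch.add(EventData(json.dumps(envelope)))
        producer.send_batch(batch)


publish_event("LeadCreated", {"leadId": "42", "source": "web"})
```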

Why Azure Data Lake Gen2 over SQL Database?

Considered Alternatives:

  • Azure SQL Database: 10x more expensive, schema rigidity
  • Cosmos DB: Expensive for analytics workloads, limited SQL support
  • Table Storage: No ACID transactions, limited query capabilities

Decision: Data Lake Gen2

  • Cost: $0.0208/GB vs $5-15/DTU for SQL Database
  • Schema Flexibility: Schema-on-read for evolving event structures
  • Scale: Petabyte scale without performance degradation
  • Analytics: Optimized for large-scale analytical queries
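
The medallion layers map onto folder conventions in the lake. The layout below is one assumed convention, shown only to make the Bronze/Silver/Gold split concrete; container and folder names are not mandated by this ADR:

```python
# Assumed Data Lake Gen2 folder conventions for the medallion layers.
# Container and folder names are illustrative placeholders.
LAKE = "abfss://metrics@<storage-account>.dfs.core.windows.net"

LAYERS = {
    "raw":    f"{LAKE}/raw/{{hub}}/",        # Avro files written by Event Hub Capture
    "bronze": f"{LAKE}/bronze/{{domain}}/",  # Parquet, 1:1 with raw events
    "silver": f"{LAKE}/silver/{{entity}}/",  # Delta tables, cleaned and conformed
    "gold":   f"{LAKE}/gold/{{metric}}/",    # business-level aggregates for Power BI
}


def bronze_path(domain: str, event_date: str) -> str:
    """Partition bronze data by event date so queries scan only the days they need."""
    return LAYERS["bronze"].format(domain=domain) + f"event_date={event_date}/"


print(bronze_path("crm", "2025-08-07"))
```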

Why Synapse Serverless over Dedicated Pools?

Considered Alternatives:

  • Synapse Dedicated Pools: $1,500+/month minimum cost
  • Azure SQL Database: Fixed costs, limited analytics capabilities
  • Databricks: More expensive, over-engineered for our use case

Decision: Synapse Serverless

  • Pay-per-Query: Only pay for data scanned ($5/TB)
  • No Minimum Cost: Perfect for variable workloads
  • SQL Compatibility: Familiar query language for the team
  • Direct Lake: Native integration with Power BI
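
To make the pay-per-query model concrete: a serverless query reads Parquet directly from the lake via OPENROWSET and is billed only for the data it scans. The sketch below is illustrative; the workspace endpoint, database, storage path, and authentication mode are placeholders that would differ in the real environment:

```python
# Illustrative Synapse Serverless query over the silver layer (placeholder names).
import pyodbc

QUERY = """
SELECT eventType, COUNT(*) AS event_count
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/metrics/silver/events/*.parquet',
    FORMAT = 'PARQUET'
) AS events
GROUP BY eventType;
"""

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>-ondemand.sql.azuresynapse.net;"  # serverless SQL endpoint
    "Database=metrics;"                                   # assumed database name
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)
for event_type, count in conn.execute(QUERY):
    print(event_type, count)
```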

Why Schema Contracts over Direct Database Access?

Problem with Direct Database Access:

  • Tight coupling between services and reporting
  • Schema changes break reports
  • Limited ability to evolve data models
  • Performance impact on operational databases

Benefits of Schema Contracts:

  • Decoupling: Services can evolve independently
  • Versioning: Support multiple schema versions
  • Performance: Analytics don't impact operational systems
  • Governance: Clear data contracts and ownership
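
One lightweight way to express such a contract is a versioned JSON Schema that producers validate against before publishing, and that consumers treat as the interface. The event type and fields below are hypothetical, sketched with the jsonschema package:

```python
# Hypothetical v1 contract for a CRM "LeadCreated" event; fields are illustrative.
from jsonschema import ValidationError, validate

LEAD_CREATED_V1 = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "required": ["eventType", "schemaVersion", "occurredAt", "payload"],
    "properties": {
        "eventType": {"const": "LeadCreated"},
        "schemaVersion": {"const": 1},
        "occurredAt": {"type": "string", "format": "date-time"},
        "payload": {
            "type": "object",
            "required": ["leadId", "source"],
            "properties": {
                "leadId": {"type": "string"},
                "source": {"type": "string"},
            },
        },
    },
}


def is_valid(event: dict) -> bool:
    """Producers call this before publishing; invalid events never reach the hub."""
    try:
        validate(instance=event, schema=LEAD_CREATED_V1)
        return True
    except ValidationError:
        return False
```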

Why Batch over Real-Time Processing?

Considered Alternatives:

  • Stream Analytics: $100+/month per streaming unit
  • Event Hubs with Functions: Complex error handling, higher costs
  • Real-time processing: Higher complexity and costs

Decision: Batch Processing

  • Cost: 80% cheaper than streaming solutions
  • Simplicity: Easier to develop, test, and maintain
  • Business Need: 1-hour latency acceptable for business metrics
  • Reliability: Better error handling and retry mechanisms
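
The hourly batch step this implies can be sketched as: read the Avro files written by Event Hub Capture and land them as Parquet in the bronze layer. Paths below are placeholders, the Body and EnqueuedTimeUtc fields follow the documented Capture Avro schema, and in the real pipeline this transformation runs from Data Factory/Synapse rather than as a local script:

```python
# Raw -> bronze sketch: flatten Event Hub Capture Avro into Parquet.
# Paths are placeholders; authentication and error handling are omitted.
import glob
import json

import fastavro
import pandas as pd


def raw_to_bronze(raw_glob: str, bronze_path: str) -> None:
    rows = []
    for avro_file in glob.glob(raw_glob, recursive=True):
        with open(avro_file, "rb") as f:
            for record in fastavro.reader(f):
                event = json.loads(record["Body"])               # event payload bytes
                event["enqueuedAt"] = record["EnqueuedTimeUtc"]  # capture timestamp
                rows.append(event)
    if rows:
        pd.DataFrame(rows).to_parquet(bronze_path, index=False)


raw_to_bronze("raw/crm/**/*.avro", "bronze/crm/events.parquet")
```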

Architecture Diagrams

High-Level Architecture

graph TB
    subgraph "Event Sources"
        S1[CRM Service]
        S2[Matching Service]
        S3[Profile Service]
        FE[Angular Frontend]
        MA[Matomo Cloud]
        SE[Sentry]
    end

    subgraph "Azure Event Ingestion"
        EH1[Event Hub: CRM]
        EH2[Event Hub: Matching]
        EH3[Event Hub: Profile]
        EHC[Event Hub Capture]
    end

    subgraph "Azure Data Lake Gen2"
        RAW[Raw Layer<br/>Avro Files]
        BRONZE[Bronze Layer<br/>Parquet Files]
        SILVER[Silver Layer<br/>Delta Tables]
        GOLD[Gold Layer<br/>Business Metrics]
    end

    subgraph "Processing"
        ADF[Data Factory<br/>Orchestration]
        SYN[Synapse Serverless<br/>SQL Processing]
    end

    subgraph "Analytics"
        PBI[Power BI<br/>Dashboards]
        API1[Matomo API]
        API2[Sentry API]
    end

    S1 --> EH1
    S2 --> EH2
    S3 --> EH3

    EH1 --> EHC
    EH2 --> EHC
    EH3 --> EHC

    EHC --> RAW
    RAW --> BRONZE
    BRONZE --> SILVER
    SILVER --> GOLD

    ADF --> SYN
    SYN --> SILVER
    SYN --> GOLD

    GOLD --> PBI
    MA --> API1
    SE --> API2
    API1 --> PBI
    API2 --> PBI
    FE --> MA

Data Flow Architecture

sequenceDiagram
    participant Service
    participant EventHub
    participant Capture
    participant DataLake
    participant DataFactory
    participant Synapse
    participant PowerBI

    Service->>EventHub: Publish Business Event
    EventHub->>Capture: Auto-capture (5min)
    Capture->>DataLake: Store Raw Events

    Note over DataFactory: Hourly Schedule
    DataFactory->>DataLake: Read Raw Events
    DataFactory->>Synapse: Execute Transformations
    Synapse->>DataLake: Write Processed Data

    PowerBI->>DataLake: Query Gold Layer
    PowerBI->>PowerBI: Refresh Dashboard

Consequences

Positive Consequences

  1. Cost Efficiency
     • Target: <$400/month for complete solution
     • 83% cost reduction compared to real-time alternatives
     • Pay-per-use model scales with business growth

  2. Schema Independence
     • Services can evolve schemas without breaking reports
     • Centralized schema governance and versioning
     • Reduced coupling between operational and analytical systems

  3. Unified Analytics
     • Single source of truth for business metrics
     • Combined view of business events, web analytics, and errors
     • Consistent reporting across all domains

  4. Scalability
     • Handle 100x growth (1K to 100K events/day) without redesign
     • Auto-scaling Event Hubs and serverless processing
     • Petabyte-scale storage capability

  5. Developer Productivity
     • Familiar SQL interface for transformations
     • Visual pipeline development with Data Factory
     • Self-service analytics with Power BI

Negative Consequences

  1. Latency
     • 1-hour latency for business metrics (acceptable for current needs)
     • Not suitable for real-time alerting or operational dashboards
     • May require separate solution for real-time use cases

  2. Complexity
     • Multiple Azure services to manage and monitor
     • Learning curve for team on new technologies
     • More complex than direct database reporting

  3. Vendor Lock-in
     • Heavy dependency on Azure services
     • Migration to other platforms would be significant effort
     • Risk of Azure pricing changes affecting costs

  4. Query Costs
     • Synapse charges per TB scanned
     • Poorly optimized queries can increase costs
     • Need for query optimization and monitoring

Monitoring and Success Metrics

Technical KPIs

  • Availability: >99.5% system uptime
  • Data Freshness: <2 hours from event to dashboard
  • Processing Success: >99% pipeline success rate
  • Query Performance: <5 seconds for dashboard refresh

Business KPIs

  • Cost Efficiency: <$400/month total infrastructure cost
  • User Adoption: >80% of stakeholders using dashboards weekly
  • Decision Speed: 50% reduction in time to get business insights
  • Data Quality: <1% of events failing validation

Cost Tracking

  • Monthly cost monitoring with budget alerts
  • Cost per event tracking to ensure scalability
  • Identification and implementation of optimization opportunities
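
As a back-of-the-envelope check, the budget and volume targets already stated in this ADR imply a very small unit cost at the upper end of the range:

```python
# Cost per event at the target ceiling, using figures stated in this ADR.
monthly_budget_usd = 400        # <$400/month target
events_per_day = 100_000        # upper end of expected volume
events_per_month = events_per_day * 30

print(f"${monthly_budget_usd / events_per_month:.5f} per event")  # ≈ $0.00013
```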

Future Considerations

When to Revisit This Decision

  1. Scale Changes: If event volume exceeds 1M events/day
  2. Latency Requirements: If real-time analytics becomes critical
  3. Cost Changes: If Azure pricing changes significantly
  4. Technology Evolution: If new cost-effective alternatives emerge
  5. Business Changes: If reporting requirements change dramatically

Potential Improvements

  1. Real-time Layer: Add Azure Stream Analytics for critical metrics
  2. Machine Learning: Implement predictive analytics with Azure ML
  3. Data Mesh: Evolve to domain-owned data products
  4. Multi-cloud: Consider hybrid/multi-cloud for vendor diversification

Decision Made By: Architecture Team
Date: 2024-01-15
Review Date: 2024-07-15
Status: Accepted and In Implementation