ADR-0002: Event-Driven Metrics Platform Architecture

Status

Drafted - 2025-08-07

Context

The AgentHub platform consists of multiple distributed services generating business events that need to be captured, processed, and analyzed for operational insights and business intelligence. Currently, there is no unified approach to metrics collection, leading to:

  • Inconsistent Analytics: Different services use different logging and metrics approaches
  • Limited Visibility: No centralized view of business operations across domains
  • Schema Coupling: Reports directly access databases, limiting schema evolution
  • Cost Concerns: Need for a cost-effective solution that can scale from 1K to 100K events/day
  • Multiple Data Sources: Need to combine business events with frontend analytics (Matomo) and error tracking (Sentry)

Decision

We will implement an Event-Driven Metrics Platform using Azure cloud services with the following architecture:

Core Architecture Decisions

  1. Event Hub for Ingestion: Use Azure Event Hubs with automatic capture to Azure Data Lake
  2. Data Lake for Storage: Use Azure Data Lake Gen2 with medallion architecture (Bronze/Silver/Gold)
  3. Synapse Serverless for Processing: Use Azure Synapse Serverless SQL for transformations
  4. Data Factory for Orchestration: Use Azure Data Factory for pipeline automation
  5. Power BI for Analytics: Use Power BI with Direct Lake mode for reporting
  6. Schema Contracts: Implement event schema contracts to decouple data consumers from producers

Technology Stack

| Layer | Technology | Justification |
|---|---|---|
| Ingestion | Azure Event Hubs (Standard) | Managed service, auto-capture, cost-effective at scale |
| Storage | Azure Data Lake Gen2 | 10x cheaper than SQL databases, schema flexibility |
| Processing | Synapse Serverless SQL | Pay-per-query, no idle costs, SQL compatibility |
| Orchestration | Azure Data Factory | Managed orchestration, visual pipelines, cost-effective |
| Analytics | Power BI Pro/Premium | Direct Lake mode, unified reporting, existing licenses |
| Format | Delta Lake / Parquet | ACID transactions, time travel, optimized for analytics |

Rationale

Why Event Hubs over Other Options?

Considered Alternatives:

  • Service Bus: More expensive, over-engineered for our use case
  • Kafka on VMs: Higher operational overhead, more expensive
  • Storage Queue: Limited throughput, no capture functionality

Decision: Event Hubs

  • Auto-capture eliminates custom ingestion code
  • Built-in partitioning for parallel processing
  • Cost-effective at our scale (1K-100K events/day)
  • Seamless integration with Azure Data Lake
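
For illustration, a minimal publishing sketch using the Azure Event Hubs Python SDK; the environment variable, hub name, and event payload below are assumptions for this ADR, not part of the actual service code:

```python
# Minimal producer sketch. Connection string variable, hub name, and payload
# are illustrative placeholders, not the platform's real configuration.
import json
import os
from datetime import datetime, timezone

from azure.eventhub import EventData, EventHubProducerClient


def publish_event(event_type: str, payload: dict) -> None:
    producer = EventHubProducerClient.from_connection_string(
        os.environ["EVENTHUB_CONNECTION_STRING"],  # assumed setting name
        eventhub_name="crm-events",                # assumed hub name
    )
    envelope = {
        "eventType": event_type,
        "schemaVersion": 1,
        "occurredAt": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    with producer:
        batch = producer.create_batch()
        batch.add(EventData(json.dumps(envelope)))
        producer.send_batch(batch)


publish_event("LeadCreated", {"leadId": "42", "source": "web"})
```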

Why Azure Data Lake Gen2 over SQL Database?

Considered Alternatives:

  • Azure SQL Database: 10x more expensive, schema rigidity
  • Cosmos DB: Expensive for analytics workloads, limited SQL support
  • Table Storage: No ACID transactions, limited query capabilities

Decision: Data Lake Gen2

  • Cost: $0.0208/GB vs $5-15/DTU for SQL Database
  • Schema Flexibility: Schema-on-read for evolving event structures
  • Scale: Petabyte scale without performance degradation
  • Analytics: Optimized for large-scale analytical queries
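
The medallion layers map onto folder conventions in the lake. The layout below is one assumed convention, shown only to make the Bronze/Silver/Gold split concrete; container and folder names are not mandated by this ADR:

```python
# Assumed Data Lake Gen2 folder conventions for the medallion layers.
# Container and folder names are illustrative placeholders.
LAKE = "abfss://metrics@<storage-account>.dfs.core.windows.net"

LAYERS = {
    "raw":    f"{LAKE}/raw/{{hub}}/",        # Avro files written by Event Hub Capture
    "bronze": f"{LAKE}/bronze/{{domain}}/",  # Parquet, 1:1 with raw events
    "silver": f"{LAKE}/silver/{{entity}}/",  # Delta tables, cleaned and conformed
    "gold":   f"{LAKE}/gold/{{metric}}/",    # business-level aggregates for Power BI
}


def bronze_path(domain: str, event_date: str) -> str:
    """Partition bronze data by event date so queries scan only the days they need."""
    return LAYERS["bronze"].format(domain=domain) + f"event_date={event_date}/"


print(bronze_path("crm", "2025-08-07"))
```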

Why Synapse Serverless over Dedicated Pools?

Considered Alternatives:

  • Synapse Dedicated Pools: $1,500+/month minimum cost
  • Azure SQL Database: Fixed costs, limited analytics capabilities
  • Databricks: More expensive, over-engineered for our use case

Decision: Synapse Serverless

  • Pay-per-Query: Only pay for data scanned ($5/TB)
  • No Minimum Cost: Perfect for variable workloads
  • SQL Compatibility: Familiar query language for the team
  • Direct Lake: Native integration with Power BI
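
To make the pay-per-query model concrete: a serverless query reads Parquet directly from the lake via OPENROWSET and is billed only for the data it scans. The sketch below is illustrative; the workspace endpoint, database, storage path, and authentication mode are placeholders that would differ in the real environment:

```python
# Illustrative Synapse Serverless query over the silver layer (placeholder names).
import pyodbc

QUERY = """
SELECT eventType, COUNT(*) AS event_count
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/metrics/silver/events/*.parquet',
    FORMAT = 'PARQUET'
) AS events
GROUP BY eventType;
"""

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>-ondemand.sql.azuresynapse.net;"  # serverless SQL endpoint
    "Database=metrics;"                                   # assumed database name
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)
for event_type, count in conn.execute(QUERY):
    print(event_type, count)
```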

Why Schema Contracts over Direct Database Access?

Problem with Direct Database Access:

  • Tight coupling between services and reporting
  • Schema changes break reports
  • Limited ability to evolve data models
  • Performance impact on operational databases

Benefits of Schema Contracts:

  • Decoupling: Services can evolve independently
  • Versioning: Support multiple schema versions
  • Performance: Analytics don't impact operational systems
  • Governance: Clear data contracts and ownership
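
One lightweight way to express such a contract is a versioned JSON Schema that producers validate against before publishing, and that consumers treat as the interface. The event type and fields below are hypothetical, sketched with the jsonschema package:

```python
# Hypothetical v1 contract for a CRM "LeadCreated" event; fields are illustrative.
from jsonschema import ValidationError, validate

LEAD_CREATED_V1 = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "required": ["eventType", "schemaVersion", "occurredAt", "payload"],
    "properties": {
        "eventType": {"const": "LeadCreated"},
        "schemaVersion": {"const": 1},
        "occurredAt": {"type": "string", "format": "date-time"},
        "payload": {
            "type": "object",
            "required": ["leadId", "source"],
            "properties": {
                "leadId": {"type": "string"},
                "source": {"type": "string"},
            },
        },
    },
}


def is_valid(event: dict) -> bool:
    """Producers call this before publishing; invalid events never reach the hub."""
    try:
        validate(instance=event, schema=LEAD_CREATED_V1)
        return True
    except ValidationError:
        return False
```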

Why Batch over Real-Time Processing?

Considered Alternatives:

  • Stream Analytics: $100+/month per streaming unit
  • Event Hubs with Functions: Complex error handling, higher costs
  • Real-time processing: Higher complexity and costs

Decision: Batch Processing

  • Cost: 80% cheaper than streaming solutions
  • Simplicity: Easier to develop, test, and maintain
  • Business Need: 1-hour latency acceptable for business metrics
  • Reliability: Better error handling and retry mechanisms
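
The hourly batch step this implies can be sketched as: read the Avro files written by Event Hub Capture and land them as Parquet in the bronze layer. Paths below are placeholders, the Body and EnqueuedTimeUtc fields follow the documented Capture Avro schema, and in the real pipeline this transformation runs from Data Factory/Synapse rather than as a local script:

```python
# Raw -> bronze sketch: flatten Event Hub Capture Avro into Parquet.
# Paths are placeholders; authentication and error handling are omitted.
import glob
import json

import fastavro
import pandas as pd


def raw_to_bronze(raw_glob: str, bronze_path: str) -> None:
    rows = []
    for avro_file in glob.glob(raw_glob, recursive=True):
        with open(avro_file, "rb") as f:
            for record in fastavro.reader(f):
                event = json.loads(record["Body"])               # event payload bytes
                event["enqueuedAt"] = record["EnqueuedTimeUtc"]  # capture timestamp
                rows.append(event)
    if rows:
        pd.DataFrame(rows).to_parquet(bronze_path, index=False)


raw_to_bronze("raw/crm/**/*.avro", "bronze/crm/events.parquet")
```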

Architecture Diagrams

High-Level Architecture

graph TB
    subgraph "Event Sources"
        S1[CRM Service]
        S2[Matching Service]
        S3[Profile Service]
        FE[Angular Frontend]
        MA[Matomo Cloud]
        SE[Sentry]
    end

    subgraph "Azure Event Ingestion"
        EH1[Event Hub: CRM]
        EH2[Event Hub: Matching]
        EH3[Event Hub: Profile]
        EHC[Event Hub Capture]
    end

    subgraph "Azure Data Lake Gen2"
        RAW[Raw Layer<br/>Avro Files]
        BRONZE[Bronze Layer<br/>Parquet Files]
        SILVER[Silver Layer<br/>Delta Tables]
        GOLD[Gold Layer<br/>Business Metrics]
    end

    subgraph "Processing"
        ADF[Data Factory<br/>Orchestration]
        SYN[Synapse Serverless<br/>SQL Processing]
    end

    subgraph "Analytics"
        PBI[Power BI<br/>Dashboards]
        API1[Matomo API]
        API2[Sentry API]
    end

    S1 --> EH1
    S2 --> EH2
    S3 --> EH3

    EH1 --> EHC
    EH2 --> EHC
    EH3 --> EHC

    EHC --> RAW
    RAW --> BRONZE
    BRONZE --> SILVER
    SILVER --> GOLD

    ADF --> SYN
    SYN --> SILVER
    SYN --> GOLD

    GOLD --> PBI
    MA --> API1
    SE --> API2
    API1 --> PBI
    API2 --> PBI
    FE --> MA

Data Flow Architecture

sequenceDiagram
    participant Service
    participant EventHub
    participant Capture
    participant DataLake
    participant DataFactory
    participant Synapse
    participant PowerBI

    Service->>EventHub: Publish Business Event
    EventHub->>Capture: Auto-capture (5min)
    Capture->>DataLake: Store Raw Events

    Note over DataFactory: Hourly Schedule
    DataFactory->>DataLake: Read Raw Events
    DataFactory->>Synapse: Execute Transformations
    Synapse->>DataLake: Write Processed Data

    PowerBI->>DataLake: Query Gold Layer
    PowerBI->>PowerBI: Refresh Dashboard

Consequences

Positive Consequences

  1. Cost Efficiency
     • Target: <$400/month for complete solution
     • 83% cost reduction compared to real-time alternatives
     • Pay-per-use model scales with business growth

  2. Schema Independence
     • Services can evolve schemas without breaking reports
     • Centralized schema governance and versioning
     • Reduced coupling between operational and analytical systems

  3. Unified Analytics
     • Single source of truth for business metrics
     • Combined view of business events, web analytics, and errors
     • Consistent reporting across all domains

  4. Scalability
     • Handle 100x growth (1K to 100K events/day) without redesign
     • Auto-scaling Event Hubs and serverless processing
     • Petabyte-scale storage capability

  5. Developer Productivity
     • Familiar SQL interface for transformations
     • Visual pipeline development with Data Factory
     • Self-service analytics with Power BI

Negative Consequences

  1. Latency
     • 1-hour latency for business metrics (acceptable for current needs)
     • Not suitable for real-time alerting or operational dashboards
     • May require separate solution for real-time use cases

  2. Complexity
     • Multiple Azure services to manage and monitor
     • Learning curve for team on new technologies
     • More complex than direct database reporting

  3. Vendor Lock-in
     • Heavy dependency on Azure services
     • Migration to other platforms would be significant effort
     • Risk of Azure pricing changes affecting costs

  4. Query Costs
     • Synapse charges per TB scanned
     • Poorly optimized queries can increase costs
     • Need for query optimization and monitoring

Monitoring and Success Metrics

Technical KPIs

  • Availability: >99.5% system uptime
  • Data Freshness: <2 hours from event to dashboard
  • Processing Success: >99% pipeline success rate
  • Query Performance: <5 seconds for dashboard refresh

Business KPIs

  • Cost Efficiency: <$400/month total infrastructure cost
  • User Adoption: >80% of stakeholders using dashboards weekly
  • Decision Speed: 50% reduction in time to get business insights
  • Data Quality: <1% of events failing validation

Cost Tracking

  • Monthly cost monitoring with budget alerts
  • Cost per event tracking to ensure scalability
  • Identification and implementation of optimization opportunities
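
As a back-of-the-envelope check, the budget and volume targets already stated in this ADR imply a very small unit cost at the upper end of the range:

```python
# Cost per event at the target ceiling, using figures stated in this ADR.
monthly_budget_usd = 400        # <$400/month target
events_per_day = 100_000        # upper end of expected volume
events_per_month = events_per_day * 30

print(f"${monthly_budget_usd / events_per_month:.5f} per event")  # ≈ $0.00013
```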

Future Considerations

When to Revisit This Decision

  1. Scale Changes: If event volume exceeds 1M events/day
  2. Latency Requirements: If real-time analytics becomes critical
  3. Cost Changes: If Azure pricing changes significantly
  4. Technology Evolution: If new cost-effective alternatives emerge
  5. Business Changes: If reporting requirements change dramatically

Potential Improvements

  1. Real-time Layer: Add Azure Stream Analytics for critical metrics
  2. Machine Learning: Implement predictive analytics with Azure ML
  3. Data Mesh: Evolve to domain-owned data products
  4. Multi-cloud: Consider hybrid/multi-cloud for vendor diversification

Decision Made By: Architecture Team
Date: 2024-01-15
Review Date: 2024-07-15
Status: Accepted and In Implementation