Azure Data Lake Integration

Overview

The Azure Data Lake integration uses Azure's data platform to build a scalable, cost-effective solution for storing and analyzing large volumes of Matomo analytics data. It is a good fit for organizations that need to combine Matomo data with other data sources and perform complex analytics.

Architecture

graph TB
    subgraph "Matomo Cloud"
        MA[Matomo Analytics]
        MAPI[Reporting API]
    end

    subgraph "Data Extraction Layer"
        AF[Azure Function<br/>Timer Triggered]
        AKV[Azure Key Vault<br/>Secrets]
    end

    subgraph "Storage Layer"
        subgraph "Azure Data Lake Gen2"
            BR[Bronze<br/>Raw JSON]
            SI[Silver<br/>Cleaned Parquet]
            GO[Gold<br/>Aggregated Data]
        end
    end

    subgraph "Processing Layer"
        SYN[Azure Synapse<br/>Analytics]
        SP[Spark Pools<br/>Optional]
        SQL[Serverless SQL]
    end

    subgraph "Serving Layer"
        PBI[Power BI Service]
        DASH[Dashboards]
    end

    MA --> MAPI
    MAPI -->|HTTPS/JSON| AF
    AKV -->|Credentials| AF
    AF -->|Raw Data| BR
    BR -->|Transform| SI
    SI -->|Aggregate| GO
    BR --> SQL
    SI --> SQL
    GO --> SQL
    SQL --> PBI
    SP --> SI
    SP --> GO
    PBI --> DASH

    style MA fill:#f9f,stroke:#333,stroke-width:2px
    style DASH fill:#9f9,stroke:#333,stroke-width:2px
    style BR fill:#ffd,stroke:#333,stroke-width:2px
    style SI fill:#dfd,stroke:#333,stroke-width:2px
    style GO fill:#dff,stroke:#333,stroke-width:2px

Medallion Architecture

graph LR
    subgraph "Data Lake Zones"
        subgraph "Bronze Layer"
            B1[Raw JSON Files]
            B2[Original Format]
            B3[Immutable]
        end

        subgraph "Silver Layer"
            S1[Cleaned Data]
            S2[Parquet Format]
            S3[Deduplicated]
            S4[Validated]
        end

        subgraph "Gold Layer"
            G1[Business Ready]
            G2[Aggregated]
            G3[Optimized]
            G4[Star Schema]
        end
    end

    B1 --> S1
    S1 --> G1

    style B1 fill:#ffd,stroke:#333,stroke-width:2px
    style S1 fill:#dfd,stroke:#333,stroke-width:2px
    style G1 fill:#dff,stroke:#333,stroke-width:2px
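
The zone boundaries map directly onto containers and folder paths in the Data Lake. A minimal sketch of the path conventions assumed throughout this guide (account and folder names follow the examples below; adjust to your environment):

# Illustrative zone layout for the medallion architecture.
ACCOUNT = "stmatomoanalytics"

ZONES = {
    # Bronze: immutable raw JSON exactly as returned by the Matomo Reporting API
    "bronze": f"abfss://bronze@{ACCOUNT}.dfs.core.windows.net/matomo/<report_method>/<yyyy>/<MM>/<dd>/",
    # Silver: cleaned, deduplicated, validated data in Parquet/Delta
    "silver": f"abfss://silver@{ACCOUNT}.dfs.core.windows.net/matomo/visits/",
    # Gold: aggregated, business-ready tables optimized for reporting
    "gold": f"abfss://gold@{ACCOUNT}.dfs.core.windows.net/matomo/weekly_summary/",
}

for zone, path in ZONES.items():
    print(f"{zone:>6} -> {path}")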

Implementation Steps

Step 1: Set Up Azure Resources

# Variables
RESOURCE_GROUP="rg-matomo-analytics"
LOCATION="westeurope"
STORAGE_ACCOUNT="stmatomoanalytics"
SYNAPSE_WORKSPACE="syn-matomo-analytics"
FUNCTION_APP="func-matomo-extractor"
KEY_VAULT="kv-matomo-secrets"

# Create Resource Group
az group create --name $RESOURCE_GROUP --location $LOCATION

# Create Storage Account with Data Lake Gen2
az storage account create \
    --name $STORAGE_ACCOUNT \
    --resource-group $RESOURCE_GROUP \
    --location $LOCATION \
    --sku Standard_LRS \
    --kind StorageV2 \
    --enable-hierarchical-namespace true

# Create containers for medallion architecture
az storage container create \
    --name bronze \
    --account-name $STORAGE_ACCOUNT

az storage container create \
    --name silver \
    --account-name $STORAGE_ACCOUNT

az storage container create \
    --name gold \
    --account-name $STORAGE_ACCOUNT

# Create Synapse Workspace
az synapse workspace create \
    --name $SYNAPSE_WORKSPACE \
    --resource-group $RESOURCE_GROUP \
    --storage-account $STORAGE_ACCOUNT \
    --file-system workspace \
    --sql-admin-login-user sqladmin \
    --sql-admin-login-password "ComplexPassword123!"

# Create Key Vault
az keyvault create \
    --name $KEY_VAULT \
    --resource-group $RESOURCE_GROUP \
    --location $LOCATION

# Store Matomo API Token
az keyvault secret set \
    --vault-name $KEY_VAULT \
    --name "MatomoApiToken" \
    --value "your-matomo-api-token"

Step 2: Create Azure Function for Data Extraction

using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using Azure.Identity;
using Azure.Security.KeyVault.Secrets;
using Azure.Storage.Blobs;

public static class MatomoDataExtractor
{
    private static readonly HttpClient httpClient = new HttpClient();

    [FunctionName("ExtractMatomoData")]
    public static async Task Run(
        [TimerTrigger("0 0 */1 * * *")] TimerInfo myTimer,
        ILogger log)
    {
        log.LogInformation($"Matomo data extraction started at: {DateTime.Now}");

        // Get configuration
        var keyVaultUrl = Environment.GetEnvironmentVariable("KeyVaultUrl");
        var storageConnectionString = Environment.GetEnvironmentVariable("StorageConnectionString");
        var matomoBaseUrl = Environment.GetEnvironmentVariable("MatomoBaseUrl");
        var siteId = Environment.GetEnvironmentVariable("MatomoSiteId");

        // Retrieve API token from Key Vault
        var client = new SecretClient(new Uri(keyVaultUrl), new DefaultAzureCredential());
        var apiToken = await client.GetSecretAsync("MatomoApiToken");

        // Define reports to extract
        var reports = new[]
        {
            "VisitsSummary.get",
            "Actions.getPageUrls",
            "UserCountry.getCountry",
            "DevicesDetection.getType",
            "Referrers.getReferrerType"
        };

        // Extract data for each report
        foreach (var report in reports)
        {
            await ExtractReport(
                matomoBaseUrl,
                siteId,
                apiToken.Value.Value,
                report,
                storageConnectionString,
                log
            );
        }

        log.LogInformation($"Matomo data extraction completed at: {DateTime.Now}");
    }

    private static async Task ExtractReport(
        string baseUrl,
        string siteId,
        string apiToken,
        string reportMethod,
        string storageConnectionString,
        ILogger log)
    {
        try
        {
            // Build API URL
            var apiUrl = $"{baseUrl}/index.php" +
                $"?module=API" +
                $"&method={reportMethod}" +
                $"&idSite={siteId}" +
                $"&period=day" +
                $"&date=yesterday" +
                $"&format=JSON" +
                $"&token_auth={apiToken}";

            // Fetch data from Matomo
            var response = await httpClient.GetStringAsync(apiUrl);

            // Save to Bronze layer in Data Lake
            var blobServiceClient = new BlobServiceClient(storageConnectionString);
            var containerClient = blobServiceClient.GetBlobContainerClient("bronze");

            var timestamp = DateTime.UtcNow.ToString("yyyyMMddHHmmss");
            var blobName = $"matomo/{reportMethod}/{DateTime.UtcNow:yyyy/MM/dd}/{timestamp}.json";

            var blobClient = containerClient.GetBlobClient(blobName);

            using (var stream = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(response)))
            {
                await blobClient.UploadAsync(stream, overwrite: true);
            }

            log.LogInformation($"Successfully extracted {reportMethod} to {blobName}");
        }
        catch (Exception ex)
        {
            log.LogError($"Error extracting {reportMethod}: {ex.Message}");
            throw;
        }
    }
}
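
The Function reads four app settings (KeyVaultUrl, StorageConnectionString, MatomoBaseUrl, MatomoSiteId) and builds a standard Matomo Reporting API request (module=API, method, idSite, period, date, format, token_auth). Before deploying, the same request can be sanity-checked locally; a minimal sketch using the requests library, with placeholder base URL, site ID and token:

# Local sanity check of the Reporting API call the Function issues.
# The base URL, site ID and token below are placeholders.
import requests

MATOMO_BASE_URL = "https://example.matomo.cloud"
SITE_ID = "1"
API_TOKEN = "your-matomo-api-token"

REPORTS = [
    "VisitsSummary.get",
    "Actions.getPageUrls",
    "UserCountry.getCountry",
    "DevicesDetection.getType",
    "Referrers.getReferrerType",
]

for method in REPORTS:
    params = {
        "module": "API",
        "method": method,
        "idSite": SITE_ID,
        "period": "day",
        "date": "yesterday",
        "format": "JSON",
        "token_auth": API_TOKEN,
    }
    response = requests.get(f"{MATOMO_BASE_URL}/index.php", params=params, timeout=30)
    response.raise_for_status()
    print(method, "->", len(response.text), "bytes of JSON")

If every report returns JSON rather than an error payload, the token and site ID are ready to be stored in Key Vault and configured as Function App settings.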

Step 3: Create Synapse Pipeline for Data Transformation

{
    "name": "MatomoDataTransformation",
    "properties": {
        "activities": [
            {
                "name": "BronzeToSilver",
                "type": "DataFlow",
                "policy": {
                    "timeout": "1.00:00:00",
                    "retry": 2
                },
                "typeProperties": {
                    "dataflow": {
                        "referenceName": "TransformMatomoData",
                        "type": "DataFlowReference"
                    }
                }
            },
            {
                "name": "SilverToGold",
                "type": "SparkJobDefinition",
                "dependsOn": [
                    {
                        "activity": "BronzeToSilver",
                        "dependencyConditions": ["Succeeded"]
                    }
                ],
                "policy": {
                    "timeout": "0.12:00:00",
                    "retry": 1
                },
                "typeProperties": {
                    "sparkJob": {
                        "referenceName": "AggregateMatomoData",
                        "type": "SparkJobDefinitionReference"
                    }
                }
            }
        ],
        "annotations": ["Matomo", "ETL", "DataLake"]
    }
}
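
The pipeline is normally run from a schedule trigger in Synapse, but it can also be started on demand through the Synapse data-plane REST API. A hedged sketch using azure-identity and requests (the caller needs a Synapse pipeline-run role on the workspace; the api-version shown is an assumption and may need adjusting):

# Start the MatomoDataTransformation pipeline on demand.
import requests
from azure.identity import DefaultAzureCredential

WORKSPACE = "syn-matomo-analytics"
PIPELINE = "MatomoDataTransformation"

# Token for the Synapse data plane.
token = DefaultAzureCredential().get_token("https://dev.azuresynapse.net/.default")

response = requests.post(
    f"https://{WORKSPACE}.dev.azuresynapse.net/pipelines/{PIPELINE}/createRun",
    params={"api-version": "2020-12-01"},  # assumed api-version
    headers={"Authorization": f"Bearer {token.token}"},
    timeout=30,
)
response.raise_for_status()
print("Pipeline run id:", response.json().get("runId"))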

Step 4: Create Serverless SQL Views

-- Serverless SQL objects cannot be created in master: use a dedicated database
CREATE DATABASE MatomoAnalytics;
GO
USE MatomoAnalytics;
GO

-- Credential used to access the Data Lake with the workspace managed identity
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong-password>';
CREATE DATABASE SCOPED CREDENTIAL ManagedIdentityCredential
WITH IDENTITY = 'Managed Identity';
GO

-- Create external data source for Data Lake
CREATE EXTERNAL DATA SOURCE MatomoDataLake
WITH (
    LOCATION = 'https://stmatomoanalytics.dfs.core.windows.net',
    CREDENTIAL = ManagedIdentityCredential
);

-- Create external file format for Parquet
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

-- Create schema for Matomo data
CREATE SCHEMA matomo;
GO

-- Create view for visits data (Silver layer)
CREATE VIEW matomo.visits AS
SELECT
    date,
    nb_visits,
    nb_uniq_visitors,
    nb_pageviews,
    avg_time_on_site,
    bounce_rate,
    nb_conversions,
    conversion_rate
FROM
    OPENROWSET(
        BULK 'silver/matomo/visits/*/*.parquet',
        DATA_SOURCE = 'MatomoDataLake',
        FORMAT = 'PARQUET'
    ) AS [result];

-- Create view for page analytics (Gold layer)
CREATE VIEW matomo.page_analytics AS
SELECT
    date,
    page_url,
    page_title,
    entry_nb_visits,
    entry_bounce_count,
    exit_nb_visits,
    avg_time_on_page,
    bounce_rate,
    exit_rate
FROM
    OPENROWSET(
        BULK 'gold/matomo/page_analytics/*/*.parquet',
        DATA_SOURCE = 'MatomoDataLake',
        FORMAT = 'PARQUET'
    ) AS [result];

-- Materialized view for dashboard performance
-- (requires a dedicated SQL pool; serverless SQL does not support materialized views)
CREATE MATERIALIZED VIEW matomo.daily_summary
WITH (DISTRIBUTION = HASH(date))
AS
SELECT
    date,
    SUM(nb_visits) as total_visits,
    SUM(nb_uniq_visitors) as unique_visitors,
    SUM(nb_pageviews) as pageviews,
    AVG(avg_time_on_site) as avg_session_duration,
    AVG(bounce_rate) as avg_bounce_rate
FROM matomo.visits
GROUP BY date;
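
Anything that speaks T-SQL can use these views, not only Power BI. A sketch querying the serverless endpoint from Python with pyodbc (assumes ODBC Driver 18 for SQL Server is installed and your Azure AD account has access to the MatomoAnalytics database; driver name and authentication mode are assumptions about your setup):

# Query the serverless SQL views directly, e.g. for ad-hoc validation.
import pyodbc

conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=syn-matomo-analytics-ondemand.sql.azuresynapse.net;"
    "Database=MatomoAnalytics;"
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)

with pyodbc.connect(conn_str) as conn:
    cursor = conn.cursor()
    cursor.execute(
        "SELECT TOP 7 date, nb_visits, nb_uniq_visitors, bounce_rate "
        "FROM matomo.visits ORDER BY date DESC"
    )
    for row in cursor.fetchall():
        print(row.date, row.nb_visits, row.nb_uniq_visitors, row.bounce_rate)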

Step 5: Configure Power BI Connection

graph TD
    subgraph "Power BI Configuration"
        DS[Data Source Settings]
        CRED[Azure AD Authentication]
        DQ[DirectQuery Mode]
        IMP[Import Mode]
        COMP[Composite Model]
    end

    subgraph "Connection Types"
        SYN1[Synapse Serverless<br/>SQL Endpoint]
        SYN2[Synapse Dedicated<br/>SQL Pool]
        LAKE[Direct Lake<br/>Mode]
    end

    DS --> CRED
    CRED --> DQ
    CRED --> IMP
    DQ --> SYN1
    IMP --> SYN2
    COMP --> LAKE

    style DS fill:#ffd,stroke:#333,stroke-width:2px
    style LAKE fill:#dfd,stroke:#333,stroke-width:2px

Power BI Connection String:

Data Source=syn-matomo-analytics-ondemand.sql.azuresynapse.net;
Initial Catalog=MatomoAnalytics;
Authentication=Azure Active Directory;

Data Processing Patterns

Incremental Processing

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from delta.tables import DeltaTable
from datetime import datetime

spark = SparkSession.builder \
    .appName("MatomoIncrementalProcessing") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Zone paths (the extraction Function writes Bronze files under matomo/<report>/yyyy/MM/dd/)
bronze_path = "abfss://bronze@stmatomoanalytics.dfs.core.windows.net/matomo/VisitsSummary.get/"
silver_path = "abfss://silver@stmatomoanalytics.dfs.core.windows.net/matomo/visits/"

# Read the Bronze files landed today by the extraction Function
extraction_date = datetime.utcnow().strftime("%Y/%m/%d")
new_data = spark.read.json(f"{bronze_path}{extraction_date}/*.json")

# Transform data
transformed_data = new_data \
    .withColumn("processed_timestamp", current_timestamp()) \
    .withColumn("date", to_date(col("date"))) \
    .select(
        "date",
        "nb_visits",
        "nb_uniq_visitors",
        "nb_pageviews",
        "avg_time_on_site",
        "bounce_rate",
        "processed_timestamp"
    )

# Merge into Silver layer (Delta format)
if DeltaTable.isDeltaTable(spark, silver_path):
    silver_table = DeltaTable.forPath(spark, silver_path)

    silver_table.alias("existing") \
        .merge(
            transformed_data.alias("new"),
            "existing.date = new.date"
        ) \
        .whenMatchedUpdateAll() \
        .whenNotMatchedInsertAll() \
        .execute()
else:
    transformed_data.write \
        .format("delta") \
        .mode("overwrite") \
        .save(silver_path)

# Create Gold layer aggregations
gold_path = "abfss://gold@stmatomoanalytics.dfs.core.windows.net/matomo/weekly_summary/"

weekly_summary = spark.read.format("delta").load(silver_path) \
    .groupBy(window("date", "7 days")) \
    .agg(
        sum("nb_visits").alias("total_visits"),
        sum("nb_uniq_visitors").alias("total_unique_visitors"),
        sum("nb_pageviews").alias("total_pageviews"),
        avg("avg_time_on_site").alias("avg_time_on_site"),
        avg("bounce_rate").alias("avg_bounce_rate")
    ) \
    .select(
        col("window.start").alias("week_start"),
        col("window.end").alias("week_end"),
        "*"
    ) \
    .drop("window")

weekly_summary.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save(gold_path)
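
Because the Silver and Gold tables are Delta tables, a little housekeeping keeps the merge pattern fast. A short sketch using the DeltaTable API (reuses the spark session and silver_path from the job above; retention values are illustrative):

# Routine Delta maintenance for the Silver table.
from delta.tables import DeltaTable

silver_table = DeltaTable.forPath(spark, silver_path)

# Remove data files no longer referenced by the Delta log.
# Delta enforces a minimum retention of 168 hours unless the check is disabled.
silver_table.vacuum(retentionHours=168)

# Inspect recent operations (merges, overwrites) for auditing and debugging.
silver_table.history(10).select("version", "timestamp", "operation").show(truncate=False)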

Performance Optimization

Partitioning Strategy

-- Serverless SQL prunes partitions through the folder layout plus filepath(),
-- not a PARTITION clause; expose year/month from the path instead
CREATE VIEW matomo.visits_partitioned AS
SELECT
    result.*,
    CAST(result.filepath(1) AS INT) AS [year],
    CAST(result.filepath(2) AS INT) AS [month]
FROM
    OPENROWSET(
        BULK 'silver/matomo/visits/year=*/month=*/*.parquet',
        DATA_SOURCE = 'MatomoDataLake',
        FORMAT = 'PARQUET'
    ) AS [result];

-- Queries filtering on [year]/[month] only read the matching folders, e.g.
-- SELECT SUM(nb_visits) FROM matomo.visits_partitioned WHERE [year] = 2024 AND [month] = 5;
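
For the view above to prune anything, the Silver files must be laid out in year=/month= folders. If the Silver layer is exposed as plain Parquet for serverless SQL (as the views in Step 4 assume), the corresponding Spark write might look like this sketch, reusing transformed_data from the incremental-processing example (path and write mode are illustrative):

# Write Silver visits data with year=/month= folders so serverless SQL can
# prune partitions via filepath(). Assumes `transformed_data` from the
# incremental-processing example above.
from pyspark.sql.functions import year, month

(
    transformed_data
    .withColumn("year", year("date"))
    .withColumn("month", month("date"))
    .write
    .format("parquet")
    .mode("append")
    .partitionBy("year", "month")
    .save("abfss://silver@stmatomoanalytics.dfs.core.windows.net/matomo/visits/")
)

Serverless SQL can also read Delta directly (FORMAT = 'DELTA' in OPENROWSET), in which case the Delta table's own partition columns serve the same purpose.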

Indexing and Statistics

-- Create statistics for query optimization
CREATE STATISTICS stat_visits_date 
ON matomo.visits(date);

CREATE STATISTICS stat_visits_metrics 
ON matomo.visits(nb_visits, nb_pageviews);

-- Update statistics regularly
UPDATE STATISTICS matomo.visits;

Cost Optimization

Storage Tiers

graph LR
    subgraph "Lifecycle Management"
        HOT[Hot Tier<br/>0-30 days<br/>$0.0184/GB]
        COOL[Cool Tier<br/>30-90 days<br/>$0.01/GB]
        ARCH[Archive Tier<br/>90+ days<br/>$0.00099/GB]
    end

    HOT -->|After 30 days| COOL
    COOL -->|After 90 days| ARCH

    style HOT fill:#fdd,stroke:#333,stroke-width:2px
    style COOL fill:#ffd,stroke:#333,stroke-width:2px
    style ARCH fill:#dfd,stroke:#333,stroke-width:2px

Azure CLI for lifecycle policy:

az storage account management-policy create \
    --account-name $STORAGE_ACCOUNT \
    --resource-group $RESOURCE_GROUP \
    --policy @lifecycle-policy.json

lifecycle-policy.json:

{
    "rules": [
        {
            "name": "MatomoDataLifecycle",
            "enabled": true,
            "type": "Lifecycle",
            "definition": {
                "filters": {
                    "blobTypes": ["blockBlob"],
                    "prefixMatch": ["bronze/matomo"]
                },
                "actions": {
                    "baseBlob": {
                        "tierToCool": {
                            "daysAfterModificationGreaterThan": 30
                        },
                        "tierToArchive": {
                            "daysAfterModificationGreaterThan": 90
                        },
                        "delete": {
                            "daysAfterModificationGreaterThan": 365
                        }
                    }
                }
            }
        }
    ]
}

Monitoring and Alerts

Azure Monitor Queries

// Function execution monitoring
AzureFunctionAppLogs
| where FunctionName == "ExtractMatomoData"
| summarize 
    SuccessCount = countif(Level == "Information"),
    ErrorCount = countif(Level == "Error"),
    AvgDuration = avg(DurationMs)
    by bin(TimeGenerated, 1h)
| render timechart

// Data Lake storage metrics
AzureMetrics
| where ResourceProvider == "MICROSOFT.STORAGE"
| where MetricName == "BlobCount" or MetricName == "BlobCapacity"
| summarize avg(Total) by MetricName, bin(TimeGenerated, 1h)
| render timechart

// Synapse query performance
SynapseRbacOperations
| where OperationName contains "SELECT"
| summarize 
    QueryCount = count(),
    AvgDuration = avg(DurationMs),
    P95Duration = percentile(DurationMs, 95)
    by bin(TimeGenerated, 1h)
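
These queries can also be run programmatically, for example to feed a custom health check. A sketch with the azure-monitor-query package (the workspace ID is a placeholder, and the exact log table name depends on which diagnostic settings you route to Log Analytics):

# Run the Function-monitoring query against Log Analytics from Python.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

QUERY = """
AzureFunctionAppLogs
| where FunctionName == "ExtractMatomoData"
| summarize
    SuccessCount = countif(Level == "Information"),
    ErrorCount = countif(Level == "Error")
    by bin(TimeGenerated, 1h)
"""

client = LogsQueryClient(DefaultAzureCredential())
result = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # placeholder
    query=QUERY,
    timespan=timedelta(days=1),
)

for table in result.tables:
    for row in table.rows:
        print(list(row))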

Security Configuration

Managed Identity Setup

# Enable managed identity for Function App
az functionapp identity assign \
    --name $FUNCTION_APP \
    --resource-group $RESOURCE_GROUP

# Grant access to Key Vault
az keyvault set-policy \
    --name $KEY_VAULT \
    --object-id <managed-identity-object-id> \
    --secret-permissions get list

# Grant access to Data Lake
az role assignment create \
    --assignee <managed-identity-object-id> \
    --role "Storage Blob Data Contributor" \
    --scope /subscriptions/<subscription-id>/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT
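
The same DefaultAzureCredential chain the Function uses can verify these grants end to end. A short sketch (run it inside the Function App or locally after az login; resource names reuse the variables from Step 1):

# Confirm that the identity can read the secret and reach the bronze container.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from azure.storage.blob import BlobServiceClient

credential = DefaultAzureCredential()

# Requires the get/list secret permissions granted above.
secret = SecretClient(
    vault_url="https://kv-matomo-secrets.vault.azure.net",
    credential=credential,
).get_secret("MatomoApiToken")

# Requires the Storage Blob Data Contributor role granted above.
bronze = BlobServiceClient(
    account_url="https://stmatomoanalytics.blob.core.windows.net",
    credential=credential,
).get_container_client("bronze")

print("Secret retrieved:", secret.name, "| bronze container exists:", bronze.exists())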

Cost Analysis

Monthly Cost Breakdown

| Component | Configuration | Est. Monthly Cost |
| --- | --- | --- |
| Azure Function | Consumption Plan, 720 executions | $5 |
| Data Lake Storage | 500 GB Hot, 1 TB Cool, 2 TB Archive | $45 |
| Synapse Serverless SQL | 1 TB queries/month | $5 |
| Synapse Spark Pool | 4 nodes, 2 hours/day | $150 |
| Power BI Pro | 10 users | $100 |
| Key Vault | Standard tier | $5 |
| Total | | ~$310/month |

Cost Optimization Strategies

  1. Use Serverless SQL instead of dedicated pools when possible
  2. Implement data lifecycle policies to move old data to cheaper tiers
  3. Optimize Spark jobs to reduce processing time
  4. Use Delta Lake format for better compression (3-8x)
  5. Schedule processing during off-peak hours
  6. Monitor and optimize query patterns

Disaster Recovery

Backup Strategy

# Enable soft delete for Data Lake
az storage account blob-service-properties update \
    --account-name $STORAGE_ACCOUNT \
    --resource-group $RESOURCE_GROUP \
    --enable-delete-retention true \
    --delete-retention-days 30

# Configure geo-redundant storage
az storage account update \
    --name $STORAGE_ACCOUNT \
    --resource-group $RESOURCE_GROUP \
    --sku Standard_GRS

Next Steps

  1. Implement the Azure Function for data extraction
  2. Set up Synapse workspace and pipelines
  3. Configure Power BI datasets and reports
  4. Establish monitoring and alerting
  5. Consider Real-time Streaming for live data requirements