Azure Data Lake Integration¶
Overview¶
The Azure Data Lake integration uses Azure's data platform to build a scalable, cost-effective solution for storing and analyzing large volumes of Matomo analytics data. It is well suited to organizations that need to combine Matomo data with other data sources and run complex analytics on the result.
Architecture¶
graph TB
subgraph "Matomo Cloud"
MA[Matomo Analytics]
MAPI[Reporting API]
end
subgraph "Data Extraction Layer"
AF[Azure Function<br/>Timer Triggered]
AKV[Azure Key Vault<br/>Secrets]
end
subgraph "Storage Layer"
subgraph "Azure Data Lake Gen2"
BR[Bronze<br/>Raw JSON]
SI[Silver<br/>Cleaned Parquet]
GO[Gold<br/>Aggregated Data]
end
end
subgraph "Processing Layer"
SYN[Azure Synapse<br/>Analytics]
SP[Spark Pools<br/>Optional]
SQL[Serverless SQL]
end
subgraph "Serving Layer"
PBI[Power BI Service]
DASH[Dashboards]
end
MA --> MAPI
MAPI -->|HTTPS/JSON| AF
AKV -->|Credentials| AF
AF -->|Raw Data| BR
BR -->|Transform| SI
SI -->|Aggregate| GO
BR --> SQL
SI --> SQL
GO --> SQL
SQL --> PBI
SP --> SI
SP --> GO
PBI --> DASH
style MA fill:#f9f,stroke:#333,stroke-width:2px
style DASH fill:#9f9,stroke:#333,stroke-width:2px
style BR fill:#ffd,stroke:#333,stroke-width:2px
style SI fill:#dfd,stroke:#333,stroke-width:2px
style GO fill:#dff,stroke:#333,stroke-width:2px
Medallion Architecture¶
graph LR
subgraph "Data Lake Zones"
subgraph "Bronze Layer"
B1[Raw JSON Files]
B2[Original Format]
B3[Immutable]
end
subgraph "Silver Layer"
S1[Cleaned Data]
S2[Parquet Format]
S3[Deduplicated]
S4[Validated]
end
subgraph "Gold Layer"
G1[Business Ready]
G2[Aggregated]
G3[Optimized]
G4[Star Schema]
end
end
B1 --> S1
S1 --> G1
style B1 fill:#ffd,stroke:#333,stroke-width:2px
style S1 fill:#dfd,stroke:#333,stroke-width:2px
style G1 fill:#dff,stroke:#333,stroke-width:2px
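The zone names map directly to the storage containers created in Step 1 and to the abfss:// paths used by the Spark examples further down. A minimal sketch of that path convention, using the example storage account name from this page (swap in your own):
# Path convention for the medallion zones (example storage account name from this page)
STORAGE_ACCOUNT = "stmatomoanalytics"
ZONE_PATHS = {
    # raw, immutable JSON exactly as returned by the Matomo Reporting API
    "bronze": f"abfss://bronze@{STORAGE_ACCOUNT}.dfs.core.windows.net/matomo/",
    # cleaned, deduplicated, validated data in Parquet/Delta format
    "silver": f"abfss://silver@{STORAGE_ACCOUNT}.dfs.core.windows.net/matomo/",
    # aggregated, business-ready tables (star schema)
    "gold": f"abfss://gold@{STORAGE_ACCOUNT}.dfs.core.windows.net/matomo/",
}
print(ZONE_PATHS["bronze"])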
Implementation Steps¶
Step 1: Set Up Azure Resources¶
# Variables
RESOURCE_GROUP="rg-matomo-analytics"
LOCATION="westeurope"
STORAGE_ACCOUNT="stmatomoanalytics"
SYNAPSE_WORKSPACE="syn-matomo-analytics"
FUNCTION_APP="func-matomo-extractor"
KEY_VAULT="kv-matomo-secrets"
# Create Resource Group
az group create --name $RESOURCE_GROUP --location $LOCATION
# Create Storage Account with Data Lake Gen2
az storage account create \
--name $STORAGE_ACCOUNT \
--resource-group $RESOURCE_GROUP \
--location $LOCATION \
--sku Standard_LRS \
--kind StorageV2 \
--enable-hierarchical-namespace true
# Create containers for medallion architecture
az storage container create \
--name bronze \
--account-name $STORAGE_ACCOUNT
az storage container create \
--name silver \
--account-name $STORAGE_ACCOUNT
az storage container create \
--name gold \
--account-name $STORAGE_ACCOUNT
# Create Synapse Workspace
az synapse workspace create \
--name $SYNAPSE_WORKSPACE \
--resource-group $RESOURCE_GROUP \
--location $LOCATION \
--storage-account $STORAGE_ACCOUNT \
--file-system workspace \
--sql-admin-login-user sqladmin \
--sql-admin-login-password "ComplexPassword123!"
# Create Key Vault
az keyvault create \
--name $KEY_VAULT \
--resource-group $RESOURCE_GROUP \
--location $LOCATION
# Store Matomo API Token
az keyvault secret set \
--vault-name $KEY_VAULT \
--name "MatomoApiToken" \
--value "your-matomo-api-token"
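Before wiring up the Function, it can help to verify the three containers exist and are reachable. A small sketch, assuming the azure-identity and azure-storage-blob packages are installed and the caller has a data-plane role (for example Storage Blob Data Reader) on the account:
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient
# Connect to the storage account with Azure AD credentials
service = BlobServiceClient(
    account_url="https://stmatomoanalytics.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)
# Check that the medallion containers from Step 1 are present
existing = {container.name for container in service.list_containers()}
for zone in ("bronze", "silver", "gold"):
    print(f"{zone}: {'ok' if zone in existing else 'MISSING'}")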
Step 2: Create Azure Function for Data Extraction¶
using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using Azure.Storage.Blobs;
using Azure.Security.KeyVault.Secrets;
using Azure.Identity;
using System.Net.Http;
public static class MatomoDataExtractor
{
private static readonly HttpClient httpClient = new HttpClient();
[FunctionName("ExtractMatomoData")]
public static async Task Run(
[TimerTrigger("0 0 */1 * * *")] TimerInfo myTimer,
ILogger log)
{
log.LogInformation($"Matomo data extraction started at: {DateTime.Now}");
// Get configuration
var keyVaultUrl = Environment.GetEnvironmentVariable("KeyVaultUrl");
var storageConnectionString = Environment.GetEnvironmentVariable("StorageConnectionString");
var matomoBaseUrl = Environment.GetEnvironmentVariable("MatomoBaseUrl");
var siteId = Environment.GetEnvironmentVariable("MatomoSiteId");
// Retrieve API token from Key Vault
var client = new SecretClient(new Uri(keyVaultUrl), new DefaultAzureCredential());
var apiToken = await client.GetSecretAsync("MatomoApiToken");
// Define reports to extract
var reports = new[]
{
"VisitsSummary.get",
"Actions.getPageUrls",
"UserCountry.getCountry",
"DevicesDetection.getType",
"Referrers.getReferrerType"
};
// Extract data for each report
foreach (var report in reports)
{
await ExtractReport(
matomoBaseUrl,
siteId,
apiToken.Value.Value,
report,
storageConnectionString,
log
);
}
log.LogInformation($"Matomo data extraction completed at: {DateTime.Now}");
}
private static async Task ExtractReport(
string baseUrl,
string siteId,
string apiToken,
string reportMethod,
string storageConnectionString,
ILogger log)
{
try
{
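// Note: Matomo recommends sending token_auth as a POST parameter in production
// so the token does not show up in server or proxy logs; GET is used here for brevity.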
// Build API URL
var apiUrl = $"{baseUrl}/index.php" +
$"?module=API" +
$"&method={reportMethod}" +
$"&idSite={siteId}" +
$"&period=day" +
$"&date=yesterday" +
$"&format=JSON" +
$"&token_auth={apiToken}";
// Fetch data from Matomo
var response = await httpClient.GetStringAsync(apiUrl);
// Save to Bronze layer in Data Lake
var blobServiceClient = new BlobServiceClient(storageConnectionString);
var containerClient = blobServiceClient.GetBlobContainerClient("bronze");
var timestamp = DateTime.UtcNow.ToString("yyyyMMddHHmmss");
var blobName = $"matomo/{reportMethod}/{DateTime.UtcNow:yyyy/MM/dd}/{timestamp}.json";
var blobClient = containerClient.GetBlobClient(blobName);
using (var stream = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(response)))
{
await blobClient.UploadAsync(stream, overwrite: true);
}
log.LogInformation($"Successfully extracted {reportMethod} to {blobName}");
}
catch (Exception ex)
{
log.LogError($"Error extracting {reportMethod}: {ex.Message}");
throw;
}
}
}
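Before deploying the Function, the same Reporting API call can be exercised locally. A minimal sketch using the requests package; the base URL, site ID and token are placeholders for your own values:
import requests
MATOMO_BASE_URL = "https://your-instance.matomo.cloud"  # placeholder
SITE_ID = "1"                                           # placeholder
API_TOKEN = "your-matomo-api-token"                     # placeholder
# Same parameters the Azure Function builds into its request URL
params = {
    "module": "API",
    "method": "VisitsSummary.get",
    "idSite": SITE_ID,
    "period": "day",
    "date": "yesterday",
    "format": "JSON",
    "token_auth": API_TOKEN,
}
response = requests.get(f"{MATOMO_BASE_URL}/index.php", params=params, timeout=30)
response.raise_for_status()
print(response.json())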
Step 3: Create Synapse Pipeline for Data Transformation¶
{
"name": "MatomoDataTransformation",
"properties": {
"activities": [
{
"name": "BronzeToSilver",
"type": "DataFlow",
"policy": {
"timeout": "1.00:00:00",
"retry": 2
},
"typeProperties": {
"dataflow": {
"referenceName": "TransformMatomoData",
"type": "DataFlowReference"
}
}
},
{
"name": "SilverToGold",
"type": "SparkJobDefinition",
"dependsOn": [
{
"activity": "BronzeToSilver",
"dependencyConditions": ["Succeeded"]
}
],
"policy": {
"timeout": "0.12:00:00",
"retry": 1
},
"typeProperties": {
"sparkJob": {
"referenceName": "AggregateMatomoData",
"type": "SparkJobDefinitionReference"
}
}
}
],
"annotations": ["Matomo", "ETL", "DataLake"]
}
}
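The pipeline is normally attached to a schedule trigger, but it can also be kicked off on demand. A hedged sketch using the Synapse REST createRun endpoint; the workspace and pipeline names are the example values from this page, and the API version shown is an assumption to verify against your environment:
import requests
from azure.identity import DefaultAzureCredential
WORKSPACE = "syn-matomo-analytics"
PIPELINE = "MatomoDataTransformation"
# Acquire an Azure AD token for the Synapse development endpoint
token = DefaultAzureCredential().get_token("https://dev.azuresynapse.net/.default").token
# Create a pipeline run via the Synapse REST API
url = f"https://{WORKSPACE}.dev.azuresynapse.net/pipelines/{PIPELINE}/createRun?api-version=2020-12-01"
response = requests.post(url, headers={"Authorization": f"Bearer {token}"}, timeout=30)
response.raise_for_status()
print("Pipeline run id:", response.json()["runId"])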
Step 4: Create Serverless SQL Views¶
-- Create a database scoped credential that uses the workspace managed identity
-- (run these statements in a user database, not master; creating the credential
--  requires a database master key)
CREATE DATABASE SCOPED CREDENTIAL ManagedIdentityCredential
WITH IDENTITY = 'Managed Identity';
GO
-- Create external data source for Data Lake
CREATE EXTERNAL DATA SOURCE MatomoDataLake
WITH (
LOCATION = 'https://stmatomoanalytics.dfs.core.windows.net',
CREDENTIAL = ManagedIdentityCredential
);
-- Create external file format for Parquet
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
-- Create schema for Matomo data
CREATE SCHEMA matomo;
GO
-- Create view for visits data (Silver layer)
CREATE VIEW matomo.visits AS
SELECT
date,
nb_visits,
nb_uniq_visitors,
nb_pageviews,
avg_time_on_site,
bounce_rate,
nb_conversions,
conversion_rate
FROM
OPENROWSET(
BULK 'silver/matomo/visits/*/*.parquet',
DATA_SOURCE = 'MatomoDataLake',
FORMAT = 'PARQUET'
) AS [result];
-- Create view for page analytics (Gold layer)
CREATE VIEW matomo.page_analytics AS
SELECT
date,
page_url,
page_title,
entry_nb_visits,
entry_bounce_count,
exit_nb_visits,
avg_time_on_page,
bounce_rate,
exit_rate
FROM
OPENROWSET(
BULK 'gold/matomo/page_analytics/*/*.parquet',
DATA_SOURCE = 'MatomoDataLake',
FORMAT = 'PARQUET'
) AS [result];
-- Materialized view for dashboard performance
-- (materialized views require a dedicated SQL pool; in serverless SQL, use a regular
--  view or pre-aggregate into the Gold layer instead)
CREATE MATERIALIZED VIEW matomo.daily_summary
WITH (DISTRIBUTION = HASH(date))
AS
SELECT
date,
SUM(nb_visits) as total_visits,
SUM(nb_uniq_visitors) as unique_visitors,
SUM(nb_pageviews) as pageviews,
AVG(avg_time_on_site) as avg_session_duration,
AVG(bounce_rate) as avg_bounce_rate
FROM matomo.visits
GROUP BY date;
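Once the views exist, they can be queried from any SQL client against the serverless endpoint. A sketch with pyodbc, assuming ODBC Driver 18 is installed and that the views live in a user database named matomo_analytics (the database name is an assumption, not something created above):
import pyodbc
# Serverless SQL endpoint of the Synapse workspace (see the Power BI connection string below)
conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=syn-matomo-analytics-ondemand.sql.azuresynapse.net;"
    "Database=matomo_analytics;"          # assumed user database holding the matomo schema
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)
with pyodbc.connect(conn_str) as conn:
    rows = conn.execute(
        "SELECT TOP 7 date, nb_visits, nb_pageviews FROM matomo.visits ORDER BY date DESC"
    ).fetchall()
    for row in rows:
        print(row.date, row.nb_visits, row.nb_pageviews)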
Step 5: Configure Power BI Connection¶
graph TD
subgraph "Power BI Configuration"
DS[Data Source Settings]
CRED[Azure AD Authentication]
DQ[DirectQuery Mode]
IMP[Import Mode]
COMP[Composite Model]
end
subgraph "Connection Types"
SYN1[Synapse Serverless<br/>SQL Endpoint]
SYN2[Synapse Dedicated<br/>SQL Pool]
LAKE[Direct Lake<br/>Mode]
end
DS --> CRED
CRED --> DQ
CRED --> IMP
DQ --> SYN1
IMP --> SYN2
COMP --> LAKE
style DS fill:#ffd,stroke:#333,stroke-width:2px
style LAKE fill:#dfd,stroke:#333,stroke-width:2px
Power BI Connection String (point Initial Catalog at the user database that holds the matomo views; user objects cannot be created in the serverless master database):
Data Source=syn-matomo-analytics-ondemand.sql.azuresynapse.net;
Initial Catalog=<database-containing-the-matomo-views>;
Authentication=Azure Active Directory;
Data Processing Patterns¶
Incremental Processing¶
from datetime import date
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from delta.tables import DeltaTable
spark = SparkSession.builder \
.appName("MatomoIncrementalProcessing") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()
# Paths in the Data Lake (the folder under bronze must match what the extraction
# function writes, e.g. matomo/VisitsSummary.get/yyyy/MM/dd/)
bronze_path = "abfss://bronze@stmatomoanalytics.dfs.core.windows.net/matomo/visits/"
silver_path = "abfss://silver@stmatomoanalytics.dfs.core.windows.net/matomo/visits/"
# Read the latest raw JSON from Bronze (folders are partitioned as yyyy/MM/dd)
new_data = spark.read.json(bronze_path + date.today().strftime("%Y/%m/%d") + "/*.json")
# Transform data
transformed_data = new_data \
.withColumn("processed_timestamp", current_timestamp()) \
.withColumn("date", to_date(col("date"))) \
.select(
"date",
"nb_visits",
"nb_uniq_visitors",
"nb_pageviews",
"avg_time_on_site",
"bounce_rate",
"processed_timestamp"
)
# Merge into Silver layer (Delta format)
if DeltaTable.isDeltaTable(spark, silver_path):
silver_table = DeltaTable.forPath(spark, silver_path)
silver_table.alias("existing") \
.merge(
transformed_data.alias("new"),
"existing.date = new.date"
) \
.whenMatchedUpdateAll() \
.whenNotMatchedInsertAll() \
.execute()
else:
transformed_data.write \
.format("delta") \
.mode("overwrite") \
.save(silver_path)
# Create Gold layer aggregations
gold_path = "abfss://gold@stmatomoanalytics.dfs.core.windows.net/matomo/weekly_summary/"
weekly_summary = spark.read.format("delta").load(silver_path) \
.groupBy(window(col("date").cast("timestamp"), "7 days")) \
.agg(
sum("nb_visits").alias("total_visits"),
sum("nb_uniq_visitors").alias("total_unique_visitors"),
sum("nb_pageviews").alias("total_pageviews"),
avg("avg_time_on_site").alias("avg_time_on_site"),
avg("bounce_rate").alias("avg_bounce_rate")
) \
.select(
col("window.start").alias("week_start"),
col("window.end").alias("week_end"),
"*"
) \
.drop("window")
weekly_summary.write \
.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.save(gold_path)
Performance Optimization¶
Partitioning Strategy¶
-- Partition by date for time-series queries.
-- Synapse SQL external tables do not support a PARTITION clause; partitioning comes
-- from the folder layout (e.g. silver/matomo/visits/year=2024/month=05/), and queries
-- prune partitions with the filepath() function over OPENROWSET or via partitioned views.
CREATE EXTERNAL TABLE matomo.visits_partitioned
(
date DATE,
nb_visits INT,
nb_uniq_visitors INT,
nb_pageviews INT,
avg_time_on_site FLOAT,
bounce_rate FLOAT
)
WITH (
LOCATION = 'silver/matomo/visits/',
DATA_SOURCE = MatomoDataLake,
FILE_FORMAT = ParquetFormat
);
Indexing and Statistics¶
-- Create statistics for query optimization (statistics are created on tables, so target
-- the external table defined above rather than the view)
CREATE STATISTICS stat_visits_date
ON matomo.visits_partitioned(date);
CREATE STATISTICS stat_visits_metrics
ON matomo.visits_partitioned(nb_visits, nb_pageviews);
-- Refresh statistics regularly after large loads (UPDATE STATISTICS applies to dedicated
-- SQL pools; in serverless SQL, drop and re-create the statistics instead)
UPDATE STATISTICS matomo.visits_partitioned;
Cost Optimization¶
Storage Tiers¶
graph LR
subgraph "Lifecycle Management"
HOT[Hot Tier<br/>0-30 days<br/>$0.0184/GB]
COOL[Cool Tier<br/>30-90 days<br/>$0.01/GB]
ARCH[Archive Tier<br/>90+ days<br/>$0.00099/GB]
end
HOT -->|After 30 days| COOL
COOL -->|After 90 days| ARCH
style HOT fill:#fdd,stroke:#333,stroke-width:2px
style COOL fill:#ffd,stroke:#333,stroke-width:2px
style ARCH fill:#dfd,stroke:#333,stroke-width:2px
Azure CLI for lifecycle policy:
az storage account management-policy create \
--account-name $STORAGE_ACCOUNT \
--resource-group $RESOURCE_GROUP \
--policy @lifecycle-policy.json
lifecycle-policy.json:
{
"rules": [
{
"name": "MatomoDataLifecycle",
"enabled": true,
"type": "Lifecycle",
"definition": {
"filters": {
"blobTypes": ["blockBlob"],
"prefixMatch": ["bronze/matomo"]
},
"actions": {
"baseBlob": {
"tierToCool": {
"daysAfterModificationGreaterThan": 30
},
"tierToArchive": {
"daysAfterModificationGreaterThan": 90
},
"delete": {
"daysAfterModificationGreaterThan": 365
}
}
}
}
}
]
}
Monitoring and Alerts¶
Azure Monitor Queries¶
// Function execution monitoring
FunctionAppLogs
| where FunctionName == "ExtractMatomoData"
| summarize
SuccessCount = countif(Level == "Information"),
ErrorCount = countif(Level == "Error"),
AvgDuration = avg(DurationMs)
by bin(TimeGenerated, 1h)
| render timechart
// Data Lake storage metrics
AzureMetrics
| where ResourceProvider == "MICROSOFT.STORAGE"
| where MetricName == "BlobCount" or MetricName == "BlobCapacity"
| summarize avg(Total) by MetricName, bin(TimeGenerated, 1h)
| render timechart
// Synapse serverless SQL request performance (requires the corresponding
// diagnostic setting to be enabled on the workspace)
SynapseBuiltinSqlPoolRequestsEnded
| summarize
QueryCount = count(),
AvgDuration = avg(DurationMs),
P95Duration = percentile(DurationMs, 95)
by bin(TimeGenerated, 1h)
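These queries can also be run programmatically, for example from a scheduled health check. A sketch with the azure-monitor-query package; the workspace ID is a placeholder and the query is a simplified version of the function-monitoring example above:
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient
WORKSPACE_ID = "<log-analytics-workspace-id>"  # placeholder
QUERY = """
FunctionAppLogs
| where FunctionName == "ExtractMatomoData"
| summarize Errors = countif(Level == "Error") by bin(TimeGenerated, 1h)
"""
client = LogsQueryClient(DefaultAzureCredential())
result = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(days=1))
for table in result.tables:
    for row in table.rows:
        print(list(row))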
Security Configuration¶
Managed Identity Setup¶
# Enable managed identity for Function App
az functionapp identity assign \
--name $FUNCTION_APP \
--resource-group $RESOURCE_GROUP
# Grant access to Key Vault
az keyvault set-policy \
--name $KEY_VAULT \
--object-id <managed-identity-object-id> \
--secret-permissions get list
# Grant access to Data Lake
az role assignment create \
--assignee <managed-identity-object-id> \
--role "Storage Blob Data Contributor" \
--scope /subscriptions/<subscription-id>/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT
Cost Analysis¶
Monthly Cost Breakdown¶
| Component | Configuration | Est. Monthly Cost |
|---|---|---|
| Azure Function | Consumption Plan, 720 executions | $5 |
| Data Lake Storage | 500GB Hot, 1TB Cool, 2TB Archive | $45 |
| Synapse Serverless SQL | 1TB queries/month | $5 |
| Synapse Spark Pool | 4 nodes, 2 hours/day | $150 |
| Power BI Pro | 10 users | $100 |
| Key Vault | Standard tier | $5 |
| Total | | ~$310/month |
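The quoted total is simply the sum of the per-component estimates; a quick check:
# Monthly estimates from the table above (USD)
monthly_costs = {
    "Azure Function": 5,
    "Data Lake Storage": 45,
    "Synapse Serverless SQL": 5,
    "Synapse Spark Pool": 150,
    "Power BI Pro": 100,
    "Key Vault": 5,
}
print(f"Estimated total: ~${sum(monthly_costs.values())}/month")  # ~$310/month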
Cost Optimization Strategies¶
- Use Serverless SQL instead of dedicated pools when possible
- Implement data lifecycle policies to move old data to cheaper tiers
- Optimize Spark jobs to reduce processing time (see the compaction sketch after this list)
- Use Delta Lake format for better compression (3-8x)
- Schedule processing during off-peak hours
- Monitor and optimize query patterns
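For the Spark-related points, one concrete optimization is periodically compacting the many small files an hourly pipeline produces. A sketch under the assumption that the Silver table is stored as Delta at the path used earlier; the target partition count (8) is an arbitrary example to tune for your data volume:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CompactMatomoSilver").getOrCreate()
silver_path = "abfss://silver@stmatomoanalytics.dfs.core.windows.net/matomo/visits/"
# Rewrite the table into fewer, larger files; dataChange=false marks the commit
# as a file-layout change only, so the table contents are unchanged
(
    spark.read.format("delta").load(silver_path)
    .repartition(8)
    .write
    .format("delta")
    .option("dataChange", "false")
    .mode("overwrite")
    .save(silver_path)
)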
Disaster Recovery¶
Backup Strategy¶
# Enable soft delete for Data Lake
az storage account blob-service-properties update \
--account-name $STORAGE_ACCOUNT \
--resource-group $RESOURCE_GROUP \
--enable-delete-retention true \
--delete-retention-days 30
# Configure geo-redundant storage
az storage account update \
--name $STORAGE_ACCOUNT \
--resource-group $RESOURCE_GROUP \
--sku Standard_GRS
Next Steps¶
- Implement the Azure Function for data extraction
- Set up Synapse workspace and pipelines
- Configure Power BI datasets and reports
- Establish monitoring and alerting
- Consider Real-time Streaming for live data requirements