Data Engineering8 min read2024-06-12
Medallion Architecture on Azure: Databricks + Synapse in Production
How we designed a Bronze → Silver → Gold medallion lake at Gamasuite to handle real-time freight data from 85+ carriers — with zero data loss guarantees.
Azure DatabricksSynapse AnalyticsData LakeETL
What Is Medallion Architecture?
The medallion architecture (or multi-hop architecture) organises a data lake into three layers:
- Bronze — raw ingested data, exactly as received
- Silver — cleaned, deduplicated, conformed data
- Gold — business-aggregated, query-ready datasets
At Gamasuite we processed live freight rates from 85+ carriers including MAERSK, HAPAG-LLOYD, CMA-CGM, and WAN-HAI. Latency was critical — users expected sub-second quote comparisons.
Our Azure Stack
Carrier APIs / EDI → Azure Functions → Event Hub
↓
Azure Databricks (Bronze)
↓
Delta Lake (Silver — cleansed)
↓
Azure Synapse (Gold — aggregated)
↓
.NET API → Angular frontendKey Design Decisions
1. Delta Lake for ACID Transactions
Using Delta Lake on top of Azure Data Lake Storage Gen2 gave us:
- Time travel (audit & rollback)
- Schema enforcement on ingestion
- Concurrent read/write without locks
2. Structured Streaming for Near-Real-Time
We used Databricks Structured Streaming to continuously push Bronze → Silver:
# Bronze → Silver streaming pipeline
df_silver = (
spark.readStream
.format("delta")
.load("/mnt/bronze/freight-rates")
.filter("is_valid = true")
.dropDuplicates(["carrier_id", "rate_id", "effective_date"])
)
df_silver.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "/mnt/checkpoints/silver") \
.start("/mnt/silver/freight-rates")3. Synapse Analytics for Gold Aggregation
Gold layer ran in Synapse for BI and our own comparison API:
CREATE VIEW gold.carrier_rate_comparison AS
SELECT
origin_port,
destination_port,
container_type,
carrier_name,
MIN(total_rate_usd) AS min_rate,
AVG(total_rate_usd) AS avg_rate,
COUNT(*) AS quote_count
FROM silver.freight_rates
WHERE effective_date >= DATEADD(day, -7, GETDATE())
GROUP BY origin_port, destination_port, container_type, carrier_name;Results
- 85+ carriers integrated via a single unified schema
- <500ms average latency from carrier response to UI quote card
- Zero downtime during carrier API schema changes (schema evolution via Delta)
- 60% reduction in manual data reconciliation effort