All articles
Data Engineering8 min read2024-06-12

Medallion Architecture on Azure: Databricks + Synapse in Production

How we designed a Bronze → Silver → Gold medallion lake at Gamasuite to handle real-time freight data from 85+ carriers — with zero data loss guarantees.

Azure DatabricksSynapse AnalyticsData LakeETL

What Is Medallion Architecture?

The medallion architecture (or multi-hop architecture) organises a data lake into three layers:

  • Bronze — raw ingested data, exactly as received
  • Silver — cleaned, deduplicated, conformed data
  • Gold — business-aggregated, query-ready datasets

At Gamasuite we processed live freight rates from 85+ carriers including MAERSK, HAPAG-LLOYD, CMA-CGM, and WAN-HAI. Latency was critical — users expected sub-second quote comparisons.

Our Azure Stack

Carrier APIs / EDI → Azure Functions → Event Hub

                              Azure Databricks (Bronze)

                              Delta Lake (Silver — cleansed)

                              Azure Synapse (Gold — aggregated)

                              .NET API → Angular frontend

Key Design Decisions

1. Delta Lake for ACID Transactions

Using Delta Lake on top of Azure Data Lake Storage Gen2 gave us:

  • Time travel (audit & rollback)
  • Schema enforcement on ingestion
  • Concurrent read/write without locks

2. Structured Streaming for Near-Real-Time

We used Databricks Structured Streaming to continuously push Bronze → Silver:

# Bronze → Silver streaming pipeline
df_silver = (
  spark.readStream
    .format("delta")
    .load("/mnt/bronze/freight-rates")
    .filter("is_valid = true")
    .dropDuplicates(["carrier_id", "rate_id", "effective_date"])
)
 
df_silver.writeStream \
  .format("delta") \
  .outputMode("append") \
  .option("checkpointLocation", "/mnt/checkpoints/silver") \
  .start("/mnt/silver/freight-rates")

3. Synapse Analytics for Gold Aggregation

Gold layer ran in Synapse for BI and our own comparison API:

CREATE VIEW gold.carrier_rate_comparison AS
SELECT 
  origin_port,
  destination_port,
  container_type,
  carrier_name,
  MIN(total_rate_usd) AS min_rate,
  AVG(total_rate_usd) AS avg_rate,
  COUNT(*) AS quote_count
FROM silver.freight_rates
WHERE effective_date >= DATEADD(day, -7, GETDATE())
GROUP BY origin_port, destination_port, container_type, carrier_name;

Results

  • 85+ carriers integrated via a single unified schema
  • <500ms average latency from carrier response to UI quote card
  • Zero downtime during carrier API schema changes (schema evolution via Delta)
  • 60% reduction in manual data reconciliation effort