Medallion Architecture on Azure: Databricks + Synapse in Production
How we designed a Bronze → Silver → Gold medallion lake at Gamasuite to handle real-time freight data from 85+ carriers — with zero data loss guarantees, schema evolution, and sub-500ms latency from carrier API to user quote card.
What Is Medallion Architecture?
The medallion architecture (or multi-hop architecture) organises a data lake into three layers:
- Bronze — raw ingested data, exactly as received, with no transformation
- Silver — cleaned, deduplicated, conformed data in a standardised schema
- Gold — business-aggregated, query-ready datasets optimised for specific consumption patterns
The pattern solves a fundamental tension in analytics data systems: the need for fast access to processed, query-optimised data versus the need to reprocess raw data when business logic or quality rules change. By preserving raw data in Bronze, you can always re-derive Silver and Gold from scratch as requirements evolve — without needing to re-fetch from source systems.
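As a toy illustration of that re-derivation property (plain Python standing in for the actual Spark jobs, with hypothetical record shapes):

```python
# Minimal illustration of the medallion re-derivation property.
# Bronze keeps raw records untouched; Silver is a pure function of Bronze,
# so changing the cleaning rules only means re-running the derivation.

bronze = [  # raw, exactly as received (including a bad record)
    {"carrier": "MAERSK", "rate": "1250.00", "currency": "USD"},
    {"carrier": "CMA-CGM", "rate": "-1", "currency": "USD"},
]

def derive_silver(bronze_rows, min_rate=0):
    """Cleaning rule: drop non-positive rates, cast strings to float."""
    return [
        {**r, "rate": float(r["rate"])}
        for r in bronze_rows
        if float(r["rate"]) > min_rate
    ]

silver = derive_silver(bronze)  # one clean record survives
# Business rule changes? Re-derive from Bronze -- no re-fetch from carriers.
silver_v2 = derive_silver(bronze, min_rate=100)
```

Because Bronze is never mutated, both derivations read the same raw input; only the transformation logic differs.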
At Gamasuite we processed live freight rates from 85+ carriers including MAERSK, HAPAG-LLOYD, CMA-CGM, and WAN-HAI. Latency was critical — users expected sub-second quote comparisons. The rest of this post walks through how the medallion layers made that possible at scale.
Why Freight Rate Data Is Hard
Before getting to the architecture, it is worth understanding what makes freight rate data difficult enough to justify this level of infrastructure investment.
Volume: 85+ carriers, each with hundreds of lane-container type combinations, updated multiple times per day. Processing every rate update means handling tens of thousands of messages per hour during peak carrier update windows.
Heterogeneity: Each carrier provides rates in a different format — some via REST API with JSON, some via EDI (ANSI X12 or UN/EDIFACT), some via CSV file drop on SFTP, some via proprietary API with non-standard data models. Unifying these formats into a consistent schema requires substantial transformation logic.
Temporal complexity: Freight rates have complex temporal semantics — effective dates, expiry dates, booking validity windows, and surcharge schedules that stack on top of base rates. A "current" rate is not a point-in-time snapshot but a function of sailing date, booking date, and commodity type.
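To make the temporal semantics concrete, here is a minimal sketch of rate resolution; the field names and stacking rule are illustrative, not our production model:

```python
from datetime import date
from decimal import Decimal

# Hypothetical sketch: a "current" rate is resolved as a function of
# sailing date and booking date, with surcharges stacked on the base rate.

def resolve_rate(base_rate, surcharges, sailing_date, booking_date):
    """Return the total charge if the rate is bookable, else None."""
    if not (base_rate["effective"] <= sailing_date <= base_rate["expiry"]):
        return None  # rate not valid for this sailing
    if booking_date > base_rate["booking_deadline"]:
        return None  # booking validity window has closed
    total = base_rate["amount"]
    for s in surcharges:  # surcharge schedules stack on top of the base rate
        if s["effective"] <= sailing_date <= s["expiry"]:
            total += s["amount"]
    return total

rate = {"amount": Decimal("1500"), "effective": date(2024, 1, 1),
        "expiry": date(2024, 3, 31), "booking_deadline": date(2024, 3, 15)}
fuel = [{"amount": Decimal("120"), "effective": date(2024, 2, 1),
         "expiry": date(2024, 2, 29)}]

resolve_rate(rate, fuel, date(2024, 2, 10), date(2024, 2, 1))  # Decimal('1620')
```

The same base rate yields different totals for different sailing dates, and no total at all outside its validity windows; that is why "the current rate" is not a single snapshot.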
Correctness requirements: Users quote customers based on these rates, then buy space from carriers at those rates. A stale or incorrect rate in the Gold layer is not just a data quality problem — it is a financial liability. We needed correctness guarantees that went beyond "eventually consistent."
Our Azure Stack
Carrier APIs / EDI → Azure Functions → Event Hub
↓
Azure Databricks (Bronze ingestion)
↓
Delta Lake on ADLS Gen2 (Silver — cleansed)
↓
Azure Synapse Analytics (Gold — aggregated)
↓
Azure Cache for Redis (hot query cache)
↓
.NET 8 API → Angular frontend

Each component has a specific role, and the choices at each level were deliberate.
Azure Functions for ingestion: Serverless, scales automatically with carrier update event volume, and keeps the ingestion layer decoupled from the processing layer. Functions write raw events to Event Hub for buffering.
Azure Event Hub: Provides durable event buffering between ingestion and processing, handles burst ingestion from multiple carriers updating simultaneously, and gives Databricks a stable read position through consumer groups.
Azure Databricks: The processing engine for both batch and streaming transformations. Databricks' Delta Lake format is the technical foundation that makes the medallion pattern reliable.
Azure Synapse Analytics: The Gold layer SQL warehouse — optimised for analytical queries from the comparison API and BI dashboards, with dedicated SQL pools for consistent query performance.
Azure Cache for Redis: Hot cache for the most-queried Gold layer data — reducing Synapse query load for frequently-accessed routes. Covered separately in our Redis caching post.
Key Design Decisions
1. Delta Lake for ACID Transactions
Using Delta Lake on top of Azure Data Lake Storage Gen2 gave us capabilities that vanilla Parquet on blob storage cannot provide:
ACID transactions: Multiple Databricks jobs writing to the same Delta table do not corrupt each other — Delta's transaction log ensures atomicity and isolation. This matters during carrier update surges when multiple streaming jobs are writing Silver simultaneously.
Time travel: Every Delta table maintains a transaction log that enables reading the table as of any previous point in time. When we need to debug a data quality issue — "what did the Silver table look like at 14:32 yesterday?" — we can query it directly without any manual backup process.
Schema enforcement and evolution: Delta enforces schema on write, catching schema errors at ingestion time rather than at query time. When carriers change their API response schema (which happens regularly, often without notice), schema enforcement surfaces the issue immediately in the Bronze layer rather than silently corrupting Silver.
Schema evolution: When we need to add a column to a Silver table, Delta supports adding nullable columns to existing tables without rewriting historical data. Existing rows materialise the new column as null; new rows populate it. This allowed us to extend our data model as business requirements evolved without downtime.
```python
# Bronze → Silver streaming pipeline with Delta Lake
from pyspark.sql.functions import col, current_timestamp, lit

df_silver = (
    spark.readStream
    .format("delta")
    .load("/mnt/bronze/freight-rates")
    .filter("is_valid = true")
    .dropDuplicates(["carrier_id", "rate_id", "effective_date"])
    # Conform to silver schema
    .withColumn("rate_usd", col("total_charge").cast("decimal(18,4)"))
    .withColumn("processed_at", current_timestamp())
    .withColumn("data_source", lit("carrier-api"))
)

(df_silver.writeStream
    .format("delta")
    .outputMode("append")
    .trigger(processingTime="30 seconds")  # 30-second micro-batch interval
    .option("checkpointLocation", "/mnt/checkpoints/silver-rates")
    .option("mergeSchema", "true")  # Auto-merge schema evolution
    .start("/mnt/silver/freight-rates"))
```

The checkpointLocation is critical — it tracks exactly which Bronze table transactions have already been processed, so that if the streaming job fails, it resumes from where it left off rather than reprocessing or skipping events. This is the mechanism behind the zero-data-loss guarantee.
2. Structured Streaming for Near-Real-Time
Databricks Structured Streaming processes carrier rate events continuously — not in hourly or daily batches. The streaming micro-batch interval was tuned to 30 seconds: frequent enough to provide sub-minute freshness for the Silver layer, infrequent enough to not generate excessive Databricks job overhead.
The Bronze → Silver streaming job validates, deduplicates, and normalises incoming events. Deduplication is non-trivial: carriers occasionally send duplicate rate updates when retrying after perceived failures, and EDI feeds sometimes double-deliver. The dropDuplicates on ["carrier_id", "rate_id", "effective_date"] eliminates these at the Silver layer.
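The composite-key deduplication can be illustrated outside Spark with a small stand-in (first-wins here for determinism; Spark's dropDuplicates keeps an arbitrary row per key):

```python
# First-wins deduplication on the (carrier_id, rate_id, effective_date)
# composite key -- a plain-Python stand-in for Spark's dropDuplicates.

def dedupe(records):
    seen = set()
    out = []
    for r in records:
        key = (r["carrier_id"], r["rate_id"], r["effective_date"])
        if key not in seen:  # keep the first occurrence, drop retries
            seen.add(key)
            out.append(r)
    return out

events = [
    {"carrier_id": "MAERSK", "rate_id": "Q1",
     "effective_date": "2024-02-01", "rate": 1500},
    {"carrier_id": "MAERSK", "rate_id": "Q1",
     "effective_date": "2024-02-01", "rate": 1500},  # carrier retry: dropped
    {"carrier_id": "MAERSK", "rate_id": "Q1",
     "effective_date": "2024-03-01", "rate": 1550},  # new period: kept
]
dedupe(events)  # 2 records survive
```

Note that including effective_date in the key is what keeps a genuinely new rate period from being mistaken for a duplicate of the previous one.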
3. Carrier-Specific Parsers as Bronze Layer
The heterogeneity of carrier formats required carrier-specific parsing at the Bronze ingestion layer. Each carrier has its own Azure Function and its own parser class:
```python
import json
from datetime import datetime
from decimal import Decimal

class MaerskRateParser(BaseCarrierParser):
    """Maersk Spot API response parser."""

    def parse(self, raw_payload: dict) -> list[BronzeRateRecord]:
        records = []
        for quote in raw_payload.get("quotes", []):
            for schedule in quote.get("schedules", []):
                records.append(BronzeRateRecord(
                    carrier_id="MAERSK",
                    rate_id=quote["quotationId"],
                    origin_port=quote["originLocation"]["locationCode"],
                    destination_port=quote["destinationLocation"]["locationCode"],
                    container_type=quote["cargoType"],
                    effective_date=datetime.fromisoformat(schedule["departureDate"]),
                    expiry_date=datetime.fromisoformat(schedule["expiryDate"]),
                    total_charge=Decimal(str(schedule["totalPrice"]["value"])),
                    currency=schedule["totalPrice"]["currency"],
                    raw_payload=json.dumps(raw_payload),  # Preserve raw for replay
                    is_valid=True,
                    parse_error=None
                ))
        return records

    def validate(self, record: BronzeRateRecord) -> tuple[bool, str | None]:
        if not record.origin_port or not record.destination_port:
            return False, "Missing port codes"
        if record.total_charge <= 0:
            return False, "Invalid charge amount"
        if record.expiry_date <= record.effective_date:
            return False, "Expiry before effective date"
        return True, None
```

Note the raw_payload field storing the original JSON. The Bronze layer stores the raw payload alongside the parsed record — enabling replay when parsing logic changes. When we discover a parser bug that was silently corrupting data, we can fix the parser and reprocess the Bronze layer's raw payloads to regenerate clean Silver data.
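A minimal sketch of that replay idea, with hypothetical names (the real pipeline re-reads Bronze through Spark, not a Python loop):

```python
import json

# Sketch of Bronze replay: because Bronze stores raw_payload alongside the
# parsed fields, a fixed parser can regenerate Silver without re-calling
# carrier APIs. All names here are illustrative.

def reparse_bronze(bronze_rows, parser):
    """Re-run a (fixed) parser over the preserved raw payloads."""
    silver = []
    for row in bronze_rows:
        payload = json.loads(row["raw_payload"])
        silver.extend(parser(payload))
    return silver

def fixed_parser(payload):
    # Toy "fixed" parser; imagine the buggy version read the wrong field.
    return [{"rate_id": q["quotationId"], "total": q["totalPrice"]}
            for q in payload.get("quotes", [])]

bronze_rows = [{"raw_payload": json.dumps(
    {"quotes": [{"quotationId": "Q1", "totalPrice": 1500}]})}]

regenerated = reparse_bronze(bronze_rows, fixed_parser)  # clean Silver records
```

The key point is that replay needs only the Bronze table and the corrected parser; the source systems are never touched.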
4. Synapse Analytics for Gold Aggregation
Gold layer ran in Synapse for two consumption patterns: BI reporting (Power BI connected to Synapse) and the comparison API (our .NET API queried Synapse for rate comparisons).
```sql
-- Gold layer view for carrier comparison
CREATE VIEW gold.carrier_rate_comparison AS
SELECT
    origin_port,
    destination_port,
    container_type,
    carrier_name,
    MIN(total_rate_usd) AS min_rate,
    AVG(total_rate_usd) AS avg_rate,
    MAX(total_rate_usd) AS max_rate,
    COUNT(*) AS quote_count,
    MAX(updated_at) AS last_updated
FROM silver.freight_rates
WHERE effective_date >= DATEADD(day, -7, GETDATE())
  AND expiry_date >= GETDATE()  -- Only currently-valid rates
GROUP BY origin_port, destination_port, container_type, carrier_name;

-- Gold layer pre-aggregated table (refreshed every 5 minutes via Synapse pipeline)
CREATE TABLE gold.route_summary
WITH (DISTRIBUTION = HASH(route_key), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT
    CONCAT(origin_port, '-', destination_port, '-', container_type) AS route_key,
    origin_port,
    destination_port,
    container_type,
    COUNT(DISTINCT carrier_name) AS carrier_count,
    MIN(total_rate_usd) AS best_rate,
    AVG(total_rate_usd) AS avg_market_rate,
    GETDATE() AS calculated_at
FROM silver.freight_rates
WHERE is_valid = 1
  AND effective_date <= GETDATE()
  AND expiry_date >= GETDATE()
GROUP BY origin_port, destination_port, container_type;
```

The route_summary table is pre-aggregated and stored as a clustered columnstore index — Synapse's most efficient storage format for analytical queries. This table is refreshed every 5 minutes by a Synapse pipeline, trading the cost of a refresh job for dramatically faster query response times compared to computing the aggregation on every API request.
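For intuition, the route_summary aggregation corresponds to this pure-Python computation (simplified: validity filtering and currency handling are omitted):

```python
from collections import defaultdict

# Pure-Python mirror of the gold.route_summary aggregation: group rates
# by route key, then compute carrier count, best rate, and market average.

def summarise_routes(rates):
    groups = defaultdict(list)
    for r in rates:
        key = f'{r["origin"]}-{r["destination"]}-{r["container"]}'
        groups[key].append(r)
    return {
        key: {
            "carrier_count": len({r["carrier"] for r in rows}),
            "best_rate": min(r["rate"] for r in rows),
            "avg_market_rate": sum(r["rate"] for r in rows) / len(rows),
        }
        for key, rows in groups.items()
    }

rates = [
    {"origin": "NLRTM", "destination": "SGSIN", "container": "40HC",
     "carrier": "MAERSK", "rate": 1500},
    {"origin": "NLRTM", "destination": "SGSIN", "container": "40HC",
     "carrier": "CMA-CGM", "rate": 1400},
]
summary = summarise_routes(rates)
summary["NLRTM-SGSIN-40HC"]["best_rate"]  # 1400
```

Pre-computing this per route is what lets the comparison API serve "best rate on this lane" as a key lookup instead of a scan-and-aggregate query.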
5. Data Quality Framework
Data quality issues in freight rate data are not hypothetical — they are constant. Carriers submit negative charges, future-dated effective dates, impossible weight ranges, and schema variations that break naive parsers. We implemented a data quality framework that quantified quality at each layer:
```python
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

# Silver layer data quality checks — run after each streaming micro-batch.
# logger and send_alert are wired to Application Insights elsewhere.
def check_silver_quality(df: DataFrame, batch_id: int) -> None:
    total = df.count()
    quality_checks = {
        "no_negative_rates": df.filter(col("total_rate_usd") <= 0).count(),
        # Anchored regex: the port code must be exactly five uppercase letters
        "valid_port_codes": df.filter(~col("origin_port").rlike("^[A-Z]{5}$")).count(),
        "future_effective_dates": df.filter(col("effective_date") > col("created_at")).count(),
        "expired_on_arrival": df.filter(col("expiry_date") < col("effective_date")).count()
    }
    for check_name, failure_count in quality_checks.items():
        failure_rate = failure_count / total if total > 0 else 0
        # Log to Application Insights
        logger.info(f"DQ Check {check_name}: {failure_rate:.2%} failure rate in batch {batch_id}")
        # Alert if failure rate exceeds threshold
        if failure_rate > 0.05:  # 5% threshold
            send_alert(f"Data quality issue in Silver: {check_name} at {failure_rate:.1%}")
```

Quality metrics were surfaced in a Power BI dashboard visible to both the data engineering team and the business operations team — making data quality a visible metric rather than a hidden concern.
Operational Lessons
Lesson 1: Lag-based cluster autoscaling, not time-based. We initially used time-based scaling (scale up at 07:00, scale down at 22:00). Carrier update bursts did not follow business hours — HAPAG-LLOYD would push large rate updates at 03:00 UTC. Switching to lag-based autoscaling (scale up when Event Hub consumer lag exceeds a threshold, scale down when the lag clears) cut idle cluster costs by 35% and eliminated lag-induced data freshness degradation.
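That scaling decision can be sketched as a small policy function; the thresholds and step sizes here are illustrative, not our production values:

```python
# Sketch of a lag-driven autoscaling policy: size the cluster from Event
# Hub consumer lag rather than time of day. All thresholds are illustrative.

def target_workers(current_workers, lag_events,
                   scale_up_lag=50_000, scale_down_lag=1_000,
                   min_workers=2, max_workers=16):
    if lag_events > scale_up_lag:
        return min(current_workers * 2, max_workers)  # burst: double up fast
    if lag_events < scale_down_lag:
        return max(current_workers - 1, min_workers)  # drained: step down slowly
    return current_workers                            # steady state: hold

target_workers(4, 120_000)  # 8 -- e.g. a 03:00 UTC carrier update burst
target_workers(4, 200)      # 3 -- lag cleared, shed one worker
```

Doubling up but stepping down one worker at a time is a deliberate asymmetry: under-provisioning during a burst costs data freshness, while a briefly oversized cluster only costs compute.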
Lesson 2: Checkpoint locations matter more than code. Three times in production, the streaming pipeline resumed from the wrong position after a cluster restart — causing either data gaps or reprocessed duplicates. Each time, the root cause was a misconfigured checkpoint directory. Checkpoint paths are now parameterised from configuration and validated at job start, and any change to them goes through code review.
Lesson 3: Schema evolution is a business communication problem. When a carrier changes their API schema, the impact depends on how downstream Silver and Gold schemas handle it. We learned to treat carrier API schema changes as a business event requiring coordination — not purely a technical event handled by developers. A carrier changing how they represent surcharges could break a user's quote without an obvious error anywhere in the pipeline.
Lesson 4: Delta time travel is not a backup strategy. Time travel is valuable for debugging and limited-scope recovery, but Delta's default retention (VACUUM removes data files older than 7 days) means history beyond that window is not reliably accessible via time travel. Genuine backup requires Azure Storage geo-replication and periodic full snapshots exported to cold storage. We had a near-miss when investigating a data issue that turned out to have started 10 days before discovery.
Results
- 85+ carriers integrated via a single unified Bronze schema
- <500ms average latency from carrier response to UI quote card (including Silver processing and Gold refresh)
- Zero data loss during carrier API schema changes (schema evolution via Delta)
- Zero downtime during the quarterly carrier integration updates
- 60% reduction in manual data reconciliation effort — quality issues caught at Bronze and Silver layers instead of discovered in Gold by business users
- 35% infrastructure cost reduction from cluster autoscaling vs. time-based scaling
Muhammad Moid Shams is a Lead Software Engineer specialising in Azure data engineering, distributed systems, and healthcare data platforms.