Backend · 16 min read · 2023-10-15

Migrating a 6-Year-Old .NET Core 2 SaaS to .NET 8: Lessons Learned

A pragmatic, zero-downtime migration strategy for a production freight SaaS — from .NET Core 2 to .NET 8 with microservice decomposition, async rewrites, and a 4x improvement in per-node concurrent request capacity.

.NET 8 · Migration · Microservices · Performance · .NET · ASP.NET Core

Why Migrate?

The Gamasuite codebase was born on .NET Core 2 in 2018. By 2023 it had:

  • 6 tightly-coupled services in a monorepo sharing a single deployment pipeline
  • Performance bottlenecks at 2,000+ concurrent rate requests, driven by synchronous blocking I/O throughout the carrier integration layer
  • No support for newer C# features (records, pattern matching, async streams, primary constructors)
  • End-of-life runtime with no security patches — a compliance liability for a SaaS handling commercial data
  • A test suite that had not kept pace with the codebase, creating deployment risk

The business case was straightforward: the performance ceiling was limiting growth (we had turned down enterprise customers citing latency SLA concerns), the security posture was weak, and developer velocity was degraded by the absence of modern C# features and tooling.

What was not straightforward was how to migrate a production system with 10,000+ daily active users, dozens of carrier API integrations, and no meaningful downtime budget.

Why Not a Big-Bang Rewrite?

The instinct when facing a legacy codebase is to propose a clean rewrite. This is almost always the wrong instinct, and it was clearly wrong here.

The Gamasuite codebase contained six years of accumulated business logic — freight rate normalisation algorithms, carrier-specific edge case handling, surcharge calculation rules, and booking workflow rules that lived in the code, not in documentation. A rewrite starting from scratch would take 12-18 months and would invariably miss business logic that was only discovered by observing production behaviour.

The strangler fig pattern — incrementally replacing pieces of the system from the outside in — was the right approach. Ship value continuously, maintain production stability, and let the migration happen over multiple quarters rather than in a single high-risk release.

Migration Strategy: Strangler Fig in Four Phases

Phase 0: Assessment and Preparation (2 weeks)

Before writing any migration code, we needed to understand what we were migrating. This phase produced:

Dependency map: Every service, its dependencies, and the coupling points between them. This identified the extraction order — which services could be migrated independently vs. which had too many dependencies to extract cleanly.

Performance profile: Using dotnet-trace and Application Insights, we identified the specific bottlenecks. 73% of latency on quote generation was in blocking carrier API calls. This immediately identified the Rate Engine as both the highest-impact and most technically independent extraction candidate.

Test coverage audit: Test coverage was 31% — dangerous for a migration. We spent one sprint raising coverage to 65% on the critical paths before touching any runtime code. This was the most important unglamorous work of the entire project.

Feature flag infrastructure: Configured Azure App Configuration for feature flags before the first migration change. Feature flags for gradual traffic routing were essential for the safe rollout strategy.
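The wiring itself is not shown above, so here is a minimal sketch of the kind of registration this implies, assuming the Microsoft.Extensions.Configuration.AzureAppConfiguration and Microsoft.FeatureManagement packages; the configuration key and the UseNewRateEngine flag name are illustrative, not the actual Gamasuite names:

using Microsoft.FeatureManagement;

var builder = WebApplication.CreateBuilder(args);

// Load configuration and feature flags from Azure App Configuration
builder.Configuration.AddAzureAppConfiguration(options =>
    options.Connect(builder.Configuration["AppConfig:ConnectionString"])
           .UseFeatureFlags());

// Registers IFeatureManager so flags can be queried per request
builder.Services.AddFeatureManagement();

var app = builder.Build();

// Example check: only take the new code path when the flag is enabled
app.MapGet("/flags/demo", async (IFeatureManager features) =>
    await features.IsEnabledAsync("UseNewRateEngine")
        ? Results.Ok("new rate engine")
        : Results.Ok("legacy rate engine"));

app.Run();

The same IFeatureManager check is what later gated the per-service traffic ramps described in Phase 4.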

Phase 1: Upgrade In-Place to .NET 6 (3 weeks)

The first step was not .NET 8 — it was .NET 6. This intermediate step surfaced all deprecated API usage and compatibility issues without simultaneously introducing the performance-optimising rewrites that would make root causes harder to identify.

The upgrade process:

  1. Update global.json and *.csproj target frameworks to net6.0
  2. Run build — fix compilation errors (deprecated APIs, removed methods)
  3. Run tests — fix test failures from behaviour changes
  4. Deploy to staging — smoke test all critical paths
  5. Deploy to production with feature flags on 10% of traffic — monitor for regressions

.NET 6 is a Long Term Support release. Stabilising on it before migrating to .NET 8 gave us a stable intermediate point and reduced the blast radius of any single migration step.

Phase 2: Extract the Rate Engine Microservice (.NET 8) (6 weeks)

The rate calculation engine was the hottest path and the most independent service. We extracted it first — both for the highest immediate performance benefit and because successful extraction of one service proved the migration approach before committing to the full programme.

The key rewrite in the Rate Engine was the carrier API call pattern:

// Old: synchronous, blocking — each carrier waited for the previous
public List<CarrierRate> GetRates(RouteRequest request)
{
    var results = new List<CarrierRate>();
    foreach (var carrier in _carriers)
    {
        results.Add(_carrierClient.GetRate(carrier, request)); // blocking HTTP
    }
    return results;
}
 
// New: true parallelism with async streams — results stream to UI as each carrier responds
public async IAsyncEnumerable<CarrierRate> GetRatesAsync(
    RouteRequest request,
    [EnumeratorCancellation] CancellationToken ct = default)
{
    // Materialise the list so every carrier call starts immediately and runs in parallel
    var tasks = _carriers.Select(c => _carrierClient.GetRateAsync(c, request, ct)).ToList();
    
    await foreach (var result in Task.WhenEach(tasks).WithCancellation(ct))
    {
        yield return await result;
    }
}

The old implementation waited for the slowest carrier before returning anything. With 85 carriers, the slowest response determined the total latency — typically 2,400-2,800ms.

The new implementation uses Task.WhenEach (new in .NET 9, backported to our .NET 8 build via a compatibility shim) to yield results as each carrier responds. The user's quote card started populating in 180ms (fastest carrier response) and completed filling over the next 300-400ms as slower carriers responded. Total wall clock time dropped from 2,800ms to ~480ms on average.

This alone gave us a 40% latency reduction on quote generation.
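The compatibility shim mentioned above is not reproduced in this post; a minimal sketch of what a Task.WhenEach-style backport can look like on .NET 8 is below, built on Task.WhenAny rather than mirroring the real .NET 9 implementation, with an illustrative class name:

using System.Runtime.CompilerServices;

// Illustrative stand-in for Task.WhenEach on .NET 8: yields each task as it completes.
// The repeated WhenAny scan is O(n^2) in the number of tasks, which is fine for ~85
// carrier calls but would want a channel-based implementation for very large sets.
public static class TaskCompat
{
    public static async IAsyncEnumerable<Task<T>> WhenEach<T>(
        IEnumerable<Task<T>> tasks,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        var pending = tasks.ToList();
        while (pending.Count > 0)
        {
            ct.ThrowIfCancellationRequested();
            var completed = await Task.WhenAny(pending);
            pending.Remove(completed);
            yield return completed;
        }
    }
}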

Phase 3: Minimal APIs for Internal Services (4 weeks)

The internal microservice communication layer — previously using legacy controller-based APIs with heavyweight middleware — migrated to .NET 8 Minimal APIs:

var builder = WebApplication.CreateBuilder(args);
 
// Modern .NET 8 DI setup
builder.Services
    .AddScoped<IRateCalculatorService, RateCalculatorService>()
    .AddScoped<ICarrierRepository, CarrierRepository>()
    .AddOpenTelemetry()
        .WithTracing(tracing => tracing
            .AddAspNetCoreInstrumentation()
            .AddHttpClientInstrumentation()
            .AddSource("Gamasuite.RateEngine"));

// Health checks must be registered for the /health endpoint mapped below
builder.Services.AddHealthChecks();

var app = builder.Build();
 
// Minimal API endpoint — no controller boilerplate
app.MapPost("/rates/calculate", async (
    RateRequest req,
    IRateCalculatorService calculator,
    CancellationToken ct) =>
{
    var rates = await calculator.CalculateAsync(req, ct);
    return Results.Ok(rates);
})
.RequireAuthorization("internal-service") // authorization policy configured during service registration (not shown)
.WithName("CalculateRates")
.WithOpenApi();
 
// Health check endpoint for Kubernetes liveness probe
app.MapHealthChecks("/health");
 
await app.RunAsync();

The Minimal API approach eliminated the controller layer's per-request overhead — controller instantiation, action filter execution, model binding for simple cases — reducing memory allocation per request significantly.

Phase 4: Remaining Services and Records Adoption (8 weeks)

The remaining services migrated in dependency order: the Booking Service (depended on Rate Engine), the Shipment Tracking Service (independent), the Customer Reporting Service, and finally the legacy Authentication Service (most coupled, most risky).

Each service migration followed the same pattern:

  1. Write integration tests against the existing service behaviour
  2. Implement the new .NET 8 service to pass those tests
  3. Deploy behind a feature flag at 1% traffic
  4. Ramp traffic to 10%, 25%, 50%, 100% over one week, monitoring error rates at each step
  5. Decommission the old service

The feature flag ramp was the mechanism that made zero-downtime migration possible. At any step, if metrics degraded, we ramped back — without a deployment event.
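The routing check itself is simple. Below is an illustrative sketch of how a cutover endpoint can consult a percentage-based flag, assuming Microsoft.FeatureManagement with its built-in Microsoft.Percentage filter (older library versions need it registered explicitly via AddFeatureFilter<PercentageFilter>()); the flag name, request type, and old/new service abstractions are made up for the example:

// Flag definition (Azure App Configuration or appsettings.json): the built-in
// percentage filter admits roughly the configured share of requests.
//
// "FeatureManagement": {
//   "UseNewBookingService": {
//     "EnabledFor": [
//       { "Name": "Microsoft.Percentage", "Parameters": { "Value": 10 } }
//     ]
//   }
// }

app.MapPost("/bookings", async (
    BookingRequest req,
    IFeatureManager features,
    ILegacyBookingClient legacyService,   // proxies to the pre-migration service
    IBookingService newService,           // the migrated .NET 8 service
    CancellationToken ct) =>
{
    // Ramp control: raising the percentage in App Configuration shifts traffic
    // to the new service without a deployment; lowering it ramps back.
    if (await features.IsEnabledAsync("UseNewBookingService"))
    {
        return Results.Ok(await newService.CreateAsync(req, ct));
    }
    return Results.Ok(await legacyService.CreateAsync(req, ct));
});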

The Records adoption across the codebase was a separate thread of work — systematic replacement of mutable classes with records for immutable domain models:

// Old: mutable class with property setters
public class RouteRequest
{
    public string OriginPort { get; set; }
    public string DestinationPort { get; set; }
    public string ContainerType { get; set; }
    public DateTime RequiredBy { get; set; }
}
 
// New: immutable record with positional properties
public record RouteRequest(
    string OriginPort,
    string DestinationPort,
    string ContainerType,
    DateTime RequiredBy
)
{
    // Records support with-expressions for non-destructive mutation
    public RouteRequest WithRequiredBy(DateTime requiredBy) => this with { RequiredBy = requiredBy };
}

Records provide structural equality (two records with the same field values are equal), immutability by default, and the with expression syntax for creating modified copies without mutation. In a domain model with extensive use of value objects representing routes, rates, and booking parameters, this eliminated an entire category of equality-related bugs.
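A compressed illustration of those two properties (the port codes and dates are made up):

var a = new RouteRequest("SGSIN", "NLRTM", "40HC", new DateTime(2024, 1, 15));
var b = new RouteRequest("SGSIN", "NLRTM", "40HC", new DateTime(2024, 1, 15));

Console.WriteLine(a == b);                                   // True: field values compared, not references
Console.WriteLine((a with { ContainerType = "20GP" }) == b); // False: one field differs

// Non-destructive mutation: a is untouched, c is a modified copy
var c = a with { RequiredBy = a.RequiredBy.AddDays(7) };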

Key Performance Findings from Profiling

dotnet-trace revealed the actual bottlenecks, which differed significantly from the intuitive guesses:

Bottleneck 1 (73% of latency): Synchronous blocking carrier HTTP calls. Expected. Fixed by the async rewrite in Phase 2.

Bottleneck 2 (12% of latency): JSON serialisation of large carrier response objects using Newtonsoft.Json. Switched to System.Text.Json with a pre-compiled (source-generated) JsonSerializerContext, which avoids runtime reflection and cuts per-call allocations. Reduced serialisation overhead by 78%.
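The pattern looks roughly like this; CarrierRateResponse and its properties are illustrative stand-ins for the real carrier payload types:

using System.Text.Json;
using System.Text.Json.Serialization;

// Illustrative response shape; the real carrier payloads are considerably larger
public record CarrierRateResponse(string CarrierCode, decimal Amount, string Currency);

// The source generator emits serialisation code at compile time, avoiding
// runtime reflection and reducing per-call allocations
[JsonSourceGenerationOptions(PropertyNamingPolicy = JsonKnownNamingPolicy.CamelCase)]
[JsonSerializable(typeof(CarrierRateResponse))]
[JsonSerializable(typeof(List<CarrierRateResponse>))]
public partial class CarrierJsonContext : JsonSerializerContext
{
}

// Usage: pass the generated type info instead of letting the serialiser reflect
// var json = JsonSerializer.Serialize(rates, CarrierJsonContext.Default.ListCarrierRateResponse);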

Bottleneck 3 (8% of latency): Entity Framework Core N+1 queries in the booking confirmation flow. One query per shipment leg rather than a single query with .Include(). Fixed with explicit eager loading — 14 queries → 1 query, 340ms → 22ms for booking confirmation.
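The fix is a single eager-loading query; a sketch with assumed entity and context names (FreightDbContext, Booking, Legs, Carrier) rather than the real Gamasuite model:

using Microsoft.EntityFrameworkCore;

// One round trip that loads the booking, its legs, and each leg's carrier together,
// replacing the one-query-per-leg pattern the profiler surfaced
public async Task<Booking> LoadBookingForConfirmationAsync(
    FreightDbContext db, Guid bookingId, CancellationToken ct)
{
    return await db.Bookings
        .Include(b => b.Legs)
            .ThenInclude(l => l.Carrier)
        .AsNoTracking()               // read-only confirmation view, no change tracking needed
        .SingleAsync(b => b.Id == bookingId, ct);
}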

Bottleneck 4 (5% of latency): Dictionary keys with broken hashing forcing linear search. A domain model class was used as a Dictionary key with value-based Equals() but no matching GetHashCode() override, so value-equal keys hashed to different buckets, lookups missed, and the code fell back to a linear scan. Fixed by converting the class to a record (which generates consistent value-based Equals() and GetHashCode() automatically).
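In miniature, with a made-up RouteKey type:

var cache = new Dictionary<RouteKey, decimal>
{
    [new RouteKey("SGSIN", "NLRTM", "40HC")] = 1850m
};

// A freshly constructed, value-equal key now finds the cached entry in O(1),
// because the record generates matching value-based Equals() and GetHashCode()
var hit = cache.TryGetValue(new RouteKey("SGSIN", "NLRTM", "40HC"), out var rate); // hit == true

public record RouteKey(string OriginPort, string DestinationPort, string ContainerType);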

The N+1 query issue is worth dwelling on: it was not obvious from code review, invisible in unit tests, and only detectable by profiling with real production data volumes. This is why profiling before optimising is non-negotiable — intuition about performance bottlenecks is frequently wrong.

Key Metrics After Full Migration

| Metric | Before (.NET Core 2) | After (.NET 8) |
|--------|---------------------|----------------|
| Avg quote latency | 2,800ms | 480ms |
| P95 quote latency | 4,200ms | 820ms |
| Memory per instance | 1.2 GB | 380 MB |
| Cold start (App Service) | 12s | 3s |
| Concurrent requests/node | ~800 | ~3,200 |
| Test coverage | 31% | 78% |

The memory reduction from 1.2GB to 380MB per instance was the most commercially significant metric — it allowed us to run 3x more instances on the same Azure App Service plan, reducing infrastructure cost while handling higher load.

Lessons Learned

1. Profile before you optimise — we used dotnet-trace and Application Insights to find the real bottlenecks. The N+1 query issue (8% of latency, relatively easy to fix) would have been overlooked without profiling.

2. Async all the way down — partial async caused deadlocks in transitional code during migration. When a synchronous method called an async method and blocked on .Result, it deadlocked under load. The rule: once you start making a call chain async, every caller in the chain must be async too.
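A condensed version of the trap (the service and type names here are illustrative):

// Transitional anti-pattern: a still-synchronous caller blocking on async work.
// .Result ties up the calling thread until the task finishes; under load, enough
// blocked threads exhaust the pool and requests stall or deadlock.
public QuoteResult GetQuote(RouteRequest request)
{
    return _rateEngine.GetQuoteAsync(request, CancellationToken.None).Result;
}

// The fix: make the caller async too and await, all the way up the chain
public async Task<QuoteResult> GetQuoteAsync(RouteRequest request, CancellationToken ct = default)
{
    return await _rateEngine.GetQuoteAsync(request, ct);
}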

3. Feature flags for gradual rollout — Azure App Configuration with percentage-based traffic routing was the safety net that made the migration safe. Every service cutover went through the 1% → 10% → 25% → 50% → 100% ramp. This approach requires investment in observability (to detect regressions at low traffic percentages) and investment in writing code that handles requests routing to either old or new services correctly during the transition.

4. Test coverage before migration, not after — raising test coverage to 65% before the first migration code was the single most important risk-reduction step. It provided the regression detection that made rapid iteration safe.

5. Migration is a team skill, not a solo effort — the migration maintained velocity because the entire team understood the strangler fig pattern and the feature flag ramp approach. Engineers who understand why each step exists make better decisions when edge cases arise.


Muhammad Moid Shams is a Lead Software Engineer specialising in .NET, Azure, and enterprise SaaS platform engineering.