10 minute read

11 min read 2312 words

In a monolithic application, maintaining data consistency is straightforward. Database operations are typically wrapped in a single transaction, ensuring that any failure triggers an automatic rollback of all changes.

However, in a Microservices architecture, each service manages its own database. In an E-commerce system, the Order Service, Payment Service, and Inventory Service operate independently.

How do you make sure that if the payment fails, the order is cancelled and the inventory is put back? This is where the SAGA Pattern comes in.


1. The Challenges of Distributed Transactions

In a monolithic world, we rely on ACID transactions. In a distributed world (Microservices), we historically attempted to use the Two-Phase Commit (2PC) protocol to maintain that same level of consistency.

1.1 Quick Recall: What is ACID?

Before we dive into the challenges of distributed systems, let’s briefly recall what we are trying to achieve. In a local database, a transaction must satisfy the ACID properties:

  1. Atomicity: All operations in a transaction succeed, or none do. It is an “all or nothing” deal.
  2. Consistency: A transaction takes the database from one valid state to another, maintaining all predefined rules (like unique keys or foreign keys).
  3. Isolation: Concurrent transactions do not interfere with each other. The result of running multiple transactions simultaneously should be the same as running them one after another.
  4. Durability: Once a transaction is committed, it remains committed even in the event of a system failure (like a power outage).

A Classic Example: Bank Transfer

Imagine transferring $100 from Account A to Account B:

  • Atomicity: $100 is deducted from A AND added to B. If either part fails, neither happens.
  • Consistency: The total sum of money in the bank remains the same before and after the transfer.
  • Isolation: If someone else checks the balance of A or B while the transfer is in progress, they should not see a state where the money has left A but not yet arrived at B.
  • Durability: Once you receive the “Transfer Successful” message, the $100 won’t “disappear” if the database server crashes.

1.2 The ACID Guarantee: Automatic or Not?

A common question for beginners is whether ACID is “automatic” in a Relational Database Management System (RDBMS). The answer is both: it is built into the design of the database engine, but it requires implementation work from the developer.

  • By Design: Modern RDBMS engines (like SQL Server, PostgreSQL, or MySQL with InnoDB) are designed with the complex machinery—such as Write-Ahead Logs (WAL) and locking mechanisms—required to enforce Atomicity, Consistency, Isolation, and Durability.
  • By Implementation: As a developer, you must explicitly define Transaction Boundaries. If you execute five separate SQL commands without wrapping them in a transaction block (e.g., using (var transaction = ...) in C#), the database treats each one as an independent transaction. If the third command fails, the first two remain committed. To achieve true ACID for a set of operations, you must group them into a single transaction.

Once you move to Microservices, this local ACID guarantee is lost across service boundaries because each service has its own independent database. This leads us to the historical solution: 2PC.

1.3 Understanding Two-Phase Commit (2PC)

Think of 2PC as a strict board meeting where a decision only passes if there is 100% consensus. There are two roles: the Coordinator (the chair) and the Participants (the board members).

Phase 1: The Voting Phase (Prepare)

The Coordinator sends a “Prepare” request to every service involved. Each service checks its own local database: “Can I perform this change?”

  • If a service can, it locks the necessary data and votes Yes.
  • If it cannot (e.g., a database error or validation failure), it votes No.

Phase 2: The Decision Phase (Commit/Abort)

The Coordinator collects all the votes:

  • If everyone voted Yes: The Coordinator sends a “Commit” command. All services finalize their changes and release their locks.
  • If anyone voted No (or failed to respond): The Coordinator sends an “Abort” command. All services discard their changes and release their locks.

1.4 Why we don’t use 2PC in Microservices

While 2PC sounds perfect on paper, it has significant drawbacks in modern architectures:

  1. The Blocking Problem: During the entire process, data is locked. If the Coordinator crashes after Phase 1, the Participants are left “hanging” and cannot release their locks, potentially paralyzing the system.
  2. High Latency: The transaction is only as fast as the slowest service. In a network-heavy environment, this leads to performance bottlenecks.
  3. Scalability & Coupling: 2PC requires all services to be available and responsive at the exact same time. It forces independent services to act as one large, fragile unit.

2. What is a SAGA?

A SAGA is a sequence of local transactions. Each local transaction updates the database and publishes a message or event to trigger the next local transaction in the SAGA.

If one of the steps fails, the SAGA executes a series of Compensating Transactions to undo the changes made by the preceding steps.

The Golden Rule:

A SAGA is not a single transaction. It is a “Eventually Consistent” workflow.


3. The Two Flavors of SAGA

There are two main ways to coordinate a SAGA: Choreography and Orchestration.

3.1 Choreography (Event-Based)

In this approach, there is no central “leader” or “conductor.” Instead, each service involved in the business process acts independently and reacts to events from other services.

  • How it works: Each service completes a local transaction and publishes an event to a message broker. Other services subscribe to those events and perform their own local transactions in response.
  • The Chain of Events: Service A (Order) publishes OrderCreated. Service B (Payment) reacts to OrderCreated and publishes PaymentCompleted. Service C (Inventory) reacts to PaymentCompleted and so on.
  • Pros:
    • Simple Implementation: Easy to add new participants by simply having them subscribe to existing events.
    • Loose Coupling: Services only need to know about the events, not about the other services’ internal logic.
    • Decentralized: No single point of failure for coordination.
  • Cons:
    • Complexity at Scale: It becomes difficult to visualize the entire workflow as the number of events increases.
    • Cyclic Dependencies: Risk of services accidentally creating infinite event loops.
    • Lack of Central State: Tracking the current status of a specific Saga (e.g., “Where is Order #123?”) is challenging without querying every service.

3.1.1 Rollback in Choreography (Compensating Events)

In a Choreography Saga, failure is just another event. Since there is no central orchestrator to trigger a rollback, the process is decentralized and relies on Compensating Events.

  1. Failure as a First-Class Event: When a service cannot complete its local transaction (e.g., the Payment Service finds insufficient funds), it publishes a “Failure” event (like PaymentFailed).
  2. Reverse Flow of Responsibility: Every service that completed a successful local transaction earlier in the chain must listen for these failure events. For instance, the Order Service must subscribe to PaymentFailed to update the order status to Cancelled.
  3. Compensating Transactions: Each service is responsible for its own “undo” logic. The Payment Service doesn’t tell the Order Service what to do; the Order Service knows what to do because it hears the PaymentFailed event.
  4. Implicit Knowledge: This is the biggest challenge of Choreography. Each service must “know” which events from which downstream services indicate a failure that requires it to roll back its own local changes.

3.2 Orchestration (Command-Based)

This approach is analogous to a conducted orchestra. A central Orchestrator manages the workflow and directs participants.

  • How it works: A central “Saga State Machine” tells Service A, “Do your job.” Once Service A reports back, the Orchestrator tells Service B, “Now it’s your turn.”
  • Pros: Easy to see the whole workflow in one place. Better for complex logic.
  • Cons: The Orchestrator can become a “God Object” if not careful.

4. Real-World Example: Ordering a Pizza

Let’s see how a SAGA handles an order:

  1. Order Service: Creates an order in Pending state. (Step 1)
  2. Payment Service: Charges the customer’s card. (Step 2)
  3. Inventory Service: Reserves the pizza ingredients. (Step 3)

The Successful Path (Choreography Style)

  1. Order Service publishes OrderCreated.
  2. Payment Service reacts to OrderCreated, charges the card, and publishes PaymentCompleted.
  3. Inventory Service reacts to PaymentCompleted and reserves the stock.

The Rollback Path (Choreography Style)

If the required inventory is unavailable, the SAGA must initiate a decentralized rollback:

  1. Inventory Service publishes a StockFailed event.
  2. Payment Service (which previously processed the payment) is listening for StockFailed. It performs a Refund (its compensating transaction).
  3. Order Service is also listening for StockFailed. It receives the failure notification and updates the order status to Cancelled.

5. Implementing SAGA in .NET

While you can build a SAGA from scratch using raw Messaging (like RabbitMQ or Azure Service Bus), it is much easier to use a library that handles the heavy lifting.

The Industry Standard: MassTransit

In the .NET ecosystem, MassTransit is a widely adopted framework for implementing Sagas. It provides a robust State Machine feature (via the Automatonymous library) that allows for defining Saga logic in a structured and maintainable manner.

// Example of a simple Saga State Machine definition
public class OrderSaga : MassTransitStateMachine<OrderState>
{
    public OrderSaga()
    {
        InstanceState(x => x.CurrentState);

        // When an Order is Submitted, go to 'Submitted' state and ask Payment Service
        Initially(
            When(OrderSubmitted)
                .Then(context => context.Saga.OrderId = context.Message.OrderId)
                .TransitionTo(Submitted)
                .Publish(context => new AuthorizePayment { OrderId = context.Saga.OrderId })
        );

        // If Payment is Accepted, move to 'Paid' and tell Inventory to reserve stock
        During(Submitted,
            When(PaymentAccepted)
                .TransitionTo(Paid)
                .Publish(context => new ReserveStock { OrderId = context.Saga.OrderId })
        );
    }
}

6. Observability: Correlation ID and Trace ID

In a distributed Saga, a single business process (like an order) spans multiple services and databases. This makes debugging and monitoring extremely difficult without the right tools. Two essential concepts for managing this complexity are Correlation IDs and Trace IDs.

A Correlation ID is a unique identifier assigned to a specific Saga instance. It links all related messages, events, and database records together across different microservices.

  • Purpose: To group all operations related to one business transaction and ensure system reliability.
  • Key Uses:
    • Observability & Tracing: Allows developers to follow a business process across multiple services and log files.
    • Idempotency & Deduplication: Prevents a service from processing the same message multiple times. If a network retry causes a message to be delivered twice, the service can use the Correlation ID to recognize it has already handled that specific request.
    • Business Auditability: Provides a verifiable trail of all steps taken during a Saga, which is essential for debugging customer issues and meeting compliance requirements.
  • Example: When Order #999 is created, the Order Service generates a Correlation ID (e.g., CORR-XYZ-123). This ID is included in every message (PaymentAccepted, StockReserved) and every log entry.
  • Why it matters: It transforms a collection of disconnected logs into a coherent story of a business transaction.

6.2 Trace ID (The Technical Path)

A Trace ID is part of Distributed Tracing (often using standards like OpenTelemetry). While a Correlation ID links business steps, a Trace ID tracks the technical execution path of a single request through the system.

  • Purpose: To measure performance, identify bottlenecks, and see the exact flow of a network request.
  • Example: When a user clicks “Submit Order,” a Trace ID is generated. As that request travels from the API Gateway to the Order Service, and then as an event to the Payment Service, the Trace ID remains the same.
  • Why it matters: It allows you to see a “Gantt chart” style view of where time was spent (e.g., “The Payment Service took 2 seconds because of a slow database query”).

7. Summary: When to use SAGA?

You should use the SAGA pattern when:

  • You need data consistency across multiple microservices.
  • You can’t use a single distributed transaction (2PC).
  • Your business process takes time (Seconds, Minutes, or even Days).

8. Further Reading & References

For those who want to dive deeper into the SAGA pattern and its implementation in the .NET ecosystem, the following resources are highly recommended:


Conclusion: The SAGA pattern is an essential tool for managing complex business logic and consistency in distributed systems. By effectively using either Choreography or Orchestration and ensuring strong observability through Correlation and Trace IDs, you can build resilient and maintainable microservices.

Leave a comment