Database Replication in Payment Systems
What every backend engineer should know about primary-replica architecture, and where it breaks

Database replication is one of those concepts that appears simple until you're staring at a failed primary at 2am with live payment traffic hitting a dead machine.
Most engineers understand the basic idea: one primary database handles writes, one or more replicas handle reads, and data is continuously copied from the primary to the replicas. What's less understood is the nuance: replication lag, the promotion problem, and how your application code must account for both.
This article covers all of it, with fintech-specific examples throughout.
Why replication exists
A single database instance is both a performance bottleneck and a single point of failure. At any meaningful scale, these two facts become urgent problems simultaneously.
In a typical payment platform, database traffic breaks down roughly as:
95% reads — balance checks, transaction history, KYC status lookups, fraud checks
5% writes — new transactions, status updates, user profile changes
Running all of this through a single database means reads and writes compete for the same CPU, memory, and disk I/O. A heavy compliance reporting job scanning millions of rows can starve your live payment processing of resources. That's unacceptable in production.
Replication separates these concerns. Writes go to one machine. Reads distribute across others. Your primary database focuses exclusively on what only it can do: accept new data.
How primary-replica replication works
The primary database maintains a write-ahead log (WAL), a sequential record of every change made to the data. Replicas connect to the primary and continuously stream this log, applying each change in order to maintain an identical copy of the data.
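The log-streaming idea can be illustrated with a small in-memory sketch. The class names (LogEntry, PrimaryNode, ReplicaNode) are hypothetical and stand in for real database machinery; the only point is that the replica rebuilds its state by applying log entries strictly in order, and may be behind the primary at any given moment.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** One change in the log: "set key to value" at a given log position. */
final class LogEntry {
    final long position;
    final String key;
    final String value;

    LogEntry(long position, String key, String value) {
        this.position = position;
        this.key = key;
        this.value = value;
    }
}

/** Hypothetical in-memory primary: each write is logged, then applied. */
class PrimaryNode {
    private final List<LogEntry> log = new ArrayList<>();
    private final Map<String, String> data = new HashMap<>();

    void write(String key, String value) {
        log.add(new LogEntry(log.size() + 1, key, value)); // positions start at 1
        data.put(key, value);
    }

    /** Entries whose position is greater than afterPosition, in log order. */
    List<LogEntry> entriesAfter(long afterPosition) {
        return new ArrayList<>(log.subList((int) afterPosition, log.size()));
    }

    String read(String key) {
        return data.get(key);
    }
}

/** Hypothetical replica: pulls entries it has not seen and applies them in order. */
class ReplicaNode {
    private final Map<String, String> data = new HashMap<>();
    private long lastApplied = 0;

    /** Apply at most maxEntries new entries; a small limit simulates a slow replica. */
    void streamFrom(PrimaryNode primary, int maxEntries) {
        List<LogEntry> pending = primary.entriesAfter(lastApplied);
        int limit = Math.min(maxEntries, pending.size());
        for (LogEntry entry : pending.subList(0, limit)) {
            data.put(entry.key, entry.value);
            lastApplied = entry.position;
        }
    }

    String read(String key) {
        return data.get(key);
    }

    long lastApplied() {
        return lastApplied;
    }
}
```

If the primary records a transaction as PENDING and then SETTLED, a replica that has applied only the first entry still reports PENDING. That gap between the two log positions is the replication lag discussed later in this article.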
Your application is responsible for routing correctly:
All INSERT, UPDATE, DELETE operations go to the primary
SELECT queries go to replicas
The database itself does not enforce this routing. It is entirely your application's responsibility. Misconfigured routing, whether it sends all traffic to the primary or inadvertently sends writes to a replica, is a common source of production database problems.
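As a rough illustration of that routing rule, here is a minimal sketch in plain Java. StatementRouter and the node names are hypothetical, and classifying SQL by its leading keyword is a deliberate simplification; real applications usually route at the transaction or datasource level rather than by inspecting SQL text.

```java
import java.util.List;
import java.util.Locale;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical router: plain SELECTs rotate across replicas, everything else hits the primary. */
class StatementRouter {
    private final String primary;
    private final List<String> replicas;
    private final AtomicInteger next = new AtomicInteger();

    StatementRouter(String primary, List<String> replicas) {
        this.primary = primary;
        this.replicas = List.copyOf(replicas);
    }

    String targetFor(String sql) {
        String normalized = sql.trim().toUpperCase(Locale.ROOT);
        // Naive check: SELECT ... FOR UPDATE takes row locks, so treat it as a write.
        boolean plainSelect = normalized.startsWith("SELECT")
                && !normalized.contains("FOR UPDATE");
        // Safe default: anything we cannot prove is a read goes to the primary.
        if (!plainSelect || replicas.isEmpty()) {
            return primary;
        }
        int index = Math.floorMod(next.getAndIncrement(), replicas.size());
        return replicas.get(index);
    }
}
```

The design choice worth noting is the default: when in doubt, the router falls back to the primary. A read sent to the primary costs some capacity; a write sent to a replica is a correctness problem.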
Replication lag — the silent danger in payment systems
Replication is asynchronous by default. The primary writes data to its own disk, confirms success to your application, and then streams the change to replicas afterwards. There is always a delay between a write landing on the primary and that same write appearing on the replicas.
Under normal conditions this lag is milliseconds. Under heavy load it can grow to seconds. This creates a dangerous pattern:
User initiates a ₦50,000 transfer
Write hits primary — transaction recorded successfully
User's app immediately fetches transaction history
Read is routed to a replica — replication hasn't caught up yet
Transfer doesn't appear in history
User assumes the payment failed and retries
Double payment
The fix: read-your-own-writes consistency
Any read that occurs within the same user session as a recent write must go to the primary, not a replica. The primary is guaranteed to have the latest data. Replicas are not.
In practice this means historical reads (transaction history from last week, account reports, dashboard analytics) can safely go to replicas. Reads that immediately follow a write (fetching a transaction you just created, checking a balance you just updated) must go to the primary.
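In Spring Boot with JPA, a common way to implement this is to annotate read-only operations with @Transactional(readOnly = true) and configure a routing datasource that sends those transactions to the replica, while operations that require fresh data run under a plain @Transactional and stay on the primary datasource. The annotation on its own does not change where a query runs; the routing has to be configured.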
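Independent of the framework, the read-your-own-writes rule can be sketched in plain Java. ReadYourWritesRouter and the pin window are assumptions made for illustration: after a user writes, that user's reads are pinned to the primary for a fixed period, which should be chosen to exceed the worst replication lag you expect. Timestamps are passed in explicitly so the behaviour is easy to test.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical router that pins a user's reads to the primary shortly after they write. */
class ReadYourWritesRouter {
    private final long pinMillis;
    private final Map<String, Long> lastWriteAt = new ConcurrentHashMap<>();

    ReadYourWritesRouter(long pinMillis) {
        this.pinMillis = pinMillis;
    }

    /** Call after every successful write made on behalf of this user. */
    void recordWrite(String userId, long nowMillis) {
        lastWriteAt.put(userId, nowMillis);
    }

    /** Returns "primary" while the user's last write is younger than the pin window. */
    String targetForRead(String userId, long nowMillis) {
        Long wroteAt = lastWriteAt.get(userId);
        boolean recentlyWrote = wroteAt != null && nowMillis - wroteAt < pinMillis;
        return recentlyWrote ? "primary" : "replica";
    }
}
```

With a 5-second window, a user who submits a transfer and immediately opens their history is served from the primary, while other users continue to read from replicas. A time window is only a heuristic: if lag exceeds the window, stale reads are still possible, so a stricter variant would compare log positions instead of timestamps.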
Asynchronous vs synchronous replication
Most replicas use asynchronous replication — the primary confirms success to your application before the replica has received the data. Fast, but carries the risk of data loss if the primary dies before the replica catches up.
Synchronous replication works differently. The primary sends each write to the standby and waits for confirmation that the standby has written it to disk before acknowledging success to your application. Both machines have the data before your code moves to the next line.
The tradeoff:
Asynchronous: faster writes, possible data loss on primary failure
Synchronous: roughly 5-10ms added to every write (the exact cost depends on the network round trip to the standby), no loss of acknowledged writes on primary failure
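The difference between the two modes comes down to when the caller is told "success". The following in-memory sketch makes that visible; ReplicatedLog is a hypothetical class, and the standby is modelled as a simple list rather than a real network peer.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical model of one primary and one standby in sync or async mode. */
class ReplicatedLog {
    private final boolean synchronous;
    private final List<String> primaryLog = new ArrayList<>();
    private final List<String> standbyLog = new ArrayList<>();
    private final Deque<String> inFlight = new ArrayDeque<>();

    ReplicatedLog(boolean synchronous) {
        this.synchronous = synchronous;
    }

    /** When this method returns, the caller has been told the write succeeded. */
    void commit(String txnId) {
        primaryLog.add(txnId);
        if (synchronous) {
            standbyLog.add(txnId); // standby has the write before we acknowledge
        } else {
            inFlight.add(txnId);   // acknowledge now, ship to the standby later
        }
    }

    /** Simulates background shipping catching up (only relevant in async mode). */
    void shipPending() {
        while (!inFlight.isEmpty()) {
            standbyLog.add(inFlight.poll());
        }
    }

    /** Acknowledged writes that exist only on the primary (assumes unique txn ids). */
    List<String> lostIfPrimaryDiesNow() {
        List<String> lost = new ArrayList<>(primaryLog);
        lost.removeAll(standbyLog);
        return lost;
    }
}
```

In asynchronous mode, any transaction committed since the last shipment shows up in lostIfPrimaryDiesNow(). In synchronous mode that list is always empty, which is the property the latency cost buys.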
For payment systems, the correct answer is synchronous replication for your high-availability standby. A few milliseconds per write is an acceptable cost when the alternative is losing financial transactions on failover. Use asynchronous replication for the read replicas that serve reporting and analytics, where data that is a few milliseconds stale is fine.
What actually happens when your primary goes offline
When a primary database fails, the recovery path is promotion — elevating a replica to become the new primary and redirecting all write traffic to it. This sounds simple. In production, it is not.
The promotion problem
At the exact moment of primary failure, your replica is almost certainly behind. The gap might be five milliseconds. It might be two seconds if replication was under stress. That gap represents real transactions — real money movements — that exist only on the dead primary.
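One way to put a number on that gap is to compare log positions. The sketch below assumes PostgreSQL 10 or later, where pg_current_wal_lsn() on the primary and pg_last_wal_replay_lsn() on a replica return positions in the textual form "16/B374D848" (two hexadecimal halves of a 64-bit byte offset). WalLag is a hypothetical helper; other databases expose replication position differently.

```java
/** Hypothetical helper for comparing PostgreSQL-style log sequence numbers (LSNs). */
class WalLag {
    /** Converts an LSN such as "16/B374D848" into an absolute byte position. */
    static long parseLsn(String lsn) {
        String[] parts = lsn.trim().split("/");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Not an LSN: " + lsn);
        }
        long high = Long.parseLong(parts[0], 16); // upper 32 bits
        long low = Long.parseLong(parts[1], 16);  // lower 32 bits
        return (high << 32) | low;
    }

    /** Bytes of log the replica has not yet replayed; 0 means fully caught up. */
    static long lagBytes(String primaryLsn, String replicaLsn) {
        return parseLsn(primaryLsn) - parseLsn(replicaLsn);
    }
}
```

For example, WalLag.lagBytes("0/3000060", "0/3000000") is 96 bytes. The primary's last known position has to be sampled while it is still alive, so monitoring needs to record it continuously rather than query it during the incident.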
A responsible promotion sequence requires:
1. Detect that the primary is unreachable: not just slow, but genuinely offline
2. Assess replication lag: how many transactions are missing on the replica?
3. Attempt log recovery: if the primary's disk is accessible, extract the missing WAL entries and apply them to the replica before promoting
4. Promote the replica: update application configuration to send writes to the new primary
5. Spin up a new replica: the newly promoted primary now has no standby of its own and is a single point of failure again until one is provisioned
Steps 2 and 3 are where most organisations run into trouble. In a fire scenario, engineers are under pressure to restore service as quickly as possible. Skipping log recovery to speed up promotion means accepting potential data loss. That decision needs to be made in advance — not at 2am during an incident.
This is the core reason to use synchronous replication for your standby. With synchronous replication, step 3 becomes unnecessary because the standby has already received every write the primary acknowledged. Promotion is clean and fast, and no acknowledged transaction is lost.
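The promotion sequence described above can be written down as a decision procedure, which is one way to make the data-loss decision in advance rather than during the incident. This is a sketch under stated assumptions: PromotionPlanner, its parameters, and the step labels are all hypothetical, and a real runbook would involve far more checks (fencing the old primary, for instance).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical planner that turns incident facts into an ordered list of steps. */
class PromotionPlanner {
    static List<String> plan(int failedHealthChecks,
                             int failureThreshold,
                             boolean standbyIsSynchronous,
                             long lagBytes,
                             boolean oldPrimaryDiskReadable) {
        List<String> steps = new ArrayList<>();

        // Step 1: do not promote on a single missed check; the primary may only be slow.
        if (failedHealthChecks < failureThreshold) {
            steps.add("WAIT");
            return steps;
        }

        // Steps 2 and 3: only an asynchronous standby can be missing acknowledged writes.
        if (!standbyIsSynchronous && lagBytes > 0) {
            if (oldPrimaryDiskReadable) {
                steps.add("RECOVER_LOG");
            } else {
                steps.add("ACCEPT_LOSS"); // must be a pre-agreed policy, not an ad hoc call
            }
        }

        // Steps 4 and 5: promote, repoint writes, then restore redundancy.
        steps.add("PROMOTE_REPLICA");
        steps.add("REPOINT_WRITES");
        steps.add("PROVISION_NEW_REPLICA");
        return steps;
    }
}
```

With a synchronous standby the plan collapses to the last three steps. With an asynchronous one, the first step is either log recovery or an explicit, recorded acceptance of loss.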
Replication is not a backup
This distinction matters enormously and is frequently misunderstood.
A replica is a live mirror of your primary. It replicates everything, including mistakes. If someone executes a destructive query against your primary (say, a DELETE with a missing WHERE clause), that deletion replicates to every replica within milliseconds. Your replicas are now also corrupted.
A backup is a point-in-time snapshot of your data. Combined with archived logs, it lets you restore to a historical state from before the bad query ran. Backups and replication serve different purposes:
Replication: survive infrastructure failure (a server dies)
Backups: survive human error (bad data was written)
A production payment system needs both. Replication for availability. Automated point-in-time backups for recoverability. Never treat one as a substitute for the other.
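To make the recovery side concrete: a point-in-time restore typically starts from the most recent backup taken before the bad write, then replays archived log entries up to just before that moment. The sketch below covers only the first half, choosing the base backup. PointInTimeRestore is a hypothetical helper, and actual restore procedures depend on the database and backup tooling in use.

```java
import java.time.Instant;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper for picking the starting point of a point-in-time restore. */
class PointInTimeRestore {
    /**
     * Returns the latest backup taken at or before restoreTo, or empty if every
     * backup is newer than the requested moment (no restore is possible).
     */
    static Optional<Instant> chooseBaseBackup(List<Instant> backupTimes, Instant restoreTo) {
        return backupTimes.stream()
                .filter(taken -> !taken.isAfter(restoreTo))
                .max(Instant::compareTo);
    }
}
```

Given nightly backups and a bad query at 14:30, the restore target is a moment just before 14:30 and the base backup is the one from the preceding night. A backup taken after the bad query is useless for this purpose, because it already contains the damage.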
Summary
Primary-replica replication separates read and write traffic for performance and reliability
Replication lag means reads immediately after writes must go to the primary, not replicas
Use synchronous replication for your high-availability standby, asynchronous for read replicas
Promotion is more complex than it sounds — always assess lag and attempt log recovery before promoting
Replication is not a backup: you need both, and they solve different problems

