
Add parallel row copy with support for consistent checkpoints (no gaps) #1727

Open
olegkv wants to merge 7 commits into github:master from olegkv:parallel-copy-olegkv


olegkv commented Jun 24, 2026


Related issue: #193

Description

Parallel-Copy Feature Summary
Instead of copying table rows one chunk at a time, parallel-copy runs multiple worker goroutines that execute chunk INSERTs concurrently. It is useful primarily for large tables whose workload drops periodically, e.g. during nighttime activity pauses. In those windows, parallel-copy copies rows faster by using all available resources, while reusing the existing throttling.
The issues mentioned in the original issue are no longer critical: MySQL 8+ handles concurrency much better than MySQL 5.7.

Performance
Measured while MySQL was mostly idle (imitating a nighttime workload drop), with innodb-doublewrite=OFF, sync-binlog=1, innodb-flush-log-at-trx-commit=1, on a small table with an auto-increment PK (3 columns, 2M rows, chunk_size=1000):
serial mode 55s
1 worker 55s (still useful, because the copy runs in parallel with the DML applier)
2 workers 40s (x1.38)
4 workers 27s (x2)
8 workers 16s (x3.4)
16 workers 11s (x5)
32 workers 8s (x6.9)

New flags
--parallel-copy (requires --checkpoint)
--parallel-copy-workers: number of workers (default 4, max 64, recommended ≤16)
--parallel-copy-max-heartbeatlag-millis N: pause workers if gh-ost's own DML applier (HeartbeatLag) falls behind
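
The constraints on these flags can be sketched as a small validation step. This is a minimal illustration, not the code in this PR: the `ParallelCopyConfig` struct, its field names, and the `Validate` method are hypothetical, and the lower bound of 1 worker is an assumption (the description only states the default and the maximum).

```go
package main

import (
	"errors"
	"fmt"
)

// ParallelCopyConfig is a hypothetical container for the flags listed above.
// It is not gh-ost's actual configuration type.
type ParallelCopyConfig struct {
	Enabled               bool  // --parallel-copy
	Checkpoint            bool  // --checkpoint
	Workers               int   // --parallel-copy-workers
	MaxHeartbeatLagMillis int64 // --parallel-copy-max-heartbeatlag-millis
}

const (
	defaultWorkers = 4  // default stated in the PR description
	maxWorkers     = 64 // maximum stated in the PR description
)

// Validate applies the documented constraints. When parallel copy is
// disabled, nothing is checked, so serial mode is unaffected.
func (c *ParallelCopyConfig) Validate() error {
	if !c.Enabled {
		return nil
	}
	if !c.Checkpoint {
		return errors.New("--parallel-copy requires --checkpoint")
	}
	if c.Workers == 0 {
		c.Workers = defaultWorkers
	}
	// Assumption: at least one worker is required.
	if c.Workers < 1 || c.Workers > maxWorkers {
		return fmt.Errorf("--parallel-copy-workers must be between 1 and %d, got %d", maxWorkers, c.Workers)
	}
	return nil
}

func main() {
	cfg := ParallelCopyConfig{Enabled: true, Checkpoint: true}
	if err := cfg.Validate(); err != nil {
		fmt.Println("invalid:", err)
		return
	}
	fmt.Println("workers:", cfg.Workers)
}
```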

Implementation
The parallel-copy feature is implemented as a series of additive guards of the form if ParallelCopy { … } else { }, so the existing logic remains in place and is exercised identically when the flag is off. Running without the flag (the default) produces the same execution as before. Every new branch either gates on {ParallelCopy = true} or replicates the original serial code in an {else}. I tried to make the change as additive and non-invasive as possible.
The applier should still have priority over row copy, as it does in serial mode.
To preserve that, parallel-copy adds one more throttle check on the applier's own processing rate, using heartbeat lag
(useful when migrating on master without checking replicas).
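
The heartbeat-lag check described above amounts to a comparison against the configured threshold. The following is a sketch under assumptions: `shouldPauseWorkers` is a hypothetical helper, and treating a non-positive threshold as "check disabled" is a guess, not something the description states.

```go
package main

import (
	"fmt"
	"time"
)

// shouldPauseWorkers reports whether copy workers should wait because the
// DML applier is falling behind. heartbeatLag is the applier's observed lag;
// maxLagMillis corresponds to --parallel-copy-max-heartbeatlag-millis.
func shouldPauseWorkers(heartbeatLag time.Duration, maxLagMillis int64) bool {
	// Assumption: a non-positive threshold disables this extra check.
	if maxLagMillis <= 0 {
		return false
	}
	return heartbeatLag > time.Duration(maxLagMillis)*time.Millisecond
}

func main() {
	fmt.Println(shouldPauseWorkers(1500*time.Millisecond, 1000)) // lag above threshold
	fmt.Println(shouldPauseWorkers(200*time.Millisecond, 1000))  // lag below threshold
}
```

A worker would call such a check before each chunk and sleep briefly while it returns true, so the applier catches up before more INSERTs are issued.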

Architecture
Boundary scans stay serial; only INSERTs are parallelized. The existing iterateChunks() producer remains the single goroutine that scans PK ranges and enqueues copy-task closures.

Frontier tracking
Since workers finish out of order, the checkpoint boundary cannot simply advance whenever a chunk completes. The advanceFrontier() function in parallel.go solves this with a gap-filling algorithm:
-Each chunk gets a monotonically increasing sequence number at dispatch time
-Completed chunks are stored in parallelPending map[int64]rangeResult
-The frontier (TotalRowsCopied, Iteration, LastIterationRange) only advances over a contiguous prefix, and it stops at the first gap
When an earlier chunk finishes, it fills the gap and releases the contiguous run of pending chunks above it. This guarantees that a crash and resume never leaves un-copied holes.
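
A minimal sketch of the gap-filling idea, assuming a simplified `rangeResult` that only carries a row count. The `frontier` type and its `complete` method are hypothetical stand-ins for advanceFrontier(), not the code in parallel.go, and a real implementation called from several workers would need a mutex around this state.

```go
package main

import "fmt"

// rangeResult is simplified here to a row count; the real type presumably
// also carries the chunk's key range.
type rangeResult struct {
	rows int64
}

// frontier tracks the contiguous prefix of completed chunks.
type frontier struct {
	next            int64                 // sequence number the frontier is waiting for
	totalRowsCopied int64                 // rows covered by the contiguous prefix
	parallelPending map[int64]rangeResult // completed chunks above the frontier
}

// newFrontier seeds the tracker at base, e.g. 0 for a fresh run.
func newFrontier(base int64) *frontier {
	return &frontier{next: base, parallelPending: make(map[int64]rangeResult)}
}

// complete records a finished chunk and advances the frontier across every
// contiguous completed chunk. It returns how many chunks were released.
func (f *frontier) complete(seq int64, r rangeResult) int {
	if seq < f.next {
		return 0 // already covered by the frontier
	}
	f.parallelPending[seq] = r
	advanced := 0
	for {
		done, ok := f.parallelPending[f.next]
		if !ok {
			return advanced // first gap: stop here
		}
		delete(f.parallelPending, f.next)
		f.totalRowsCopied += done.rows
		f.next++
		advanced++
	}
}

func main() {
	f := newFrontier(0)
	fmt.Println(f.complete(2, rangeResult{rows: 1000}), f.next) // 0 0
	fmt.Println(f.complete(1, rangeResult{rows: 1000}), f.next) // 0 0
	fmt.Println(f.complete(0, rangeResult{rows: 1000}), f.next) // 3 3
}
```

Chunks 1 and 2 finish first but stay pending; only when chunk 0 completes does the frontier jump to 3, so a checkpoint taken at any point never claims rows beyond a gap.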
Resuming
On --resume, the checkpoint restores the last contiguous frontier position. resetParallelState(checkpoint.Iteration) seeds the gap-filler from that base so it correctly handles any chunks that completed but weren't committed before the crash.

Consistency
Consistency is ensured by the combination of:
-INSERT IGNORE/SELECT FOR SHARE for row copy (current gh-ost behavior)
-DELETE/INSERT by the DML applier when a unique key change is detected (current gh-ost behavior)
-the advanceFrontier call, which ensures that checkpoints have no gaps (used by parallel-copy only)
All three are required for parallel-copy consistency. If any of them is missing, consistency of parallel copy cannot be guaranteed (for example, row copy could insert a row back after it was deleted). In the current serial mode the problem is much easier, because the DML applier never runs in parallel with row copy.

@olegkv olegkv changed the title Add parallel row copy with the support of consistent checkpoints (no gaps) Add parallel row copy with support for consistent checkpoints (no gaps) Jun 24, 2026