Add parallel row copy with support for consistent checkpoints (no gaps) by olegkv · Pull Request #1727 · github/gh-ost

olegkv · 2026-06-24T17:47:08Z

Related issue: #193

Description

Parallel-Copy Feature Summary
Instead of copying table rows one chunk at a time, parallel-copy runs multiple worker goroutines that execute chunk INSERTs concurrently. This is useful primarily for large tables whose workload drops periodically, e.g. during nighttime pauses in activity. During such drops, parallel-copy copies rows faster by using all available resources while reusing the existing throttling.
The problems raised in the original issue are no longer critical: MySQL 8+ handles concurrency much better than MySQL 5.7.

Performance
Measured when MySQL is not busy (imitating a nighttime workload drop), with innodb-doublewrite=OFF, sync-binlog=1, innodb-flush-log-at-trx-commit=1, on a small table with an auto-increment PK (3 columns, 2M rows, chunk_size=1000):
serial mode 55s
1 worker 55s (can still be usable because it runs in parallel with DML applier)
2 workers 40s (x1.38)
4 workers 27s (x2)
8 workers 16s (x3.4)
16 workers 11s (x5)
32 workers 8s (x6.9)

New flags
--parallel-copy (requires --checkpoint)
--parallel-copy-workers: number of workers (default 4, max 64, recommended ≤16)
--parallel-copy-max-heartbeatlag-millis N: pause workers if gh-ost's own DML applier (HeartbeatLag) falls behind
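
The constraints listed above (the --checkpoint requirement, the default of 4 workers, the maximum of 64) could be enforced roughly as in the following sketch. The `ParallelCopyOptions` struct, its field names, and the `Validate` method are hypothetical and are not taken from the PR; treating a worker count of 0 as "use the default" and rejecting a negative lag limit are also assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// ParallelCopyOptions is a hypothetical holder for the new flags.
// The field names are illustrative, not gh-ost's actual ones.
type ParallelCopyOptions struct {
	ParallelCopy          bool  // --parallel-copy
	Checkpoint            bool  // --checkpoint
	Workers               int   // --parallel-copy-workers
	MaxHeartbeatLagMillis int64 // --parallel-copy-max-heartbeatlag-millis
}

const (
	defaultWorkers = 4  // default stated in the PR description
	maxWorkers     = 64 // maximum stated in the PR description
)

// Validate checks the flag combination described in the PR text.
// Assumption: Workers == 0 means "not set" and falls back to the default.
func (o *ParallelCopyOptions) Validate() error {
	if !o.ParallelCopy {
		return nil // serial mode: nothing to validate
	}
	if !o.Checkpoint {
		return errors.New("--parallel-copy requires --checkpoint")
	}
	if o.Workers == 0 {
		o.Workers = defaultWorkers
	}
	if o.Workers < 1 || o.Workers > maxWorkers {
		return fmt.Errorf("--parallel-copy-workers must be between 1 and %d, got %d", maxWorkers, o.Workers)
	}
	if o.MaxHeartbeatLagMillis < 0 {
		return errors.New("--parallel-copy-max-heartbeatlag-millis must not be negative")
	}
	return nil
}

func main() {
	opts := ParallelCopyOptions{ParallelCopy: true, Checkpoint: true}
	if err := opts.Validate(); err != nil {
		fmt.Println("invalid flags:", err)
		return
	}
	fmt.Println("workers:", opts.Workers)
}
```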

Implementation
The parallel-copy feature is implemented as a series of additive guards of the form if ParallelCopy { … } else { … }, so the existing logic remains in place and is exercised identically when the flag is off. Every new branch either gates on ParallelCopy = true or replicates the original serial code in the else, so running without the flag (the default) produces the same execution as before. I tried to make the change as additive and non-invasive as possible.
The applier should still have priority, same as in serial mode. To ensure that, parallel-copy adds one more throttle check on the applier's own processing rate, using heartbeat lag (useful when migrating on the master without checking replicas).

Architecture
Boundary scans stay serial; only INSERTs are parallelized. The existing iterateChunks() producer remains the single goroutine that scans PK ranges and enqueues copy-task closures.
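
The single-producer, multi-worker shape described above can be sketched as follows. This is an illustration only: `runCopy` and `copyTask` are hypothetical names, the "INSERT" is simulated by returning the chunk's row count, and the real iterateChunks() scans PK ranges against MySQL instead of doing arithmetic.

```go
package main

import (
	"fmt"
	"sync"
)

// copyTask stands in for a closure that executes one chunk INSERT and
// reports how many rows it copied.
type copyTask func() int64

// runCopy scans chunk boundaries serially in the calling goroutine and hands
// each chunk to a pool of workers, which execute the tasks concurrently.
// It returns the total number of rows copied.
func runCopy(totalRows, chunkSize int64, workers int) int64 {
	tasks := make(chan copyTask, workers)
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		total int64
	)

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for task := range tasks {
				n := task() // the only part that runs in parallel
				mu.Lock()
				total += n
				mu.Unlock()
			}
		}()
	}

	// Producer: the boundary scan stays serial.
	for lo := int64(0); lo < totalRows; lo += chunkSize {
		start, end := lo, lo+chunkSize
		if end > totalRows {
			end = totalRows
		}
		tasks <- func() int64 { return end - start }
	}
	close(tasks)
	wg.Wait()
	return total
}

func main() {
	fmt.Println(runCopy(2_000_000, 1000, 4)) // 2000000
}
```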

Frontier tracking
Since workers finish out-of-order, we can't just blindly advance the checkpoint boundary. The advanceFrontier() function in parallel.go solves this with a gap-filling algorithm:
- Each chunk gets a monotonically increasing sequence number at dispatch time.
- Completed chunks are stored in parallelPending map[int64]rangeResult.
- The frontier (TotalRowsCopied, Iteration, LastIterationRange) only advances over a contiguous prefix and stops at the first gap.
When an earlier chunk finishes, it fills the gap and releases the pending chunks that follow it, up to the next gap. This guarantees that a crash and resume never leaves un-copied holes.
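
The gap-filling idea can be illustrated with a small standalone sketch. It is not the code from parallel.go: the `frontier` type, its fields, the shape of `rangeResult`, and sequence numbers starting at 0 are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// rangeResult is a hypothetical shape for a completed chunk.
type rangeResult struct {
	rowsCopied int64
	rangeMax   int64 // upper PK bound of the chunk
}

// frontier tracks the contiguous prefix of completed chunks.
type frontier struct {
	mu              sync.Mutex
	nextSeq         int64 // sequence number the frontier is waiting for
	pending         map[int64]rangeResult
	totalRowsCopied int64
	lastRangeMax    int64
}

func newFrontier() *frontier {
	return &frontier{pending: make(map[int64]rangeResult)}
}

// advance records that chunk seq finished, then moves the frontier forward
// over every contiguous completed chunk, stopping at the first gap.
// It returns how many chunks were folded into the frontier by this call.
func (f *frontier) advance(seq int64, res rangeResult) int {
	f.mu.Lock()
	defer f.mu.Unlock()

	f.pending[seq] = res
	advanced := 0
	for {
		next, ok := f.pending[f.nextSeq]
		if !ok {
			return advanced // gap: an earlier chunk is still in flight
		}
		delete(f.pending, f.nextSeq)
		f.totalRowsCopied += next.rowsCopied
		f.lastRangeMax = next.rangeMax
		f.nextSeq++
		advanced++
	}
}

func main() {
	f := newFrontier()
	// Chunks 1 and 2 finish before chunk 0: the frontier must not move.
	f.advance(1, rangeResult{rowsCopied: 1000, rangeMax: 2000})
	f.advance(2, rangeResult{rowsCopied: 1000, rangeMax: 3000})
	fmt.Println(f.nextSeq, f.totalRowsCopied) // 0 0
	// Chunk 0 fills the gap and releases chunks 1 and 2.
	f.advance(0, rangeResult{rowsCopied: 1000, rangeMax: 1000})
	fmt.Println(f.nextSeq, f.totalRowsCopied) // 3 3000
}
```

A checkpoint written from `totalRowsCopied` and `lastRangeMax` in this sketch only ever describes a fully copied prefix, which is the property the PR relies on.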
Resuming
On --resume, the checkpoint restores the last contiguous frontier position. resetParallelState(checkpoint.Iteration) seeds the gap-filler from that base so it correctly handles any chunks that completed but weren't committed before the crash.
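
As a rough illustration of the reseeding step, the sketch below resets a tracker to the checkpointed base. The `parallelState` type and its fields are hypothetical, and two assumptions are made that the PR text does not confirm: that checkpoint.Iteration equals the number of chunks in the contiguous prefix, and that chunks completed beyond the frontier are simply dispatched again after resume.

```go
package main

import "fmt"

// checkpoint is a hypothetical, reduced view of the stored checkpoint.
type checkpoint struct {
	Iteration int64 // assumed: number of chunks in the contiguous prefix
}

// parallelState is a hypothetical stand-in for the gap-filler's state.
type parallelState struct {
	nextDispatchSeq int64          // sequence number given to the next chunk
	frontierSeq     int64          // first sequence number not yet in the frontier
	pending         map[int64]bool // completed chunks waiting behind a gap
}

// reset seeds the state from the checkpointed base. Anything that finished
// beyond the frontier before the crash is forgotten and copied again.
func (s *parallelState) reset(base int64) {
	s.nextDispatchSeq = base
	s.frontierSeq = base
	s.pending = make(map[int64]bool)
}

func main() {
	// Before the crash: chunks 0..4 formed the frontier, chunks 6 and 7 had
	// finished, and chunk 5 was still in flight.
	s := &parallelState{
		nextDispatchSeq: 8,
		frontierSeq:     5,
		pending:         map[int64]bool{6: true, 7: true},
	}
	cp := checkpoint{Iteration: 5}

	s.reset(cp.Iteration)
	fmt.Println(s.nextDispatchSeq, s.frontierSeq, len(s.pending)) // 5 5 0
}
```

Under these assumptions, re-copying chunks 6 and 7 is harmless because row copy uses INSERT IGNORE.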

Consistency
Consistency is ensured by:
- INSERT IGNORE / SELECT FOR SHARE for row copy (current gh-ost behavior)
- DELETE/INSERT by the DML applier when a unique key change is detected (current gh-ost behavior)
- the advanceFrontier call, which ensures that checkpoints have no gaps (used by parallel-copy only)
All three are required for parallel-copy consistency. If any of them is missing, consistency of parallel copy cannot be guaranteed (for example, row copy could re-insert a row after it was deleted). In the current serial mode the problem is much easier, because the DML applier never runs in parallel with row copy.

olegkv added 3 commits June 24, 2026 12:49

parallel-copy-added

abd7b70

minor corrections

1602b0d

added more info to md file

bb42de4

olegkv requested review from meiji163 and timvaillancourt as code owners June 24, 2026 17:47

olegkv added 3 commits June 24, 2026 14:03

more consistent variable names

e7748b9

hopefully the last change in comment section

beb6a71

corrected linter error

5d9aab2

olegkv changed the title ~~Add parallel row copy with the support of consistent checkpoints (no gaps)~~ Add parallel row copy with support for consistent checkpoints (no gaps) Jun 24, 2026

removed the experiment with speeding up checkpoints

02e4e39

olegkv wants to merge 7 commits into github:master from olegkv:parallel-copy-olegkv
