Parallel and Incremental Sync

For large directories with thousands of files, standard rsync can be slow — not because of network speed, but because of file scanning time. This page covers two optimization strategies: incremental snapshots (back up only what changed) and parallel execution (run multiple rsync instances simultaneously).

The Problem

A daily backup of a 50 GB web application that copies everything every time:

  • Transfers 50 GB daily → 350 GB/week
  • Takes 30+ minutes per backup
  • Wastes storage on identical files

The Solution: Hard-Linked Snapshots

--link-dest tells rsync to compare against a reference backup. Unchanged files are hard-linked to the reference (using zero extra disk space), and only changed files are actually copied.

rsync -av --link-dest=/backup/yesterday/ \
/var/www/html/ /backup/today/

How It Works

```mermaid
flowchart LR
    SRC["Source<br/>/var/www/html/<br/>50 GB"] --> RSYNC["rsync --link-dest"]
    REF["Reference<br/>/backup/yesterday/<br/>50 GB"] --> RSYNC
    RSYNC --> NEW["Today's Backup<br/>/backup/today/<br/>Shows as 50 GB<br/>Actually uses ~200 MB<br/>(only changed files)"]
```
| File State | What Rsync Does | Disk Space Used |
|---|---|---|
| Unchanged since yesterday | Creates hard link to yesterday's copy | 0 bytes |
| Modified since yesterday | Copies the new version | Full file size |
| New file (didn't exist yesterday) | Copies the file | Full file size |
| Deleted from source | Not present in today's backup | 0 bytes |
**Tip:** Each backup directory looks like a complete copy (you can `ls` or restore from any snapshot), but unchanged files share disk blocks. A week of daily backups might use only 10% more space than a single backup.

Daily Incremental Backup Script

#!/bin/bash
# incremental-backup.sh — Daily snapshot with hard links
set -e

SOURCE="/var/www/html/"
BACKUP_BASE="/backup"
TODAY=$(date +%F)
LATEST="$BACKUP_BASE/latest"

# Create today's backup, hard-linking unchanged files to the previous one.
# On the very first run no "latest" exists yet, so fall back to a full copy.
if [ -d "$LATEST" ]; then
    rsync -av --link-dest="$LATEST" "$SOURCE" "$BACKUP_BASE/$TODAY/"
else
    rsync -av "$SOURCE" "$BACKUP_BASE/$TODAY/"
fi

# Update the "latest" symlink to point to today's backup
ln -sfn "$BACKUP_BASE/$TODAY" "$LATEST"

# Clean up backups older than 14 days
find "$BACKUP_BASE" -maxdepth 1 -type d -name "20*" -mtime +14 -exec rm -rf {} \;

echo "Incremental backup complete: $TODAY"

Schedule with cron:

0 2 * * * /usr/local/bin/incremental-backup.sh >> /var/log/backup.log 2>&1

Restoring from a Snapshot

Every snapshot is a complete, self-contained copy:

# List available snapshots
ls -la /backup/
# 2024-01-10/
# 2024-01-11/
# 2024-01-12/
# latest -> 2024-01-12

# Restore from any snapshot — it's a full copy
rsync -av /backup/2024-01-11/ /var/www/html/

# Or restore a specific file from 3 days ago
cp /backup/2024-01-09/uploads/important-file.pdf /var/www/html/uploads/

Storage Efficiency

For a 50 GB web application with ~1% daily changes:

| Period | Full Copy Backups | Incremental (`--link-dest`) |
|---|---|---|
| 1 day | 50 GB | 50 GB |
| 7 days | 350 GB | ~53 GB |
| 30 days | 1.5 TB | ~65 GB |
| 90 days | 4.5 TB | ~95 GB |
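The incremental column is simple arithmetic: the first snapshot stores the full 50 GB, and each later snapshot adds roughly 1% of 50 GB (~0.5 GB) of changed files. A quick check of the table's figures:

```shell
# n daily snapshots ≈ base + (n - 1) * daily_churn (table values are rounded)
awk 'BEGIN {
  base = 50; daily = 50 * 0.01
  printf "7d=%g 30d=%g 90d=%g\n", base + 6*daily, base + 29*daily, base + 89*daily
}'
```

Hard-link metadata adds a little on top, which is why the longer-period figures in the table are quoted slightly higher.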

Parallel Rsync

The Problem

Rsyncing a directory with 100,000+ files can spend most of its time just scanning and comparing file metadata, before any data is transferred. This scan is I/O-bound and runs in a single thread.

The Solution: Split and Parallelize

Run multiple rsync processes on separate subdirectories simultaneously:

Using GNU Parallel

# Install if needed
sudo apt install parallel

# Sync upload directories in parallel (4 jobs).
# The trailing slash on {}/ copies each directory's contents into the
# matching destination directory instead of nesting it one level deeper.
find /var/www/html/uploads/ -mindepth 1 -maxdepth 1 -type d \
| parallel -j4 rsync -av {}/ user@backup:/backups/uploads/{/}/

# More granular: sync year/month directories (8 jobs)
find /var/www/html/uploads/ -mindepth 2 -maxdepth 2 -type d \
| parallel -j8 rsync -av {}/ user@backup:/backups/uploads/{= s:^.*/uploads/::; =}/

Using xargs

# 4 parallel rsync jobs
find /var/www/html/uploads/ -mindepth 1 -maxdepth 1 -type d -print0 \
| xargs -0 -n1 -P4 -I{} rsync -av {} user@backup:/backups/uploads/

When to Use Parallel Rsync

| Scenario | Use Parallel? | Recommended Jobs |
|---|---|---|
| < 10,000 files | No — standard rsync is fast enough | 1 |
| 10,000–100,000 files | Maybe — if scan time is long | 2–4 |
| 100,000+ files | Yes — significant speedup | 4–8 |
| 1,000,000+ files | Definitely — massive improvement | 8–16 |
| Files on SSD | Higher parallelism is safe | 8–16 |
| Files on HDD | Careful — too many jobs thrash the disk | 2–4 |
**Warning:** Too many parallel jobs can overwhelm disk I/O, especially on HDDs. Start with 4 parallel jobs and increase only if CPU and disk utilization have headroom. Monitor with `htop` and `iotop`.

Combining: Parallel + Incremental

The most powerful approach — parallel execution with hard-linked incremental snapshots:

#!/bin/bash
# parallel-incremental-backup.sh
set -e

SOURCE="/var/www/html/"
BACKUP_BASE="/backup"
TODAY=$(date +%F)
LATEST="$BACKUP_BASE/latest"
JOBS=4

# Create backup directory structure
mkdir -p "$BACKUP_BASE/$TODAY"

# Parallel incremental rsync of top-level directories.
# Note: xargs has no {/} substitution (that is GNU parallel syntax).
# An absolute --link-dest works as-is, because rsync resolves each
# transferred file's relative path (e.g. uploads/...) against it.
find "$SOURCE" -mindepth 1 -maxdepth 1 -type d -print0 \
| xargs -0 -n1 -P "$JOBS" -I{} rsync -av \
--link-dest="$LATEST" \
{} "$BACKUP_BASE/$TODAY/"

# Also sync root-level files (not parallel, they're few)
rsync -av --link-dest="$LATEST/" \
--exclude='*/' \
"$SOURCE" "$BACKUP_BASE/$TODAY/"

# Update latest symlink
ln -sfn "$BACKUP_BASE/$TODAY" "$LATEST"

# Rotate old backups (keep 14 days)
find "$BACKUP_BASE" -maxdepth 1 -type d -name "20*" -mtime +14 -exec rm -rf {} \;

echo "Parallel incremental backup complete: $TODAY"

Rsync's Built-In Incremental Behavior

Even without --link-dest, rsync is already incremental in how it transfers data:

| Feature | Behavior |
|---|---|
| File comparison | Skips files with matching size + timestamp |
| Delta transfer | Sends only changed blocks of modified files |
| `--ignore-existing` | Skips files that already exist at the destination |
| `--update` | Skips files that are newer at the destination |

# Standard rsync — already skips unchanged files
rsync -avz /var/www/ user@backup:/backups/www/

# Only add new files (never overwrite)
rsync -av --ignore-existing /var/www/ user@backup:/backups/www/

# Only update if source is newer
rsync -av --update /var/www/ user@backup:/backups/www/

The difference with --link-dest is that each run creates a separate snapshot directory — giving you full point-in-time recovery capabilities.

Common Pitfalls

| Pitfall | Impact | Prevention |
|---|---|---|
| Too many parallel jobs on HDD | Disk thrashing slows everything down | Start with 4, monitor with `iotop` |
| Forgetting to update the `latest` symlink | `--link-dest` has no reference, copies everything | Always `ln -sfn` after backup |
| Not rotating old snapshots | Disk fills up even with hard links (metadata) | `find -mtime +N -exec rm -rf` |
| Parallel rsync on the same directory | Race conditions, corrupted backup | Split into non-overlapping subdirectories |
| Not testing restore from snapshots | Unknown backup quality | Periodically restore to staging |

Quick Reference

# Incremental backup with link-dest
rsync -av --link-dest=/backup/latest/ /var/www/ /backup/$(date +%F)/
ln -sfn /backup/$(date +%F) /backup/latest

# Parallel rsync (4 jobs) — -mindepth 1 skips the parent dir itself,
# and the trailing slash on {}/ avoids nesting an extra level
find /var/www/uploads/ -mindepth 1 -maxdepth 1 -type d \
| parallel -j4 rsync -av {}/ backup:/backups/uploads/{/}/

# Incremental — only new files
rsync -av --ignore-existing /var/www/ backup:/backups/

# Check hard link counts (verify link-dest is working)
stat /backup/2024-01-15/index.php | grep Links

What's Next