Parallel and Incremental Sync
For large directories with thousands of files, standard rsync can be slow — not because of network speed, but because of file scanning time. This page covers two optimization strategies: incremental snapshots (back up only what changed) and parallel execution (run multiple rsync instances simultaneously).
Incremental Backups with --link-dest
The Problem
A daily backup of a 50 GB web application that copies everything every time:
- Transfers 50 GB daily → 350 GB/week
- Takes 30+ minutes per backup
- Wastes storage on identical files
The Solution: Hard-Linked Snapshots
--link-dest tells rsync to compare against a reference backup. Unchanged files are hard-linked to the reference (using zero extra disk space), and only changed files are actually copied.
rsync -av --link-dest=/backup/yesterday/ \
/var/www/html/ /backup/today/
How It Works
flowchart LR
SRC["Source<br/>/var/www/html/<br/>50 GB"] --> RSYNC["rsync --link-dest"]
REF["Reference<br/>/backup/yesterday/<br/>50 GB"] --> RSYNC
RSYNC --> NEW["Today's Backup<br/>/backup/today/<br/>Shows as 50 GB<br/>Actually uses ~200 MB<br/>(only changed files)"]
| File State | What Rsync Does | Disk Space Used |
|---|---|---|
| Unchanged since yesterday | Creates hard link to yesterday's copy | 0 bytes |
| Modified since yesterday | Copies the new version | Full file size |
| New file (didn't exist yesterday) | Copies the file | Full file size |
| Deleted from source | Not present in today's backup | 0 bytes |
Each backup directory looks like a complete copy (you can ls or restore from any snapshot), but unchanged files share disk blocks. A week of daily backups might use only 10% more space than a single backup.
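The space sharing rests on plain hard links, which you can observe with ln and stat alone, no rsync required. A minimal sketch (the temp directory and file names are illustrative, and stat -c assumes GNU coreutils):

```shell
# Two names, one inode: a hard link adds a directory entry, not data blocks
demo=$(mktemp -d)
echo "unchanged content" > "$demo/original.txt"
ln "$demo/original.txt" "$demo/snapshot.txt"    # same kind of link --link-dest creates
stat -c '%i' "$demo/original.txt" "$demo/snapshot.txt"  # prints the same inode twice
stat -c '%h' "$demo/original.txt"               # link count is now 2
rm -rf "$demo"
```

This is exactly why deleting one snapshot never breaks another: the data blocks survive until the last link to them is removed.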
Daily Incremental Backup Script
#!/bin/bash
# incremental-backup.sh — Daily snapshot with hard links
set -e
SOURCE="/var/www/html/"
BACKUP_BASE="/backup"
TODAY=$(date +%F)
LATEST="$BACKUP_BASE/latest"
# Create today's backup with hard links to the previous one
rsync -av --link-dest="$LATEST" \
"$SOURCE" "$BACKUP_BASE/$TODAY/"
# Update the "latest" symlink to point to today's backup
ln -sfn "$BACKUP_BASE/$TODAY" "$LATEST"
# Clean up backups older than 14 days
find "$BACKUP_BASE" -maxdepth 1 -type d -name "20*" -mtime +14 -exec rm -rf {} \;
echo "Incremental backup complete: $TODAY"
Schedule with cron:
0 2 * * * /usr/local/bin/incremental-backup.sh >> /var/log/backup.log 2>&1
Restoring from a Snapshot
Every snapshot is a complete, self-contained copy:
# List available snapshots
ls -la /backup/
# 2024-01-10/
# 2024-01-11/
# 2024-01-12/
# latest -> 2024-01-12
# Restore from any snapshot — it's a full copy
rsync -av /backup/2024-01-11/ /var/www/html/
# Or restore a specific file from 3 days ago
cp /backup/2024-01-09/uploads/important-file.pdf /var/www/html/uploads/
Storage Efficiency
For a 50 GB web application with ~1% daily changes:
| Period | Full Copy Backups | Incremental (--link-dest) |
|---|---|---|
| 1 day | 50 GB | 50 GB |
| 7 days | 350 GB | ~53 GB |
| 30 days | 1.5 TB | ~65 GB |
| 90 days | 4.5 TB | ~95 GB |
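The incremental column follows from simple arithmetic: one full 50 GB base, plus roughly 0.5 GB (1% churn) for each additional day. A quick awk check of the 30-day row:

```shell
# 30-day estimate: 50 GB base + 29 further days of ~1% daily churn
awk 'BEGIN { base = 50; churn = base * 0.01; days = 30
             printf "%.1f GB\n", base + churn * (days - 1) }'
# prints 64.5 GB, matching the ~65 GB figure in the table
```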
Parallel Rsync
The Problem
Rsyncing a directory with 100,000+ files spends a long time just building and comparing file lists before any data moves. That scan is single-threaded and largely I/O-bound, so a faster network does not help.
The Solution: Split and Parallelize
Run multiple rsync processes on separate subdirectories simultaneously:
Using GNU Parallel
# Install if needed
sudo apt install parallel
# Sync upload directories in parallel (4 jobs).
# Note the trailing slash on {}/: it copies each directory's *contents* into
# the matching {/} (basename) directory on the remote. Without it, rsync
# would create a doubled path like uploads/photos/photos/.
find /var/www/html/uploads/ -mindepth 1 -maxdepth 1 -type d \
| parallel -j4 rsync -av {}/ user@backup:/backups/uploads/{/}/
# More granular: sync year/month directories. The {= =} perl expression
# strips everything up to ".../uploads/" so the year/month layout is kept.
find /var/www/html/uploads/ -mindepth 2 -maxdepth 2 -type d \
| parallel -j8 rsync -av {}/ user@backup:/backups/uploads/{= s:^.*/uploads/::; =}/
Using xargs
# 4 parallel rsync jobs
find /var/www/html/uploads/ -mindepth 1 -maxdepth 1 -type d -print0 \
| xargs -0 -n1 -P4 -I{} rsync -av {} user@backup:/backups/uploads/
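Before launching real transfers, it can help to preview how find will partition the work. Substituting echo for rsync turns the same pipeline into a harmless dry run (the temp directory stands in for a real uploads tree):

```shell
# Preview the per-job work units without touching the network
updir=$(mktemp -d)
mkdir -p "$updir/2023" "$updir/2024" "$updir/misc"
find "$updir" -mindepth 1 -maxdepth 1 -type d -print0 \
  | xargs -0 -n1 -P4 -I{} echo "job would run: rsync -av {} user@backup:/backups/uploads/"
rm -rf "$updir"
```

One line per subdirectory means one rsync process per subdirectory; if the preview shows a single huge directory dominating, split at a deeper level instead.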
When to Use Parallel Rsync
| Scenario | Use Parallel? | Recommended Jobs |
|---|---|---|
| < 10,000 files | No — standard rsync is fast enough | 1 |
| 10,000 – 100,000 files | Maybe — if scan time is long | 2–4 |
| 100,000+ files | Yes — significant speedup | 4–8 |
| 1,000,000+ files | Definitely — massive improvement | 8–16 |
| Files on SSD | Higher parallelism safe | 8–16 |
| Files on HDD | Careful — too many jobs thrashes disk | 2–4 |
Too many parallel jobs can overwhelm disk I/O, especially on HDDs. Start with 4 parallel jobs and increase only if CPU and disk utilization have headroom. Monitor with htop and iotop.
Combining: Parallel + Incremental
The most powerful approach — parallel execution with hard-linked incremental snapshots:
#!/bin/bash
# parallel-incremental-backup.sh
set -e
SOURCE="/var/www/html/"
BACKUP_BASE="/backup"
TODAY=$(date +%F)
LATEST="$BACKUP_BASE/latest"
JOBS=4
# Create backup directory structure
mkdir -p "$BACKUP_BASE/$TODAY"
# Parallel incremental rsync of top-level directories.
# Each job copies one top-level directory into $TODAY/, so the matching
# reference path inside $LATEST lines up automatically and a single
# --link-dest="$LATEST" works for every job. (xargs -I{} substitutes only
# the literal string {}; a GNU parallel-style {/} would NOT be expanded.)
find "$SOURCE" -mindepth 1 -maxdepth 1 -type d -print0 \
| xargs -0 -n1 -P"$JOBS" -I{} rsync -av \
--link-dest="$LATEST" \
{} "$BACKUP_BASE/$TODAY/"
# Also sync root-level files (not parallel, they're few)
rsync -av --link-dest="$LATEST/" \
--exclude='*/' \
"$SOURCE" "$BACKUP_BASE/$TODAY/"
# Update latest symlink
ln -sfn "$BACKUP_BASE/$TODAY" "$LATEST"
# Rotate old backups (keep 14 days)
find "$BACKUP_BASE" -maxdepth 1 -type d -name "20*" -mtime +14 -exec rm -rf {} \;
echo "Parallel incremental backup complete: $TODAY"
Rsync's Built-In Incremental Behavior
Even without --link-dest, rsync is already incremental in how it transfers data:
| Feature | Behavior |
|---|---|
| File comparison | Skips files with matching size + timestamp |
| Delta transfer | Sends only changed blocks of modified files |
| --ignore-existing | Skips files that already exist at destination |
| --update | Skips files that are newer at destination |
# Standard rsync — already skips unchanged files
rsync -avz /var/www/ user@backup:/backups/www/
# Only add new files (never overwrite)
rsync -av --ignore-existing /var/www/ user@backup:/backups/www/
# Only update if source is newer
rsync -av --update /var/www/ user@backup:/backups/www/
The difference with --link-dest is that each run creates a separate snapshot directory — giving you full point-in-time recovery capabilities.
Common Pitfalls
| Pitfall | Impact | Prevention |
|---|---|---|
| Too many parallel jobs on HDD | Disk thrashing slows everything down | Start with 4, monitor with iotop |
| Forgetting to update the latest symlink | --link-dest has no reference, copies everything | Always ln -sfn after backup |
| Not rotating old snapshots | Disk fills up even with hard links (metadata) | find -mtime +N -exec rm -rf |
| Parallel rsync on same directory | Race conditions, corrupted backup | Split into non-overlapping subdirectories |
| Not testing restore from snapshots | Unknown backup quality | Periodically restore to staging |
Quick Reference
# Incremental backup with link-dest
rsync -av --link-dest=/backup/latest/ /var/www/ /backup/$(date +%F)/
ln -sfn /backup/$(date +%F) /backup/latest
# Parallel rsync (4 jobs)
find /var/www/uploads/ -mindepth 1 -maxdepth 1 -type d \
| parallel -j4 rsync -av {}/ backup:/backups/uploads/{/}/
# Incremental — only new files
rsync -av --ignore-existing /var/www/ backup:/backups/
# Check hard link counts (verify link-dest is working)
stat /backup/2024-01-15/index.php | grep Links
What's Next
- Compression and Bandwidth — Optimize transfer performance
- Backup Strategies — Complete backup architecture
- Cron Automation — Schedule automated backups