Technical whitepaper — PodHeitor HPC Backup Plugin for Bacula. Internal architecture, parallelism model, sharding, integration with parallel filesystems (Lustre, GPFS, BeeGFS, CephFS, WekaFS), field benchmarks, and deployment topologies for HPC clusters with billion-file namespaces.
Technical companion to the PodHeitor HPC plugin page. For the executive PDF: PodHeitor HPC Whitepaper (PDF).
1. Problem: Bacula was not designed for HPC
The stock Bacula File Daemon walks the filesystem single-threaded. In findlib/find_one.c, recursion is a readdir() loop followed by a synchronous save_file(). That design is reasonable for a typical application server, but on a Lustre filesystem with one billion files the arithmetic does not work:
- Average readdir + lstat cost on warm Lustre: ~150 µs per file (bounded by RPC round-trips to the MDT).
- 1 × 10⁹ files × 150 µs ≈ 41.7 hours of metadata scan alone, before reading a single byte of data.
- Bacula Enterprise 18.2 ships plugins for HDFS, Quobyte, NDMP, NetApp and Nutanix — none for the parallel filesystems that actually run modern HPC.
The PodHeitor HPC Backup Plugin closes that gap with aggressive in-job parallelism, cross-job namespace sharding, and per-filesystem changelog drivers.
2. Architectural model
The plugin follows the PodHeitor pattern of cdylib + standalone backend, with PTCOMM (length-tagged framing on stdin/stdout) between the two processes. The motivation is threefold:
- Crash isolation. A panic in the walker kills the backend, not `bacula-fd`. The cdylib observes EOF on the pipe, reports the job as failed, and the FD continues handling other jobs.
- Parallelism freedom. The backend can spawn arbitrary rayon threads without violating Bacula's "one thread per `bpContext`" contract.
- License firewall. Bacula's loader gates on `info->plugin_license` at runtime. The subprocess never touches Bacula's ABI; only the cdylib does, and exclusively through the `bacula-fd-abi` crate with independent `extern "C"` declarations. No Bacula AGPLv3 source is statically linked.
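Before the crate map, here is a minimal sketch of what length-tagged framing over a pipe can look like. The actual PTCOMM field layout is not documented in this section, so the 4-byte big-endian length prefix and the "EOF means the backend is gone" convention are assumptions used only to illustrate the producer/consumer shape:

```rust
use std::io::{self, Read, Write};

/// Write one frame: length prefix, then payload.
/// Assumption: 4-byte big-endian length; the real PTCOMM layout may differ.
fn write_frame<W: Write>(out: &mut W, payload: &[u8]) -> io::Result<()> {
    out.write_all(&(payload.len() as u32).to_be_bytes())?;
    out.write_all(payload)?;
    out.flush()
}

/// Read one frame. EOF on the length prefix is the crash-isolation signal:
/// the backend exited, so the cdylib reports the job as failed.
fn read_frame<R: Read>(input: &mut R) -> io::Result<Option<Vec<u8>>> {
    let mut len_buf = [0u8; 4];
    if let Err(e) = input.read_exact(&mut len_buf) {
        return if e.kind() == io::ErrorKind::UnexpectedEof { Ok(None) } else { Err(e) };
    }
    let len = u32::from_be_bytes(len_buf) as usize;
    let mut payload = vec![0u8; len];
    input.read_exact(&mut payload)?;
    Ok(Some(payload))
}

fn main() -> io::Result<()> {
    // In-memory round trip standing in for the backend's stdout / cdylib's stdin pair.
    let mut buf = Vec::new();
    write_frame(&mut buf, b"hello")?;
    let mut cursor = io::Cursor::new(buf);
    assert_eq!(read_frame(&mut cursor)?, Some(b"hello".to_vec()));
    assert_eq!(read_frame(&mut cursor)?, None); // EOF: backend gone
    Ok(())
}
```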
2.1 Crate map
| Crate | Role | Phase |
|---|---|---|
| hpc-cdylib | `.so` loaded by `bacula-fd`. Thin metaplugin-rs adapter. | 1 |
| hpc-backend | Standalone Rust binary. PTCOMM producer; hosts the parallel walker, sharder, and changelog/stripe drivers. | 1 |
| hpc-walker | Parallel POSIX walker (rayon work-stealing + bounded crossbeam channel). | 2 |
| hpc-shard | Namespace sharding strategies. | 3 |
| hpc-changelog | Lustre/GPFS/CephFS/BeeGFS drivers behind cargo features. | 4-6 |
| hpc-stripe | Stripe-aware parallel readers (Lustre llapi, GPFS NSD). | 4-5 |
| hpc-ptcomm | Wire format between cdylib and backend. | 1 |
| hpc-cli | Operator CLI `podheitor-hpc` (gen-fileset, shard-probe, bench). | 3 |
2.2 Threading model
- `bacula-fd`: one thread per job. The cdylib runs on this thread and must never block it for long.
- `hpc-cdylib`: single-threaded by Bacula contract. Drains PTCOMM frames synchronously and translates each frame into a `save_pkt`/`pluginIO` cycle.
- `hpc-backend`: spawns a rayon thread pool sized to `nproc`. All metadata work is parallel. A single-threaded PTCOMM emitter consumes the bounded MPSC channel and writes to stdout; that is the only serialization point on the producer side.
The practical result: the bottleneck shifts from the FD walker to the consumer (the FD’s own thread draining PTCOMM). The backend produces at filesystem line rate, and the FD becomes the limit — a much higher limit.
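A minimal sketch of that producer-side shape: rayon workers feeding a bounded channel that a single emitter thread drains. The entry type, channel depth, and shard roots below are placeholders, and the real walker and PTCOMM frame writer are elided:

```rust
use crossbeam_channel::bounded;
use std::io::Write;
use std::thread;

// Stand-in for one file's metadata frame; the real entry carries stat data
// plus a small content prefix.
struct Entry(String);

fn main() {
    // Bounded channel: walkers block when the emitter (ultimately the FD) cannot keep up.
    let (tx, rx) = bounded::<Entry>(8192);

    // Single serialization point: one emitter thread owns stdout.
    let emitter = thread::spawn(move || {
        let stdout = std::io::stdout();
        let mut out = stdout.lock();
        for entry in rx {
            // The real backend calls write_frame(&mut out, ...); plain lines here.
            writeln!(out, "{}", entry.0).unwrap();
        }
    });

    // Parallel metadata work: rayon work-stealing over hypothetical shard roots.
    rayon::scope(|s| {
        for root in ["/lustre/scratch/a", "/lustre/scratch/b"] {
            let tx = tx.clone();
            s.spawn(move |_| {
                // walk_dir(root, &tx) in the real walker; a single send here.
                tx.send(Entry(root.to_string())).unwrap();
            });
        }
    });

    drop(tx); // close the channel so the emitter drains and exits
    emitter.join().unwrap();
}
```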
3. Why sharding is mandatory at HPC scale
Even with perfect in-backend parallelism, there is a ceiling: bacula-fd holds a single SD connection per job. Bacula 15.x has no within-job stream multiplexing. The only way to multiply outbound throughput is to run N concurrent jobs, with Maximum Concurrent Jobs applied at Director, Job, Client, and Storage.
The hpc-shard crate makes this an operator dial:
podheitor-hpc gen-fileset --strategy inode-hash --shards 4 \
    --path /lustre/scratch --output /opt/bacula/etc/conf.d/
Produces 4 FileSets and 4 Job snippets, each with:
Plugin = "podheitor-hpc:path=/lustre/scratch;shard=K/4;mode=lustre"
Run them concurrently for 4× outbound SD streams.
3.1 Sharding strategies
| Strategy | When to use | Notes |
|---|---|---|
| `None` | Single shard, small dataset, manual config | Phase 0 default |
| `InodeHash { shard, of }` | General parallel default; uniform distribution | xxh3-based; deterministic across runs (see the sketch after this table) |
| `Subtree { path }` | Different subtrees deserve different policies | Composes well with per-subtree InodeHash |
| `MtimeBucket { shard, of }` | Incrementals across many shards | Pair with Level = Incremental |
| `LustreMdt { shard, of }` | Lustre: pin one shard per MDT | Reads MDT info from `lfs df -i`; falls back to InodeHash if absent |
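A sketch of how the InodeHash { shard, of } predicate can be evaluated. The use of the xxhash-rust crate and hashing the raw inode number in little-endian form are illustration-only assumptions; the plugin's exact hash input is not specified here:

```rust
use std::os::unix::fs::MetadataExt;
use std::path::Path;
use xxhash_rust::xxh3::xxh3_64; // xxhash-rust crate, "xxh3" feature

/// Deterministic across runs because the decision depends only on the inode
/// number, never on scan order or path casing.
fn belongs_to_shard(path: &Path, shard: u64, of: u64) -> std::io::Result<bool> {
    let ino = std::fs::symlink_metadata(path)?.ino();
    Ok(xxh3_64(&ino.to_le_bytes()) % of == shard)
}

fn main() -> std::io::Result<()> {
    // Shard 0 of 4, as in the gen-fileset example above.
    println!("{}", belongs_to_shard(Path::new("/etc/hosts"), 0, 4)?);
    Ok(())
}
```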
3.2 The serialized-volume gotcha
A detail that destroys naïve sharding gains: the Bacula SD serializes block writes per volume. If all N shards land on the same volume, you get N parallel FD walkers feeding one serial SD writer: correctness (parity, union coverage), but no wall-time speedup.
A Phase 3 lab measurement captured exactly this: 4 shards on the same Volume took 13 s versus a single-stream baseline of 10 s (i.e., slower). To unlock the speedup the architecture promises, choose one of three topologies:
| Topology | Setup | Tradeoff |
|---|---|---|
| Per-shard Pool | One Pool per shard, with its own volume sequence (LabelFormat = "Shard-{N}-"). gen-fileset emits a Pool snippet alongside FileSet/Job. | Cleanest; volumes never collide. |
| MaximumVolumeJobs = 1 | Set on the existing default Pool. Bacula allocates a fresh volume per job. | Lighter weight; produces lots of small volumes that are harder to manage long term. |
| Multiple Storages → distinct devices | Round-robin with `gen-fileset --storage File1,File2,File3,File4`. Each Storage points at a different SD Device with its own volume pool. | Best when the SD has multiple disks/spindles to fan out to. |
4. Changelog drivers (true incrementals)
The leap from “full scan” to “real incremental” depends on not making 10⁹ stat() calls. Each parallel filesystem exposes its own changelog mechanism; the hpc-changelog crate abstracts these behind the ChangelogSource trait:
| Filesystem | Mechanism | What it captures |
|---|---|---|
| Lustre 2.14+ | `lfs changelog` | create/unlink/rename/rmdir/setattr records per MDT, with a registered consumer cookie to ensure correct slot release |
| IBM Spectrum Scale (GPFS) 5.x | `mmapplypolicy` with a policy template | Files with MODIFICATION_TIME > X via the GPFS-native parallel metadata scan |
| CephFS | Recursive rstats + rctime | Whole subtrees can be pruned when rctime <= last_run (a logarithmic cut on the tree) |
| BeeGFS 7.x | Metadata-shard scan | Each metadata target is walked in parallel; no formal changelog, but the shard scan is already highly parallelizable |
| POSIX (fallback) | PosixWalk with mtime >= last_run | When none of the above is available |
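One plausible shape for the ChangelogSource abstraction, sketched for illustration; the real trait and cursor type in hpc-changelog may differ:

```rust
use std::path::PathBuf;

/// Hypothetical result type: the paths changed since the last persisted cursor,
/// plus a new cursor to store after a successful job.
pub struct ChangeSet {
    pub changed: Vec<PathBuf>,
    pub cursor: String, // e.g. a Lustre changelog record index or an mtime watermark
}

pub trait ChangelogSource {
    /// Driver name, e.g. "lustre", "gpfs-policy", "posix-mtime".
    fn name(&self) -> &'static str;
    /// Enumerate changes since `cursor`; `None` means no previous run, i.e. full scan.
    fn changes_since(&self, cursor: Option<&str>) -> std::io::Result<ChangeSet>;
}

/// POSIX fallback driver from the last table row: filter on mtime >= last_run.
pub struct PosixWalk {
    pub root: PathBuf,
}

impl ChangelogSource for PosixWalk {
    fn name(&self) -> &'static str {
        "posix-mtime"
    }

    fn changes_since(&self, _cursor: Option<&str>) -> std::io::Result<ChangeSet> {
        // A real driver would walk self.root in parallel and keep entries whose
        // mtime is at or after the cursor; elided to keep the trait boundary in focus.
        Ok(ChangeSet { changed: Vec::new(), cursor: String::new() })
    }
}

fn main() -> std::io::Result<()> {
    let driver = PosixWalk { root: PathBuf::from("/lustre/scratch") };
    let set = driver.changes_since(None)?;
    println!("{}: {} changed path(s)", driver.name(), set.changed.len());
    Ok(())
}
```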
5. Stripe-aware parallel reader
On Lustre, a "large" file is striped across N OSTs. Reading it sequentially via the read() syscall on the client generates serialized RPCs per OST, wasting the native parallelism. The hpc-stripe crate uses llapi_layout_get_by_path to discover the stripe layout and emits N concurrent reads (one per OST), reassembling them in order through PTCOMM.
On GPFS the equivalent is the NSD API; on CephFS/BeeGFS/WekaFS the reader falls back to per-chunk parallel POSIX. Typical gain on files > 1 GiB with stripe count = 4: 3.2-3.8× over sequential reads (theoretical limit 4×, the gap is reassembly overhead).
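A sketch of the per-chunk parallel POSIX fallback mentioned above, using positional reads (read_at) so no shared file cursor serializes the threads. The chunk size and the collect-into-memory reassembly are simplifications; the real reader streams chunks into PTCOMM frames instead of buffering them:

```rust
use rayon::prelude::*;
use std::fs::File;
use std::os::unix::fs::FileExt; // read_at: positional reads, no shared cursor

/// Split the file into fixed-size chunks and issue positional reads concurrently.
fn read_parallel(path: &str, chunk: u64) -> std::io::Result<Vec<Vec<u8>>> {
    let file = File::open(path)?;
    let len = file.metadata()?.len();
    let chunks = (len + chunk - 1) / chunk;

    (0..chunks)
        .into_par_iter()
        .map(|i| {
            let offset = i * chunk;
            let want = chunk.min(len - offset) as usize;
            let mut buf = vec![0u8; want];
            file.read_at(&mut buf, offset)?; // short reads ignored in this sketch
            Ok(buf)
        })
        .collect()
}

fn main() -> std::io::Result<()> {
    // 64 KiB chunks on an arbitrary local file, purely for demonstration.
    let parts = read_parallel("/etc/hosts", 64 * 1024)?;
    println!("read {} chunk(s)", parts.len());
    Ok(())
}
```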
6. Field benchmarks (Phase 10)
Benchmarks live in the project’s STRESS_RESULTS.md and are reproducible via scripts/stress-1m-files.sh. Representative result:
| Date | FS | Mode | Files | Walker | Backup | Throughput |
|---|---|---|---|---|---|---|
| 2026-05-03 | Lustre (1-OST lab) | lustre | 100,000 | 9.97 s | 135.0 s | 740 files/s |
Critical interpretation notes:
- Walker isolated: ~10K files/s. The Phase 2 ratio (7× faster than `find_one_file`) holds because the walker's logic is identical on ext4 and Lustre.
- Full backup (walker + read + PTCOMM): ~740 files/s on this hardware, read-bound on a single OST. In production with 4-8 OSTs and the stripe-aware reader, throughput scales linearly until the MDT RPC ceiling.
- Extrapolation: 10 M files ≈ 3 h 45 min on the same single-OST hardware, well inside the 8-hour overnight window from Phase 10's original acceptance criteria. On real HPC hardware (8 OSTs, 32 vCPU, 128 GB RAM on the FD), expect 30-90 minutes.
- Zero errors, zero skipped files on catalog-vs-`find` parity validation.
6.1 Memory budget
The MPSC channel is bounded (default channel_depth = 8192 entries). Each entry carries metadata + a small file prefix; large files stream in chunks, not buffered. Target backend RSS:
- < 256 MiB at 1 M files
- < 1 GiB at 1 B files
If RSS grows, the suspect is almost always the product channel_depth × average entry size; for example, 8192 in-flight entries averaging 32 KiB would pin 256 MiB on their own. Lower channel_depth, not the worker count: fewer workers slows the scan without saving RAM.
7. Phase 7 — running the scan on a compute node via Slurm
By default, the backend runs as the subprocess the cdylib spawned on the FD host. Adding scheduler=slurm to the plugin command moves the parallel walker, changelog drain, and stripe reader onto a Slurm allocation, while the cdylib stays on the login node:
Plugin = "podheitor-hpc:path=/lustre/scratch;shard=0/4;mode=lustre;
scheduler=slurm;partition=backup;cpus=8;mem=16G;time=1h"
End-to-end flow:
- The cdylib parses the command and execs `podheitor-hpc-backend backup --scheduler=slurm --partition=backup --cpus=8 ...`.
- That backend instance becomes the submitter: it binds an AF_UNIX socket on the shared FS (under the FileSet root, or `$PODHEITOR_HPC_SHARED_DIR`), then asks `SlurmDriver` to `sbatch --wrap` a worker invocation onto the cluster. `SchedulerDriver::submit` returns the Slurm JobId; the submitter blocks on `accept(2)`.
- The Slurm-allocated worker connects to the socket, `dup2`s the socket fd over its stdout, and runs the same `backup` codepath; PTCOMM frames flow into the socket as if it were stdout.
- The submitter relays bytes verbatim onto its own stdout, which the cdylib's `FilePuller` drains. The relay is opaque to PTCOMM framing; the cap is whatever the socket buffer plus the 64 KiB `relay()` shovel can sustain (≥ 5 GiB/s on 10 GbE). A sketch of this accept-and-relay loop follows the list.
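A sketch of the submitter's accept-and-relay loop from the second and fourth steps above. The socket path, the minimal error handling, and the elided sbatch submission are assumptions; only the relay shape is the point:

```rust
use std::io::{self, Read, Write};
use std::os::unix::net::UnixListener;

fn main() -> io::Result<()> {
    // Hypothetical shared-FS socket path; the real plugin derives it from the
    // FileSet root or $PODHEITOR_HPC_SHARED_DIR.
    let sock_path = "/lustre/scratch/.podheitor-hpc.sock";
    let _ = std::fs::remove_file(sock_path); // stale socket from a previous run
    let listener = UnixListener::bind(sock_path)?;

    // ... here the real submitter calls SchedulerDriver::submit (sbatch --wrap)
    // and records the returned JobId before blocking on accept(2) ...

    let (mut worker, _addr) = listener.accept()?;
    let stdout = io::stdout();
    let mut out = stdout.lock();
    let mut buf = [0u8; 64 * 1024]; // the 64 KiB relay shovel
    loop {
        let n = worker.read(&mut buf)?;
        if n == 0 {
            break; // worker closed its end: job finished (or died)
        }
        out.write_all(&buf[..n])?; // verbatim: opaque to PTCOMM framing
    }
    out.flush()
}
```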
7.1 Slurm-aware throttle
The podheitor-hpc throttle-daemon polls squeue -t R -o "%Q" every 10s; on contention it rewrites /var/lib/podheitor-hpc/throttle.toml. Every running worker watches that file’s mtime on a 5s poll and applies new {rayon_threads, read_buffer_kib, pause_ms} mid-run. The TOML is write-then-rename, so readers never see torn content.
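A sketch of the worker-side watcher loop just described. The field names mirror the prose above, but the exact throttle.toml schema is an assumption:

```rust
use serde::Deserialize;
use std::{fs, path::Path, thread, time::Duration};

/// Assumed schema of throttle.toml; the real file may carry more keys.
#[derive(Deserialize, Debug)]
struct Throttle {
    rayon_threads: usize,
    read_buffer_kib: usize,
    pause_ms: u64,
}

/// Poll the file's mtime every 5 s and re-read it on change. Because the daemon
/// writes via write-then-rename, a reader never observes a half-written file.
fn watch(path: &Path) {
    let mut last = None;
    loop {
        if let Ok(meta) = fs::metadata(path) {
            let mtime = meta.modified().ok();
            if mtime != last {
                last = mtime;
                if let Ok(text) = fs::read_to_string(path) {
                    match toml::from_str::<Throttle>(&text) {
                        // The real worker resizes its pools and buffers here.
                        Ok(t) => println!("applying new throttle: {t:?}"),
                        Err(e) => eprintln!("ignoring malformed throttle.toml: {e}"),
                    }
                }
            }
        }
        thread::sleep(Duration::from_secs(5));
    }
}

fn main() {
    watch(Path::new("/var/lib/podheitor-hpc/throttle.toml"));
}
```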
Why a file (not a Unix socket or signal):
- Operators can `cat` and edit it during incidents.
- It survives a daemon restart.
- One writer / many readers needs zero protocol.
- We already pay 5 s of jitter; adding `inotify` would gate on filesystems that don't propagate it cross-mount (some Lustre client configurations).
8. Restripe-on-restore
Backup preserves bytes — but in HPC, preserving striping is just as important. A 100 GB file striped across 8 OSTs and restored to 1 OST loses 8× read bandwidth. The plugin captures the original layout via llapi_layout_get_by_path and serializes it as a Bacula RestoreObject. On restore, before the first byte of data, the cdylib applies llapi_layout_file_create to recreate the exact stripe layout — then writes bytes normally.
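For experimentation outside the plugin, the same capture/re-apply idea can be driven with the lfs CLI instead of liblustreapi. This is only an illustration of the concept, not the plugin's llapi code path, and it carries over just the stripe count rather than the full layout:

```rust
use std::process::Command;

/// Read the stripe count of an existing Lustre file via `lfs getstripe -c`.
fn capture_stripe(path: &str) -> std::io::Result<String> {
    let out = Command::new("lfs").args(["getstripe", "-c", path]).output()?;
    Ok(String::from_utf8_lossy(&out.stdout).trim().to_string())
}

/// Create the (empty) restore target with the original stripe count
/// *before* any data is written, as section 8 describes.
fn recreate_with_stripe(path: &str, stripe_count: &str) -> std::io::Result<()> {
    let status = Command::new("lfs")
        .args(["setstripe", "-c", stripe_count, path])
        .status()?;
    if !status.success() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            "lfs setstripe failed",
        ));
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Hypothetical source and restore-target paths.
    let count = capture_stripe("/lustre/scratch/input.dat")?;
    recreate_with_stripe("/lustre/scratch/restored.dat", &count)?;
    Ok(())
}
```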
9. Documented anti-patterns
- Don't run the plugin against the root `/` of a parallel filesystem. Always pin to a subtree (`Subtree { path }`) or shard with `inode-hash`. Backing up MDT root inodes risks contention with production metadata RPCs.
- Don't unset `panic = "abort"`. A panic unwinding across the FFI boundary through `pluginIO` is undefined behavior. The release profile aborts; that is the intended behavior.
- Don't run the FD on a Lustre OSS / GPFS NSD server. The plugin is an HPC client; deploying on storage nodes is supported only via the Slurm submit path (Phase 7), where the scan runs in a compute job.
10. License posture
The plugin ships under LicenseRef-PodHeitor-Proprietary. It does not statically link any Bacula AGPLv3 source. The binding is exclusively via independent extern "C" declarations in the bacula-fd-abi crate. The info->plugin_license = "Bacula AGPLv3" field satisfies only the FD loader's runtime gate; there is no transitive AGPL obligation.
Ready to evaluate?
Free 30-day trial for qualified HPC workloads (Lustre, GPFS, BeeGFS, CephFS, WekaFS). We commit to at least a 50% discount versus Bacula Enterprise, Veeam, or Commvault, with more capabilities included.
Heitor Faria — Founder, PodHeitor International
✉ [email protected]
☎ +1 (789) 726-1749 · +55 (61) 98268-4220 (WhatsApp)
🔗 PodHeitor HPC plugin page · Whitepaper PDF (executive)