Technical whitepaper — PodHeitor HPC Backup Plugin for Bacula

Internal architecture, parallelism model, sharding, integration with parallel filesystems (Lustre, GPFS, BeeGFS, CephFS, WekaFS), field benchmarks, and deployment topologies for HPC clusters with billion-file namespaces.

Technical companion to the PodHeitor HPC plugin page. For the executive PDF: PodHeitor HPC Whitepaper (PDF).

1. Problem: Bacula was not designed for HPC

The stock Bacula File Daemon walks the filesystem single-threaded. In findlib/find_one.c, the recursion is a breaddir()/readdir() loop followed by a synchronous save_file() for each entry. That design is reasonable for a typical application server — but on a Lustre filesystem with one billion files, the arithmetic does not work:

  • Average readdir+lstat cost on warm Lustre: ~150 µs per file (bounded by RPC to the MDT).
  • 1 × 10⁹ files × 150 µs = ~41 hours of metadata scan alone, before reading a single byte of data.
  • Bacula Enterprise 18.2 ships plugins for HDFS, Quobyte, NDMP, NetApp and Nutanix — none for the parallel filesystems that actually run modern HPC.

The PodHeitor HPC Backup Plugin closes that gap with aggressive in-job parallelism, cross-job namespace sharding, and per-filesystem changelog drivers.

2. Architectural model

The plugin follows the PodHeitor pattern of cdylib + standalone backend, with PTCOMM (length-tagged framing on stdin/stdout) between the two processes. The motivation is threefold:

  1. Crash isolation. A panic in the walker kills the backend, not bacula-fd. The cdylib observes EOF on the pipe, reports the job as failed, and the FD continues handling other jobs.
  2. Parallelism freedom. The backend can spawn arbitrary rayon threads without violating Bacula’s “one thread per bpContext” contract.
  3. License firewall. Bacula’s loader gates on info->plugin_license at runtime. The subprocess never touches Bacula’s ABI; only the cdylib does, and exclusively through the bacula-fd-abi crate with independent extern "C" declarations — no Bacula AGPLv3 source statically linked.

2.1 Crate map

  • hpc-cdylib: the .so loaded by bacula-fd; a thin metaplugin-rs adapter. (Phase 1)
  • hpc-backend: standalone Rust binary; PTCOMM producer hosting the parallel walker, sharder, and changelog/stripe drivers. (Phase 1)
  • hpc-walker: parallel POSIX walker (rayon work-stealing + bounded crossbeam channel). (Phase 2)
  • hpc-shard: namespace sharding strategies. (Phase 3)
  • hpc-changelog: Lustre/GPFS/CephFS/BeeGFS drivers behind cargo features. (Phases 4-6)
  • hpc-stripe: stripe-aware parallel readers (Lustre llapi, GPFS NSD). (Phases 4-5)
  • hpc-ptcomm: wire format between cdylib and backend. (Phase 1)
  • hpc-cli: operator CLI podheitor-hpc (gen-fileset, shard-probe, bench). (Phase 3)

2.2 Threading model

  • bacula-fd — one thread per job. The cdylib runs on this thread and must never block it for long.
  • hpc-cdylib — single-threaded by Bacula contract. Drains PTCOMM frames synchronously and translates each frame into a save_pkt/pluginIO cycle.
  • hpc-backend — spawns a rayon thread pool sized to nproc. All metadata work is parallel. A single-threaded PTCOMM emitter consumes the bounded MPSC channel and writes to stdout — that is the only serialization point on the producer side.

The practical result: the bottleneck shifts from the FD walker to the consumer (the FD’s own thread draining PTCOMM). The backend produces at filesystem line rate, and the FD becomes the limit — a much higher limit.
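To make that pipeline concrete, here is a minimal, self-contained sketch of the producer side, assuming the crossbeam-channel and rayon crates: parallel workers stat paths and push entries into a bounded channel, while a single emitter thread writes length-tagged frames to stdout. The Entry struct, the frame payload, and the channel depth are illustrative and are not the plugin's real PTCOMM wire format.

use std::io::{self, Write};
use std::path::PathBuf;

use crossbeam_channel::bounded;
use rayon::prelude::*;

/// One metadata record produced by the walker (illustrative fields only).
struct Entry {
    path: PathBuf,
    size: u64,
}

fn main() -> io::Result<()> {
    // Bounded channel: back-pressure keeps backend RSS flat (see 6.1).
    let (tx, rx) = bounded::<Entry>(8192);

    // Emitter: the only thread that ever touches stdout.
    let emitter = std::thread::spawn(move || -> io::Result<()> {
        let mut out = io::BufWriter::new(io::stdout().lock());
        for entry in rx {
            let payload = format!("{}\t{}", entry.path.display(), entry.size);
            // Length-tagged frame: a u32 length prefix, then the payload bytes.
            out.write_all(&(payload.len() as u32).to_le_bytes())?;
            out.write_all(payload.as_bytes())?;
        }
        out.flush()
    });

    // Walker: a rayon work-stealing pool stats paths in parallel.
    let paths: Vec<PathBuf> = std::env::args().skip(1).map(PathBuf::from).collect();
    paths.par_iter().for_each(|p| {
        if let Ok(meta) = std::fs::symlink_metadata(p) {
            // send() blocks while the channel is full; that is the back-pressure.
            let _ = tx.send(Entry { path: p.clone(), size: meta.len() });
        }
    });

    drop(tx); // close the channel so the emitter's loop ends
    emitter.join().expect("emitter thread panicked")
}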

3. Why sharding is mandatory at HPC scale

Even with perfect in-backend parallelism, there is a ceiling: bacula-fd holds a single SD connection per job. Bacula 15.x has no within-job stream multiplexing. The only way to multiply outbound throughput is to run N concurrent jobs, with Maximum Concurrent Jobs applied at Director, Job, Client, and Storage.

The hpc-shard crate makes this an operator dial:

podheitor-hpc gen-fileset --strategy inode-hash --shards 4 \
    --path /lustre/scratch --output /opt/bacula/etc/conf.d/

Produces 4 FileSets and 4 Job snippets, each with:

Plugin = "podheitor-hpc:path=/lustre/scratch;shard=K/4;mode=lustre"

Run them concurrently for 4× outbound SD streams.

3.1 Sharding strategies

  • None: single shard; small datasets or manual configuration. Phase 0 default.
  • InodeHash { shard, of }: the general parallel default, with uniform distribution. xxh3-based and deterministic across runs (see the sketch after this list).
  • Subtree { path }: when different subtrees deserve different policies. Composes well with per-subtree InodeHash.
  • MtimeBucket { shard, of }: incrementals spread across many shards. Pair with Level = Incremental.
  • LustreMdt { shard, of }: Lustre only; pin one shard per MDT. Reads MDT info from lfs df -i and falls back to InodeHash if absent.
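A minimal sketch of the InodeHash { shard, of } decision, assuming the xxhash-rust crate for xxh3 (the function and field names are illustrative, not the hpc-shard API). Every concurrent job runs the same walker; each file is claimed by exactly one shard, and the mapping is stable across full and incremental runs:

use std::os::unix::fs::MetadataExt; // for .ino()
use xxhash_rust::xxh3::xxh3_64;

/// True when this walker instance (shard `shard` of `of`) owns the file.
/// Hashing the inode number is deterministic across runs, so a given file
/// always lands in the same shard regardless of job level or scan order.
fn owned_by_shard(meta: &std::fs::Metadata, shard: u64, of: u64) -> bool {
    xxh3_64(&meta.ino().to_le_bytes()) % of == shard
}

fn main() -> std::io::Result<()> {
    let meta = std::fs::metadata("/etc/hosts")?;
    println!("shard 0/4 owns it: {}", owned_by_shard(&meta, 0, 4));
    Ok(())
}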

3.2 The serialized-volume gotcha

A detail that destroys naïve sharding gains: the Bacula SD serializes block writes per volume. If all N shards land on the same volume, you get N parallel FD walkers feeding 1 serial SD writer — you get correctness (parity, union coverage) but not wall-time speedup.

Phase 3 lab measurement captured exactly this: 4 shards on the same Volume → 13s vs single-stream baseline of 10s (i.e., slower). To unlock the speedup the architecture promises, choose one of three topologies:

  • Per-shard Pool: one Pool per shard, each with its own volume sequence (LabelFormat = "Shard-{N}-"); gen-fileset emits a Pool snippet alongside each FileSet/Job. Cleanest option; volumes never collide.
  • MaximumVolumeJobs = 1: set on the existing default Pool so that Bacula allocates a fresh volume per job. Lighter-weight, but produces lots of small volumes that are harder to manage long-term.
  • Multiple Storages → distinct devices: round-robin via gen-fileset --storage File1,File2,File3,File4; each Storage points at a different SD Device with its own volume pool. Best when the SD has multiple disks/spindles to fan out to.

4. Changelog drivers (true incrementals)

The leap from “full scan” to “real incremental” depends on not making 10⁹ stat() calls. Each parallel filesystem exposes its own changelog mechanism; the hpc-changelog crate abstracts these behind the ChangelogSource trait:

  • Lustre 2.14+: lfs changelog. Captures create/unlink/rename/rmdir/setattr per MDT, with a registered consumer cookie to ensure correct slot release.
  • IBM Spectrum Scale (GPFS) 5.x: mmapplypolicy with a policy template. Captures files with MODIFICATION_TIME > X via the GPFS-native parallel metadata scan.
  • CephFS: recursive rstats + rctime. Whole subtrees can be pruned when rctime <= last_run, a logarithmic cut on the tree.
  • BeeGFS 7.x: metadata-shard scan. Each metadata target is walked in parallel; there is no formal changelog, but the shard scan is already highly parallelizable.
  • POSIX (fallback): PosixWalk with mtime >= last_run, used when none of the above is available.
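The ChangelogSource trait named above is what every driver implements. The method shape below is an illustrative sketch of such a trait, not the crate's published API; the PosixWalk impl shows the fallback row in miniature (one directory level only, to keep it short):

use std::path::PathBuf;
use std::time::SystemTime;

/// What every per-filesystem driver must produce: the paths that changed
/// since the last consumed point, however the filesystem expresses it.
trait ChangelogSource {
    type Error: std::error::Error;

    /// Enumerate entries changed since `since`. Drivers with a real changelog
    /// (Lustre) replay records past their consumer cookie and can ignore
    /// `since`; scan-based drivers (GPFS policy, CephFS rctime, POSIX mtime)
    /// use it as the cut-off.
    fn changed_since(&mut self, since: SystemTime) -> Result<Vec<PathBuf>, Self::Error>;

    /// Acknowledge that everything returned so far has been backed up
    /// (on Lustre, clear the changelog up to the last consumed record).
    fn commit(&mut self) -> Result<(), Self::Error>;
}

/// Fallback driver: plain mtime comparison during a POSIX walk.
struct PosixWalk {
    root: PathBuf,
}

impl ChangelogSource for PosixWalk {
    type Error = std::io::Error;

    fn changed_since(&mut self, since: SystemTime) -> Result<Vec<PathBuf>, Self::Error> {
        let mut out = Vec::new();
        for entry in std::fs::read_dir(&self.root)? {
            let entry = entry?;
            if entry.metadata()?.modified()? >= since {
                out.push(entry.path());
            }
        }
        Ok(out) // a real walker recurses; one level keeps the sketch short
    }

    fn commit(&mut self) -> Result<(), Self::Error> {
        Ok(())
    }
}

fn main() -> std::io::Result<()> {
    let mut src = PosixWalk { root: PathBuf::from("/tmp") };
    let changed = src.changed_since(SystemTime::now() - std::time::Duration::from_secs(3600))?;
    println!("{} entries changed in the last hour", changed.len());
    src.commit()
}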

5. Stripe-aware parallel reader

On Lustre, a “large” file is striped across N OSTs. Reading it sequentially via the read() syscall on the client generates serialized RPCs per OST — wasting native parallelism. The hpc-stripe crate uses llapi_layout_get_by_path to discover the stripe layout and emits N concurrent reads (one per OST), reassembling in-order through PTCOMM.

On GPFS the equivalent is the NSD API; on CephFS/BeeGFS/WekaFS the reader falls back to per-chunk parallel POSIX. Typical gain on files > 1 GiB with stripe count = 4: 3.2-3.8× over sequential reads (theoretical limit 4×, the gap is reassembly overhead).
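A simplified sketch of the per-stripe read pattern, assuming the stripe geometry (stripe_size, stripe_count) has already been obtained from the layout query; the striped_read function and its return shape are illustrative. The real reader issues one stream per OST object; here each thread simply reads the byte ranges that fall on its stripe:

use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt; // read_exact_at
use std::sync::Arc;
use std::thread;

/// Read `path` with one thread per stripe. stripe_size and stripe_count come
/// from the layout query (llapi on Lustre); the values in main() are made up.
fn striped_read(path: &str, stripe_size: u64, stripe_count: u64) -> io::Result<Vec<Vec<u8>>> {
    let file = Arc::new(File::open(path)?);
    let len = file.metadata()?.len();

    let handles: Vec<_> = (0..stripe_count)
        .map(|s| {
            let file = Arc::clone(&file);
            thread::spawn(move || -> io::Result<Vec<u8>> {
                let mut bytes = Vec::new();
                // Stripe s owns every stripe_count-th window of stripe_size bytes.
                let mut offset = s * stripe_size;
                while offset < len {
                    let want = stripe_size.min(len - offset) as usize;
                    let mut buf = vec![0u8; want];
                    file.read_exact_at(&mut buf, offset)?;
                    bytes.extend_from_slice(&buf);
                    offset += stripe_size * stripe_count;
                }
                Ok(bytes)
            })
        })
        .collect();

    // In-order reassembly happens downstream (PTCOMM frames stay ordered);
    // here we just hand back each stripe's bytes in stripe order.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() -> io::Result<()> {
    let stripes = striped_read("/etc/hosts", 1 << 20, 4)?;
    println!("read {} stripes", stripes.len());
    Ok(())
}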

6. Field benchmarks (Phase 10)

Benchmarks live in the project’s STRESS_RESULTS.md and are reproducible via scripts/stress-1m-files.sh. Representative result:

  • 2026-05-03, Lustre (1-OST lab), mode lustre, 100,000 files: walker 9.97 s, full backup 135.0 s, throughput 740 files/s.

Critical interpretation notes:

  • Walker isolated: ~10K files/s — the Phase 2 ratio (7× faster than find_one_file) holds, because the walker’s logic is identical on ext4 and Lustre.
  • Full backup (walker + read + PTCOMM): ~740 files/s on this hardware — read-bound on a single OST. In production with 4-8 OSTs and the stripe-aware reader, throughput scales linearly until the MDT RPC ceiling.
  • Extrapolation: 10M files ≈ 3h45m on the same single-OST hardware — well inside the 8h overnight window from Phase 10’s original AC. On real HPC hardware (8 OSTs, 32 vCPU, 128 GB RAM on the FD), expect 30-90 minutes.
  • Zero errors, zero skipped files on catalog-vs-find parity validation.

6.1 Memory budget

The MPSC channel is bounded (default channel_depth = 8192 entries). Each entry carries metadata + a small file prefix; large files stream in chunks, not buffered. Target backend RSS:

  • < 256 MiB at 1 M files
  • < 1 GiB at 1 B files

If RSS grows, the suspect is almost always the product channel_depth × average entry size. Lower channel_depth, not the worker count (fewer workers only slows the job down without saving RAM).
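Illustrative arithmetic (the per-entry figure is an assumption, not a measurement): at the default channel_depth = 8192, an average in-flight entry of 16 KiB would pin 8192 × 16 KiB = 128 MiB in the channel alone, already half of the 1 M-file budget; halving channel_depth to 4096 halves that term without touching walker parallelism.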

7. Phase 7 — running the scan on a compute node via Slurm

By default the backend runs as the process the cdylib spawned, on the FD host. Adding scheduler=slurm to the plugin command moves the parallel walker, changelog drain, and stripe reader onto a Slurm allocation, while the cdylib stays on the login node:

Plugin = "podheitor-hpc:path=/lustre/scratch;shard=0/4;mode=lustre;
          scheduler=slurm;partition=backup;cpus=8;mem=16G;time=1h"

End-to-end flow:

  1. The cdylib parses the command and execs podheitor-hpc-backend backup --scheduler=slurm --partition=backup --cpus=8 ....
  2. That backend instance becomes the submitter: binds an AF_UNIX socket on the shared FS (under the FileSet root, or $PODHEITOR_HPC_SHARED_DIR), then asks SlurmDriver to sbatch --wrap a worker invocation onto the cluster.
  3. SchedulerDriver::submit returns the Slurm JobId; the submitter blocks on accept(2).
  4. The Slurm-allocated worker connects to the socket, dup2()s the socket fd over its stdout, and runs the same backup codepath — PTCOMM frames flow into the socket as if it were stdout.
  5. The submitter relays bytes verbatim onto its own stdout, which the cdylib’s FilePuller drains (a minimal relay sketch follows this list). The relay is opaque to PTCOMM framing; the ceiling is whatever the socket buffer plus the 64 KiB relay() shovel can sustain (≥ 5 GiB/s on 10 GbE).
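A minimal sketch of the submitter's accept-and-relay step. The socket path is illustrative, and std's io::copy stands in for the real 64 KiB relay() loop; SchedulerDriver::submit appears only in a comment because submission itself is out of scope here:

use std::io;
use std::os::unix::net::UnixListener;

fn main() -> io::Result<()> {
    // Bind on the shared filesystem so the Slurm-allocated worker can find us.
    let sock = "/lustre/scratch/.podheitor-hpc/job-0.sock"; // illustrative path
    let _ = std::fs::remove_file(sock);
    let listener = UnixListener::bind(sock)?;

    // ... the real submitter now calls SchedulerDriver::submit (sbatch --wrap)
    // with this socket path, then blocks on accept(2) ...

    let (mut worker, _addr) = listener.accept()?;

    // Relay PTCOMM bytes verbatim from the worker onto our own stdout, which
    // the cdylib's FilePuller drains. The relay never parses frames.
    let mut out = io::stdout().lock();
    io::copy(&mut worker, &mut out)?;
    Ok(())
}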

7.1 Slurm-aware throttle

The podheitor-hpc throttle-daemon polls squeue -t R -o "%Q" every 10s; on contention it rewrites /var/lib/podheitor-hpc/throttle.toml. Every running worker watches that file’s mtime on a 5s poll and applies new {rayon_threads, read_buffer_kib, pause_ms} mid-run. The TOML is write-then-rename, so readers never see torn content.
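A sketch of both halves of that contract: the daemon's write-then-rename publish and the worker's mtime poll. Field names match the tuple quoted above; TOML parsing is skipped to keep the example dependency-free, and none of this is the daemon's actual code:

use std::fs;
use std::io::{self, Write};
use std::time::{Duration, SystemTime};

const THROTTLE: &str = "/var/lib/podheitor-hpc/throttle.toml";

/// Daemon side: write the new settings to a temp file, then rename() it into
/// place. rename() is atomic within a filesystem, so readers never see a torn file.
fn publish(rayon_threads: usize, read_buffer_kib: usize, pause_ms: u64) -> io::Result<()> {
    let tmp = format!("{THROTTLE}.tmp");
    let mut f = fs::File::create(&tmp)?;
    writeln!(f, "rayon_threads = {rayon_threads}")?;
    writeln!(f, "read_buffer_kib = {read_buffer_kib}")?;
    writeln!(f, "pause_ms = {pause_ms}")?;
    f.sync_all()?;
    fs::rename(&tmp, THROTTLE) // atomic swap into the watched path
}

/// Worker side: poll the file's mtime every 5 s and re-read it when it changes.
fn watch(mut on_change: impl FnMut(String)) -> io::Result<()> {
    let mut last = SystemTime::UNIX_EPOCH;
    loop {
        if let Ok(meta) = fs::metadata(THROTTLE) {
            let mtime = meta.modified()?;
            if mtime > last {
                last = mtime;
                on_change(fs::read_to_string(THROTTLE)?);
            }
        }
        std::thread::sleep(Duration::from_secs(5));
    }
}

fn main() -> io::Result<()> {
    publish(8, 1024, 0)?;
    watch(|toml| println!("applying new throttle settings:\n{toml}"))
}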

Why a file (not a Unix socket or signal):

  • Operators can cat and edit it during incidents.
  • Survives daemon restart.
  • One-writer/many-readers needs zero protocol.
  • We already pay 5s of jitter — adding inotify would gate on filesystems that don’t propagate it cross-mount (some Lustre client configurations).

8. Restripe-on-restore

Backup preserves bytes — but in HPC, preserving striping is just as important. A 100 GB file striped across 8 OSTs and restored to 1 OST loses 8× read bandwidth. The plugin captures the original layout via llapi_layout_get_by_path and serializes it as a Bacula RestoreObject. On restore, before the first byte of data, the cdylib applies llapi_layout_file_create to recreate the exact stripe layout — then writes bytes normally.
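A sketch of the llapi call order, with hand-written extern declarations (the signatures are our reading of Lustre's lustreapi.h and should be treated as assumptions; the libc crate is assumed for the open flags). In the real flow the layout is captured at backup time and carried in the RestoreObject; the sketch collapses capture and re-apply into one function for brevity:

use std::ffi::CString;
use std::os::raw::{c_char, c_int, c_uint};

/// Opaque Lustre layout handle.
#[allow(non_camel_case_types)]
#[repr(C)]
struct llapi_layout {
    _private: [u8; 0],
}

// Assumed signatures mirroring lustre/lustreapi.h; link with -llustreapi.
extern "C" {
    fn llapi_layout_get_by_path(path: *const c_char, flags: c_uint) -> *mut llapi_layout;
    fn llapi_layout_file_create(path: *const c_char, open_flags: c_int, mode: c_int,
                                layout: *const llapi_layout) -> c_int;
    fn llapi_layout_free(layout: *mut llapi_layout);
}

/// Recreate dst with the same stripe layout as src, before writing any data.
fn restripe_like(src: &str, dst: &str) -> std::io::Result<()> {
    let src_c = CString::new(src).unwrap();
    let dst_c = CString::new(dst).unwrap();
    unsafe {
        let layout = llapi_layout_get_by_path(src_c.as_ptr(), 0);
        if layout.is_null() {
            return Err(std::io::Error::last_os_error());
        }
        let fd = llapi_layout_file_create(dst_c.as_ptr(), libc::O_CREAT | libc::O_EXCL,
                                          0o640, layout);
        llapi_layout_free(layout);
        if fd < 0 {
            return Err(std::io::Error::last_os_error());
        }
        libc::close(fd); // data is streamed into the file afterwards
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    restripe_like("/lustre/scratch/original.dat", "/lustre/scratch/restored.dat")
}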

9. Documented anti-patterns

  • Don’t run the plugin against / of a parallel filesystem. Always pin to a subtree (Subtree { path }) or shard with inode-hash. Backing up MDT root inodes risks contention with production metadata RPCs.
  • Don’t unset panic = "abort". A panic across FFI through pluginIO is UB. The release profile aborts — intended behavior.
  • Don’t run the FD on a Lustre OSS / GPFS NSD server. The plugin is an HPC client; deploying on storage nodes is supported only via the Slurm submit path (Phase 7) where the scan runs in a compute job.

10. License posture

The plugin ships under LicenseRef-PodHeitor-Proprietary. It does not statically link any Bacula AGPLv3 source. The binding is exclusively via independent extern "C" declarations in the bacula-fd-abi crate. The info->plugin_license = "Bacula AGPLv3" field satisfies only the FD loader’s runtime gate — there is no transitive AGPL obligation.

Ready to evaluate?

Free 30-day trial for qualified HPC workloads (Lustre, GPFS, BeeGFS, CephFS, WekaFS). We commit to at least a 50% discount compared with Bacula Enterprise, Veeam, or Commvault, with more capabilities included.

Heitor Faria — Founder, PodHeitor International
[email protected]
☎ +1 (789) 726-1749 · +55 (61) 98268-4220 (WhatsApp)
🔗 PodHeitor HPC plugin page · Whitepaper PDF (executive)
