A standalone global dedup engine that operates as a Storage Daemon sidecar, with variable-length chunking (FastCDC), an in-RAM Bloom filter (L1) in front of a RocksDB index (L2), client-side dedup via FD plugin, and Prometheus metrics — competing directly with Bacula Enterprise GED and Commvault DDB.
Companion technical document to the PodHeitor Global Dedup plugin page.
1. The problem: Bacula Enterprise GED is fixed-block + TokyoCabinet
The GED (Global Endpoint Deduplication) in Bacula Enterprise uses fixed-size chunking and a TokyoCabinet B-tree as the hash DB. That has three structural limitations:
- Cascade effect on modifications. Inserting 1 byte at the start of a 10 GB file invalidates every subsequent block — fixed-block boundaries shift. Result: dedup ratio collapses on log workloads, DB dumps and VM snapshots.
- SSD-bound on IOPS. Every TokyoCabinet lookup is an SSD seek. Without a Bloom filter in front, the hot path touches disk per block.
- No client-side dedup. Every byte crosses the FD→SD link before dedup runs. On remote backup, that wastes bandwidth.
PodHeitor GDD attacks all three: FastCDC variable-length chunking, Bloom L1 + RocksDB L2, and client-side fingerprinting via FD plugin before payload transmits.
2. Architecture — SD sidecar + client-side FD plugin
┌─────────────────────────────────────────────────────────────────────────────┐
│ BACULA INFRASTRUCTURE (UNMODIFIED) │
├─────────────────────────────────────────────────────────────────────────────┤
│ ┌──────────┐ ┌──────────────┐ ┌──────────────────────────┐ │
│ │ Bacula │ TCP │ Bacula FD │ TCP │ Bacula SD │ │
│ │ Director │◄──────►│ + PodHeitor │◄──────►│ + PodHeitor SD Plugin │ │
│ │ │ │ FD Plugin │ │ (Rust cdylib v2) │ │
│ └──────────┘ └──────┬───────┘ └───────────┬──────────────┘ │
│ │ Fingerprints │ Data stream │
│ ▼ ▼ │
├─────────────────────────────────────────────────────────────────────────────┤
│ PODHEITOR GDD ENGINE (RUST) │
├─────────────────────────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌───────────────┐ ┌──────────────────────────┐ │
│ │ FastCDC │ │ Bloom Filter │ │ Dedup Index │ │
│ │ Chunker │──►│ (L1 Cache) │──►│ (RocksDB + RAM) │ │
│ │ (variable) │ │ FPR: 0.1% │ │ SHA-256 → ContainerRef │ │
│ └─────────────┘ └───────────────┘ └──────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌──────────────────────────┐ │
│ │ Segment │ │ Container Store │ │
│ │ Locality │────────────────────►│ (append-only files) │ │
│ │ Tracker │ │ + segment grouping │ │
│ └─────────────┘ └──────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
The FD plugin is a Rust cdylib (workspace plugin-gdd). The SD plugin is a Rust cdylib plus an FFI binding (libgdd_ffi_sd.so) that talks to the GDD daemon over TCP+PSK on port 9104, with Prometheus metrics on 9421. Zero modification to Bacula source.
3. FastCDC — variable-length chunking
FastCDC (Fast Content-Defined Chunking) uses a gear-table rolling hash to find content-defined chunk boundaries:
| Parameter | Default | Range |
|---|---|---|
| Min chunk | 4 KB | 2–8 KB |
| Avg chunk | 16 KB | 8–64 KB |
| Max chunk | 64 KB | 32–256 KB |
| Hash bits | 48 | fixed |
Why variable-length over fixed-size? When a modification inserts or removes bytes, fixed-size chunking desyncs every block past the edit point (the cascade effect). Variable-length chunking re-synchronizes at the next content-defined boundary, so only the chunks around the edit change: 50–70% better ratio on DB dumps, logs and VM snapshots.
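To make the boundary mechanics concrete, here is a minimal sketch using the open-source fastcdc (v2020) and sha2 crates with the table's default sizes. The crate choice and the fingerprint_chunks helper are illustrative assumptions, not the shipped chunker.

```rust
// Illustrative sketch: chunk a buffer with the `fastcdc` crate (v2020 variant)
// and fingerprint each chunk with SHA-256. The production chunker may differ.
use fastcdc::v2020::FastCDC;
use sha2::{Digest, Sha256};

// Returns (sha256 fingerprint, offset, length) per chunk.
fn fingerprint_chunks(data: &[u8]) -> Vec<([u8; 32], usize, usize)> {
    // Min / average / max chunk sizes from the table above: 4 KB / 16 KB / 64 KB.
    let chunker = FastCDC::new(data, 4 * 1024, 16 * 1024, 64 * 1024);
    chunker
        .map(|chunk| {
            let slice = &data[chunk.offset..chunk.offset + chunk.length];
            let fp: [u8; 32] = Sha256::digest(slice).as_slice().try_into().unwrap();
            (fp, chunk.offset, chunk.length)
        })
        .collect()
}

fn main() {
    let data = vec![0u8; 1 << 20]; // stand-in for a file's contents
    println!("{} chunks", fingerprint_chunks(&data).len());
}
```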
4. Bloom L1 — SSD I/O elimination
Before any RocksDB query (expensive on SSD), the in-RAM Bloom filter answers “definitely new” or “maybe exists”. Counting Bloom Filter with partitioned hashing:
| Parameter | Value | Rationale |
|---|---|---|
| Expected items | 100M chunks | ~1.6 TB unique data at 16 KB avg |
| FPR target | 0.1% (1 in 1000) | A false positive only costs one extra RocksDB lookup |
| Bits per item | 14.4 | Optimal for 0.1% FPR |
| Total memory | ~180 MB | Fits in RAM easily |
| Hash functions | 10 | k = ln(2) × m/n |
| Persistence | mmap’d file | Instant load, crash-safe |
Measured result: 90%+ SSD I/O reduction on the hot path vs direct DB lookup.
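The sizing in the table follows directly from the standard Bloom-filter formulas; the snippet below just reproduces that arithmetic (bloom_sizing is an illustrative helper, not daemon code).

```rust
// Back-of-the-envelope check of the table above: bits per item, hash count
// and total memory for n expected items at a target false-positive rate p.
fn bloom_sizing(n_items: f64, fpr: f64) -> (f64, f64, f64) {
    let ln2 = std::f64::consts::LN_2;
    let bits_per_item = -fpr.ln() / (ln2 * ln2); // m/n = -ln(p) / ln(2)^2
    let hash_functions = ln2 * bits_per_item;    // k = ln(2) * m/n
    let total_mb = n_items * bits_per_item / 8.0 / 1e6;
    (bits_per_item, hash_functions, total_mb)
}

fn main() {
    // 100M chunks at 0.1% FPR -> ~14.4 bits/item, ~10 hashes, ~180 MB
    let (bpi, k, mb) = bloom_sizing(100e6, 0.001);
    println!("bits/item = {bpi:.1}, k = {k:.1}, memory ≈ {mb:.0} MB");
}
```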
5. Two operating modes
| Mode | Behaviour | When to use |
|---|---|---|
| Mode A | SD plugin transparent; dedup happens server-side; FD untouched | Migration path / legacy clients that can’t change FD |
| Mode B | FD plugin computes fingerprint client-side; only new blocks cross the link | Production default — 60–95% bandwidth saving |
In Mode B, the FD reads the file, FastCDC chunks it, generates a SHA-256 fingerprint per chunk, sends the list to the GDD daemon. The daemon answers “have it” or “need it”. The FD only transmits what’s new.
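A hedged sketch of that round trip follows; the message shapes (FingerprintBatch, BatchVerdict) and the payload_to_send helper are hypothetical names for illustration, since the actual FD↔daemon wire protocol isn't documented on this page.

```rust
// Hypothetical shape of the Mode B exchange, for illustration only.
type Fingerprint = [u8; 32]; // SHA-256 per FastCDC chunk

// FD -> daemon: fingerprints of every chunk in the current file.
struct FingerprintBatch {
    job_id: u64,
    fingerprints: Vec<Fingerprint>,
}

// Daemon -> FD: "have it" vs "need it" per fingerprint.
struct BatchVerdict {
    have: Vec<Fingerprint>,
    need: Vec<Fingerprint>,
}

// FD side: only the chunks the daemon still needs cross the FD -> SD link.
fn payload_to_send(verdict: &BatchVerdict, chunks: &[(Fingerprint, Vec<u8>)]) -> Vec<Vec<u8>> {
    chunks
        .iter()
        .filter(|(fp, _)| verdict.need.contains(fp))
        .map(|(_, data)| data.clone())
        .collect()
}
```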
6. Reliability — fail-closed mandatory
The Mode B lab went through a documented bug sequence (B3a/b/c/d, B4, B5) that taught one operational principle: fail-closed > fail-open. In particular, any error path that leaves state->active=false with sp->no_read=false silently emits 0-byte file records with Termination: Backup OK — the worst possible failure in backup. Production fixes 2026-04-29 → 2026-05-01:
| Bug | Cause | Fix |
|---|---|---|
| B3c (daemon-restart mid-job) | Stale state after reconnect | Fail-closed JOB_START + per-file passthrough fallback (with lseek(fd, 0, SEEK_SET) before reusing fd) |
| B3a (FD kill) | Job stayed in status I forever | Auto-reschedule by Director |
| B3b (SD kill) | Partial volume undetected | Fatal status; recovery via re-run |
| B5 (cargo 0-byte) | VREF build 24.6 MB exceeded STORE_NEW 16 MiB cap | FDRAW passthrough on three fallback paths |
| Foreign-plugin guard | Plugin engaged on ADABAS jobs | SELF_PREFIX gate on PluginCommand |
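The common thread in those fixes can be compressed into one decision rule, sketched below; EngineState, FileOutcome and decide are illustrative Rust names, not the plugin's actual C-side state (state->active, sp->no_read).

```rust
// Conceptual restatement of the fail-closed rule, not the shipped plugin code:
// when the dedup engine is unavailable mid-job, either fall back to raw
// passthrough or fail the job loudly; never report "Backup OK" with 0 bytes.
enum EngineState { Active, Unavailable }

enum FileOutcome {
    Deduplicated,    // normal path through the GDD engine
    RawPassthrough,  // e.g. after a daemon restart, rewinding the fd first
    JobFatal,        // fail-closed: the Director sees an error, never a silent OK
}

fn decide(engine: &EngineState, passthrough_possible: bool) -> FileOutcome {
    match engine {
        EngineState::Active => FileOutcome::Deduplicated,
        EngineState::Unavailable if passthrough_possible => FileOutcome::RawPassthrough,
        EngineState::Unavailable => FileOutcome::JobFatal,
    }
}
```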
7. Garbage collection — concurrent mark-sweep
The GDD daemon runs concurrent vacuum in the background: mark-sweep over the index, refcount drop on containers, compaction of container holes. Unlike BEE GED (maintenance via scheduled “DDB maintenance job”), PodHeitor vacuum:
- Runs at low priority, no maintenance window required.
- Doesn’t block live jobs (read+write on append-only containers).
- Reports progress via the /metrics Prometheus endpoint.
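As a rough illustration of the mark-sweep pass, here is a minimal sketch assuming an in-memory fingerprint → (container, refcount) index; IndexEntry and sweep are hypothetical names, and the real daemon works against RocksDB and append-only containers, not a HashMap.

```rust
// Illustrative vacuum cycle: sweep fingerprints whose refcount reached zero
// and report containers that no longer back any live chunk.
use std::collections::{HashMap, HashSet};

type Fingerprint = [u8; 32];

struct IndexEntry {
    container_id: u64,
    refcount: u64,
}

fn sweep(index: &mut HashMap<Fingerprint, IndexEntry>, all_containers: &HashSet<u64>) -> Vec<u64> {
    let mut live: HashSet<u64> = HashSet::new();

    // Mark: note which containers still hold at least one live chunk,
    // dropping dead index entries as we go.
    index.retain(|_, entry| {
        if entry.refcount == 0 {
            false // dead chunk: remove its fingerprint from the index
        } else {
            live.insert(entry.container_id);
            true
        }
    });

    // Containers with zero live chunks are reclaimable; partially-dead ones
    // would be handled by the hole-compaction pass (not shown).
    all_containers.difference(&live).copied().collect()
}
```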
8. BR/DR — chunk-store recovery
The BR_DR_RUNBOOK.md documents cold-stop tar + LVM/ZFS snapshot + soft-loss recovery (delete bloom/esm, daemon rebuilds). Caveats:
- gdd-fsck --rebuild-index not yet implemented (post-GA item).
- manifests/ is empty in the current state; write path under verification.
- Live-snapshot API not available; use LVM/ZFS at the host level.
9. Prometheus metrics
The /metrics endpoint on port 9421 exposes:
- gdd_chunks_total, gdd_chunks_unique, gdd_chunks_duplicate
- gdd_bytes_in, gdd_bytes_stored, derived ratio
- gdd_bloom_hits, gdd_bloom_false_positives
- gdd_rocksdb_get_seconds_bucket (histogram)
- gdd_vacuum_cycles_total, gdd_vacuum_bytes_reclaimed
- gdd_daemon_rss_bytes (target ≤ 200 MB sustained)
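For context, here is a minimal sketch of how two of these counters could be registered and exposed with the Rust prometheus crate; the metric names come from the list above, but whether the daemon actually uses that crate is an assumption.

```rust
// Sketch: register two of the listed counters and render the text exposition
// format that a scraper would see on :9421/metrics. Assumes the `prometheus`
// crate; the daemon's real exporter may be built differently.
use prometheus::{Encoder, IntCounter, Registry, TextEncoder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let registry = Registry::new();

    let chunks_total = IntCounter::new("gdd_chunks_total", "Chunks processed")?;
    let chunks_duplicate = IntCounter::new("gdd_chunks_duplicate", "Chunks already stored")?;
    registry.register(Box::new(chunks_total.clone()))?;
    registry.register(Box::new(chunks_duplicate.clone()))?;

    chunks_total.inc_by(1_000);
    chunks_duplicate.inc_by(750);

    let mut buf = Vec::new();
    TextEncoder::new().encode(&registry.gather(), &mut buf)?;
    println!("{}", String::from_utf8(buf)?);
    Ok(())
}
```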
10. Documented anti-patterns
- Don’t dedup single files > 134 MiB cap without FDRAW passthrough. The VREF protocol cap is 16 MiB; large files go via OVERSIZE-PASSTHROUGH. The lab proved this with a 12 GB
randombig.bincrossing volume rollover. - Don’t confuse volume size mismatch with a plugin bug. Differences with
abs(N-M) < 65536are Bacula core self-heal — operator runbook B1. - Don’t run concurrent ops on the same container without the daemon active. The daemon is the single source of truth for refcount.
- Don’t kill the daemon mid-job without accepting fail-closed on the current job. Intentional: fail-closed by design vs silent 0-byte bug.
11. License posture
The plugin ships under LicenseRef-PodHeitor-Proprietary. No Bacula AGPLv3 source is statically linked. The FD cdylib, SD cdylib and daemon are all pure Rust binaries. SD integration uses only the public sd_plugins.h Bacula API via extern "C".
Ready to evaluate?
Free 30-day trial for multi-TB workloads with an expected dedup ratio above 4×. We guarantee at least a 50% discount versus Bacula Enterprise GED, Veeam Repository or Commvault DDB, with more features included.
Heitor Faria — Founder, PodHeitor International
✉ [email protected]
☎ +1 (789) 726-1749 · +55 (61) 98268-4220 (WhatsApp)
🔗 PodHeitor Global Dedup plugin page