A standalone global dedup engine that operates as a Storage Daemon sidecar, with variable-length chunking (FastCDC), an in-RAM Bloom filter (L1) in front of a RocksDB index (L2), client-side dedup via FD plugin, and Prometheus metrics — competing directly with Bacula Enterprise GED and Commvault DDB.
Companion technical document to the PodHeitor Global Dedup plugin page.
1. The problem: Bacula Enterprise GED is fixed-block + TokyoCabinet
The GED (Global Endpoint Deduplication) in Bacula Enterprise uses fixed-size chunking and a TokyoCabinet B-tree as the hash DB. That has three structural limitations:
- Cascade effect on modifications. Inserting 1 byte at the start of a 10 GB file invalidates every subsequent block — fixed-block boundaries shift. Result: dedup ratio collapses on log workloads, DB dumps and VM snapshots.
- SSD-bound on IOPS. Every TokyoCabinet lookup is an SSD seek. Without a Bloom filter in front, the hot path touches disk per block.
- No client-side dedup. Every byte crosses the FD→SD link before dedup runs. On remote backup, that wastes bandwidth.
PodHeitor GDD attacks all three: FastCDC variable-length chunking, Bloom L1 + RocksDB L2, and client-side fingerprinting via FD plugin before payload transmits.
2. Architecture — SD sidecar + client-side FD plugin
┌─────────────────────────────────────────────────────────────────────────────┐
│ BACULA INFRASTRUCTURE (UNMODIFIED) │
├─────────────────────────────────────────────────────────────────────────────┤
│ ┌──────────┐ ┌──────────────┐ ┌──────────────────────────┐ │
│ │ Bacula │ TCP │ Bacula FD │ TCP │ Bacula SD │ │
│ │ Director │◄──────►│ + PodHeitor │◄──────►│ + PodHeitor SD Plugin │ │
│ │ │ │ FD Plugin │ │ (Rust cdylib v2) │ │
│ └──────────┘ └──────┬───────┘ └───────────┬──────────────┘ │
│ │ Fingerprints │ Data stream │
│ ▼ ▼ │
├─────────────────────────────────────────────────────────────────────────────┤
│ PODHEITOR GDD ENGINE (RUST) │
├─────────────────────────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌───────────────┐ ┌──────────────────────────┐ │
│ │ FastCDC │ │ Bloom Filter │ │ Dedup Index │ │
│ │ Chunker │──►│ (L1 Cache) │──►│ (RocksDB + RAM) │ │
│ │ (variable) │ │ FPR: 0.1% │ │ SHA-256 → ContainerRef │ │
│ └─────────────┘ └───────────────┘ └──────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌──────────────────────────┐ │
│ │ Segment │ │ Container Store │ │
│ │ Locality │────────────────────►│ (append-only files) │ │
│ │ Tracker │ │ + segment grouping │ │
│ └─────────────┘ └──────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
The FD plugin is a Rust cdylib (workspace plugin-gdd). The SD plugin is a Rust cdylib plus an FFI binding (libgdd_ffi_sd.so) that talks to the GDD daemon over TCP+PSK on port 9104, with Prometheus metrics on 9421. Zero modification to Bacula source.
3. FastCDC — variable-length chunking
FastCDC (Fast Content-Defined Chunking) uses a gear-table rolling hash to find content-defined chunk boundaries:
| Parameter | Default | Range |
|---|---|---|
| Min chunk | 4 KB | 2–8 KB |
| Avg chunk | 16 KB | 8–64 KB |
| Max chunk | 64 KB | 32–256 KB |
| Hash bits | 48 | fixed |
Why variable-length over fixed-size? When a modification inserts or removes bytes, fixed-size chunking desyncs every block past the edit point (the cascade effect). Variable-length chunking re-synchronizes at the next content-defined boundary, so only the chunks around the edit change: 50–70% better ratio on DB dumps, logs and VM snapshots.
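To make the boundary mechanics concrete, here is a minimal sketch using the open-source fastcdc (v2020) and sha2 crates with the table's default sizes. The crate choice and the fingerprint_chunks helper are illustrative assumptions, not the shipped chunker.

```rust
// Illustrative sketch: chunk a buffer with the `fastcdc` crate (v2020 variant)
// and fingerprint each chunk with SHA-256. The production chunker may differ.
use fastcdc::v2020::FastCDC;
use sha2::{Digest, Sha256};

// Returns (sha256 fingerprint, offset, length) per chunk.
fn fingerprint_chunks(data: &[u8]) -> Vec<([u8; 32], usize, usize)> {
    // Min / average / max chunk sizes from the table above: 4 KB / 16 KB / 64 KB.
    let chunker = FastCDC::new(data, 4 * 1024, 16 * 1024, 64 * 1024);
    chunker
        .map(|chunk| {
            let slice = &data[chunk.offset..chunk.offset + chunk.length];
            let fp: [u8; 32] = Sha256::digest(slice).as_slice().try_into().unwrap();
            (fp, chunk.offset, chunk.length)
        })
        .collect()
}

fn main() {
    let data = vec![0u8; 1 << 20]; // stand-in for a file's contents
    println!("{} chunks", fingerprint_chunks(&data).len());
}
```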
4. Bloom L1 — SSD I/O elimination
Before any RocksDB query (expensive on SSD), the in-RAM Bloom filter answers “definitely new” or “maybe exists”. Counting Bloom Filter with partitioned hashing:
| Parameter | Value | Rationale |
|---|---|---|
| Expected items | 100M chunks | ~1.6 TB unique data at 16 KB avg |
| FPR target | 0.1% (1 in 1000) | A false positive only costs one extra RocksDB lookup |
| Bits per item | 14.4 | Optimal for 0.1% FPR |
| Total memory | ~180 MB | Fits in RAM easily |
| Hash functions | 10 | k = ln(2) × m/n |
| Persistence | mmap’d file | Instant load, crash-safe |
Measured result: 90%+ SSD I/O reduction on the hot path vs direct DB lookup.
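The sizing in the table follows directly from the standard Bloom-filter formulas; the snippet below just reproduces that arithmetic (bloom_sizing is an illustrative helper, not daemon code).

```rust
// Back-of-the-envelope check of the table above: bits per item, hash count
// and total memory for n expected items at a target false-positive rate p.
fn bloom_sizing(n_items: f64, fpr: f64) -> (f64, f64, f64) {
    let ln2 = std::f64::consts::LN_2;
    let bits_per_item = -fpr.ln() / (ln2 * ln2); // m/n = -ln(p) / ln(2)^2
    let hash_functions = ln2 * bits_per_item;    // k = ln(2) * m/n
    let total_mb = n_items * bits_per_item / 8.0 / 1e6;
    (bits_per_item, hash_functions, total_mb)
}

fn main() {
    // 100M chunks at 0.1% FPR -> ~14.4 bits/item, ~10 hashes, ~180 MB
    let (bpi, k, mb) = bloom_sizing(100e6, 0.001);
    println!("bits/item = {bpi:.1}, k = {k:.1}, memory ≈ {mb:.0} MB");
}
```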
5. Two operating modes
| Mode | Behaviour | When to use |
|---|---|---|
| Mode A | SD plugin transparent; dedup happens server-side; FD untouched | Migration path / legacy clients that can’t change FD |
| Mode B | FD plugin computes fingerprint client-side; only new blocks cross the link | Production default — 60–95% bandwidth saving |
In Mode B, the FD reads the file, FastCDC chunks it, generates a SHA-256 fingerprint per chunk, sends the list to the GDD daemon. The daemon answers “have it” or “need it”. The FD only transmits what’s new.
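A hedged sketch of that round trip follows; the message shapes (FingerprintBatch, BatchVerdict) and the payload_to_send helper are hypothetical names for illustration, since the actual FD↔daemon wire protocol isn't documented on this page.

```rust
// Hypothetical shape of the Mode B exchange, for illustration only.
type Fingerprint = [u8; 32]; // SHA-256 per FastCDC chunk

// FD -> daemon: fingerprints of every chunk in the current file.
struct FingerprintBatch {
    job_id: u64,
    fingerprints: Vec<Fingerprint>,
}

// Daemon -> FD: "have it" vs "need it" per fingerprint.
struct BatchVerdict {
    have: Vec<Fingerprint>,
    need: Vec<Fingerprint>,
}

// FD side: only the chunks the daemon still needs cross the FD -> SD link.
fn payload_to_send(verdict: &BatchVerdict, chunks: &[(Fingerprint, Vec<u8>)]) -> Vec<Vec<u8>> {
    chunks
        .iter()
        .filter(|(fp, _)| verdict.need.contains(fp))
        .map(|(_, data)| data.clone())
        .collect()
}
```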
6. Reliability — fail-closed mandatory
The Mode B lab went through a documented bug sequence (B3a/b/c/d, B4, B5) that taught one operational principle: fail-closed > fail-open. In particular, any error path that leaves state->active=false with sp->no_read=false silently emits 0-byte file records with Termination: Backup OK — the worst possible failure in backup. Production fixes 2026-04-29 → 2026-05-01:
| Bug | Cause | Fix |
|---|---|---|
| B3c (daemon-restart mid-job) | Stale state after reconnect | Fail-closed JOB_START + per-file passthrough fallback (with lseek(fd, 0, SEEK_SET) before reusing fd) |
| B3a (FD kill) | Job stayed in status I forever | Auto-reschedule by Director |
| B3b (SD kill) | Partial volume undetected | Fatal status; recovery via re-run |
| B5 (cargo 0-byte) | VREF build 24.6 MB exceeded STORE_NEW 16 MiB cap | FDRAW passthrough on three fallback paths |
| Foreign-plugin guard | Plugin engaged on ADABAS jobs | SELF_PREFIX gate on PluginCommand |
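The common thread in those fixes can be compressed into one decision rule, sketched below; EngineState, FileOutcome and decide are illustrative Rust names, not the plugin's actual C-side state (state->active, sp->no_read).

```rust
// Conceptual restatement of the fail-closed rule, not the shipped plugin code:
// when the dedup engine is unavailable mid-job, either fall back to raw
// passthrough or fail the job loudly; never report "Backup OK" with 0 bytes.
enum EngineState { Active, Unavailable }

enum FileOutcome {
    Deduplicated,    // normal path through the GDD engine
    RawPassthrough,  // e.g. after a daemon restart, rewinding the fd first
    JobFatal,        // fail-closed: the Director sees an error, never a silent OK
}

fn decide(engine: &EngineState, passthrough_possible: bool) -> FileOutcome {
    match engine {
        EngineState::Active => FileOutcome::Deduplicated,
        EngineState::Unavailable if passthrough_possible => FileOutcome::RawPassthrough,
        EngineState::Unavailable => FileOutcome::JobFatal,
    }
}
```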
7. Garbage collection — concurrent mark-sweep
The GDD daemon runs concurrent vacuum in the background: mark-sweep over the index, refcount drop on containers, compaction of container holes. Unlike BEE GED (maintenance via scheduled “DDB maintenance job”), PodHeitor vacuum:
- Runs at low priority, no maintenance window required.
- Doesn’t block live jobs (read+write on append-only containers).
- Reports progress via the /metrics Prometheus endpoint.
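As a rough illustration of the mark-sweep pass, here is a minimal sketch assuming an in-memory fingerprint → (container, refcount) index; IndexEntry and sweep are hypothetical names, and the real daemon works against RocksDB and append-only containers, not a HashMap.

```rust
// Illustrative vacuum cycle: sweep fingerprints whose refcount reached zero
// and report containers that no longer back any live chunk.
use std::collections::{HashMap, HashSet};

type Fingerprint = [u8; 32];

struct IndexEntry {
    container_id: u64,
    refcount: u64,
}

fn sweep(index: &mut HashMap<Fingerprint, IndexEntry>, all_containers: &HashSet<u64>) -> Vec<u64> {
    let mut live: HashSet<u64> = HashSet::new();

    // Mark: note which containers still hold at least one live chunk,
    // dropping dead index entries as we go.
    index.retain(|_, entry| {
        if entry.refcount == 0 {
            false // dead chunk: remove its fingerprint from the index
        } else {
            live.insert(entry.container_id);
            true
        }
    });

    // Containers with zero live chunks are reclaimable; partially-dead ones
    // would be handled by the hole-compaction pass (not shown).
    all_containers.difference(&live).copied().collect()
}
```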
8. BR/DR — chunk-store recovery
The BR_DR_RUNBOOK.md documents cold-stop tar + LVM/ZFS snapshot + soft-loss recovery (delete bloom/esm, daemon rebuilds). Caveats:
- gdd-fsck --rebuild-index not yet implemented (post-GA item).
- manifests/ is empty in the current state; write path under verification.
- Live-snapshot API not available; use LVM/ZFS at the host level.
9. Prometheus metrics
The /metrics endpoint on port 9421 exposes:
- gdd_chunks_total, gdd_chunks_unique, gdd_chunks_duplicate
- gdd_bytes_in, gdd_bytes_stored, derived ratio
- gdd_bloom_hits, gdd_bloom_false_positives
- gdd_rocksdb_get_seconds_bucket (histogram)
- gdd_vacuum_cycles_total, gdd_vacuum_bytes_reclaimed
- gdd_daemon_rss_bytes (target ≤ 200 MB sustained)
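For context, here is a minimal sketch of how two of these counters could be registered and exposed with the Rust prometheus crate; the metric names come from the list above, but whether the daemon actually uses that crate is an assumption.

```rust
// Sketch: register two of the listed counters and render the text exposition
// format that a scraper would see on :9421/metrics. Assumes the `prometheus`
// crate; the daemon's real exporter may be built differently.
use prometheus::{Encoder, IntCounter, Registry, TextEncoder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let registry = Registry::new();

    let chunks_total = IntCounter::new("gdd_chunks_total", "Chunks processed")?;
    let chunks_duplicate = IntCounter::new("gdd_chunks_duplicate", "Chunks already stored")?;
    registry.register(Box::new(chunks_total.clone()))?;
    registry.register(Box::new(chunks_duplicate.clone()))?;

    chunks_total.inc_by(1_000);
    chunks_duplicate.inc_by(750);

    let mut buf = Vec::new();
    TextEncoder::new().encode(&registry.gather(), &mut buf)?;
    println!("{}", String::from_utf8(buf)?);
    Ok(())
}
```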
10. Documented anti-patterns
- Don’t dedup single files > 134 MiB cap without FDRAW passthrough. The VREF protocol cap is 16 MiB; large files go via OVERSIZE-PASSTHROUGH. The lab proved this with a 12 GB
randombig.bincrossing volume rollover. - Don’t confuse volume size mismatch with a plugin bug. Differences with
abs(N-M) < 65536are Bacula core self-heal — operator runbook B1. - Don’t run concurrent ops on the same container without the daemon active. The daemon is the single source of truth for refcount.
- Don’t kill the daemon mid-job without accepting fail-closed on the current job. Intentional: fail-closed by design vs silent 0-byte bug.
11. License posture
The plugin ships under LicenseRef-PodHeitor-Proprietary. No Bacula AGPLv3 source is statically linked. The FD cdylib, SD cdylib and daemon are all pure Rust binaries. SD integration uses only the public sd_plugins.h Bacula API via extern "C".
Ready to evaluate?
Free 30-day trial for multi-TB workloads with an expected dedup ratio above 4×. We guarantee at least a 50% discount versus Bacula Enterprise GED, Veeam Repository or Commvault DDB, with more features included.
Heitor Faria — Founder, PodHeitor International
✉ [email protected]
☎ +1 (789) 726-1749 · +55 (61) 98268-4220 (WhatsApp)
🔗 PodHeitor Global Dedup plugin page