HBASE-30137 In FSFT, generate a new manifest in case the latest manifest file gets corrupted #8382

Draft
gvprathyusha6 wants to merge 5 commits into apache:master from gvprathyusha6:hbck_recover_manifest

@gvprathyusha6
Contributor

No description provided.

Prathyusha Garre and others added 5 commits June 20, 2026 01:08
Keep dev-support/design-docs/fsft-manifest-repair.md (the canonical design
referenced by StoreFileListRepair and carrying the two-track procedure+CLI
decision). Delete the two superseded drafts:
- fsft-manifest-repair-lld.md: predates the online-procedure decision
  ("No new RPC, no master integration, no online HBCK plumbing").
- fsft-repair-manifest-copy.md: early offline-only copy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Replace the in-progress online FSFT manifest "repair" path with a single
offline, operator-driven CLI that rebuilds a corrupted FILE store-file-tracker
manifest (.filelist) purely from the on-disk store listing.
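
The disk-only idea described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the StoreFileListRecover code: it uses `java.nio.file` as a stand-in for the Hadoop `FileSystem` API, and the class name, method names, and the filter rule (skip directories and names starting with `.` or `_`) are hypothetical placeholders for the DefaultStoreFileTracker rules the commit refers to.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative sketch only; names and filter rules are assumptions. */
final class DiskOnlyManifestSketch {

  /** Hypothetical filter: keep regular files whose names do not start with '.' or '_'. */
  static boolean looksLikeStoreFile(Path p) {
    String name = p.getFileName().toString();
    return Files.isRegularFile(p) && !name.startsWith(".") && !name.startsWith("_");
  }

  /** Builds the candidate manifest entries purely from what is present in the directory. */
  static List<String> rebuildEntries(Path familyDir) throws IOException {
    List<String> entries = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(familyDir)) {
      for (Path p : stream) {
        if (looksLikeStoreFile(p)) {
          entries.add(p.getFileName().toString());
        }
      }
    }
    // Sort so the result is deterministic regardless of listing order.
    Collections.sort(entries);
    return entries;
  }
}
```

Nothing in this sketch consults lineage or metadata: whatever the listing returns (after filtering) is the whole result, which is the property the commit message emphasizes.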

Engine + CLI:
- StoreFileListRecover: disk-only reconstruction. The recovered manifest is
  exactly the set of store files physically present under the family directory
  (HFiles, references, links), filtered with DefaultStoreFileTracker rules; the
  Reference body is carried into the manifest entry. Nothing is synthesized from
  split/merge lineage.
- For user-table regions it consults hbase:meta for split/merge parents and
  reports data-loss risk (parents with unarchived HFiles) without ever injecting
  parent-derived entries into the manifest.
- isAlreadyHealthy() mirrors the runtime load selection (numeric seqId ordering,
  f1/f2 winner by timestamp) so a no-op cannot mask corruption of a
  higher-seqId tracker file.
- StoreFileListRecoverTool: CLI surface (sftrecover) with safety gates --
  requires --region-offline or --dry-run before writing, refuses hbase:meta
  without --force-meta, refuses non-FILE/MIGRATION trackers.

Removals:
- Drop the online repair surface entirely: RepairFsftRegionProcedure, the
  Hbck.repairFsftRegion RPC + HBaseHbck impl, the Master.proto /
  MasterProcedure.proto RPC + messages + state, and the MasterRpcServices
  handler. Nothing in the master can fence a RegionServer off the store dir
  while a manifest is rewritten, so offline-only is the correct boundary.
- Restore StoreFileListFilePrettyPrinter to a pure read-only viewer (the repair
  logic that had been embedded there now lives in the recover tool).

Wire `hbase sftrecover` into bin/hbase and bin/hbase.cmd. Add
TestStoreFileListRecover (11 tests) and the fsft-manifest-recover design doc.
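
The safety gates listed for the `sftrecover` CLI can be pictured with a small decision function. This is a sketch under assumptions: only the flag names `--region-offline`, `--dry-run`, and `--force-meta` and the FILE/MIGRATION restriction come from the commit message; the class, the method, the order of the checks, and the returned strings are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch only; not the StoreFileListRecoverTool implementation. */
final class RecoverGateSketch {

  /**
   * Returns "write", "dry-run", or a string starting with "refuse".
   * The check order here is an assumption made for the example.
   */
  static String checkGates(String tableName, String trackerImpl, String... flags) {
    Set<String> given = new HashSet<>(Arrays.asList(flags));

    // Only FILE and MIGRATION trackers have a manifest to rebuild.
    if (!trackerImpl.equals("FILE") && !trackerImpl.equals("MIGRATION")) {
      return "refuse: tracker is neither FILE nor MIGRATION";
    }
    // hbase:meta needs an explicit opt-in.
    if (tableName.equals("hbase:meta") && !given.contains("--force-meta")) {
      return "refuse: hbase:meta requires --force-meta";
    }
    // The operator must either assert the region is offline or ask for a dry run.
    if (!given.contains("--region-offline") && !given.contains("--dry-run")) {
      return "refuse: pass --region-offline or --dry-run";
    }
    return given.contains("--dry-run") ? "dry-run" : "write";
  }
}
```

For example, `checkGates("t1", "FILE")` is refused because neither gate flag is present, while `checkGates("t1", "FILE", "--dry-run")` proceeds without writing.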

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

testRestoreSnapshotAfterSplitWithCompactionsDisabled (and its helpers) was
added by the initial branch commit but is unrelated to the offline FSFT
manifest-recover tool that this branch/PR delivers. Restore the file to its
upstream/master state so it no longer appears in the PR diff.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

This test was an empirical exploration harness added by the initial branch
commit: it starts a mini cluster only to LOG whether hbase:meta inherits the
FILE tracker, and asserts nothing meaningful about the recover tool (its own
comments say "we assert nothing definitive ... the LOG output is the real
evidence"). It is not part of the offline FSFT manifest-recover feature, so
remove it from the branch/PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>