Skip to content

Fix tc class/filter leaks on the bridge and reconcile them periodically#289

Open
hiroTamada wants to merge 3 commits into
mainfrom
hypeship/fix-tc-class-leak
Open

Fix tc class/filter leaks on the bridge and reconcile them periodically#289
hiroTamada wants to merge 3 commits into
mainfrom
hypeship/fix-tc-class-leak

Conversation

@hiroTamada

@hiroTamada hiroTamada commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Summary

Per-VM HTB classes and basic filters leak on the bridge on busy hosts. Since the bridge filter list is evaluated linearly for every VM-egress packet, accumulated leaks degrade network throughput for every VM on the host (observed: ~1,880 stale class/filter pairs vs ~6 live VMs on one production host).

Leak mechanisms fixed

  1. Wrong-ID deletes. Cleanup paths that have no persisted class ID (periodic TAP reaper, orphaned-TAP cleanup, recreate-existing-TAP, TAP-create failure) fell back to the hash-derived class ID. Whenever the original tc class add had been collision-probed to hash+N, these paths deleted the wrong ID and leaked the real class and its filter — a feedback loop, since more stale classes cause more probing. removeVMClass now resolves the class from the bridge filter table via the TAP's ifindex (meta(rt_iif eq N) is the authoritative tap→class link). When neither a persisted ID nor a filter resolves, it deletes nothing and defers to the reconciler instead of guessing.

  2. Silent delete failures. removeVMClass ran every tc delete best-effort with no logging. Failures are now logged and counted via hypeman_network_tc_cleanup_failures_total{operation}.

  3. Reconcile never ran and missed filters. CleanupOrphanedClasses only ran at process startup and iterated classes only, so dangling filters (class deleted, filter left, or TAP gone) were invisible to it. It now:

    • runs every TAP GC tick (60s), not just at startup
    • holds the tc mutex so it can't observe a half-applied rate limit
    • sweeps filters not anchored to a live TAP ifindex, then classes not referenced by a live TAP's filter/derived ID or a stored allocation classid
    • bails out if filters exist but no rt_iif ematch parses, so an unexpected tc output format can't cause a sweep of live state
    • counts removals via hypeman_network_tc_orphans_swept_total{kind}
  4. Double-release noise. The persisted classid file is cleared after an instance-scoped release, so the routine stop→delete double release no longer re-deletes the class.

Out-of-band recovery

scripts/sweep-stale-tc.sh applies the same sweep logic standalone (dry-run by default, --apply to delete) to recover already-polluted hosts immediately, without waiting for a deploy/restart. Candidates are snapshotted before keep-sets so anything allocated mid-run is never considered stale.

Test plan

  • go build / go vet on lib/network and lib/instances
  • go test ./lib/network/ (includes new parseBridgeFilters unit tests)
  • bash -n + dry-run of scripts/sweep-stale-tc.sh
  • Full make test-linux integration suite — not run locally (requires KVM + HTB kernel modules not available in my environment); needs CI or a metal host
  • Validate on a polluted staging/prod host: run the sweep script dry-run, compare counts against tc class show dev vmbr0 | wc -l, then --apply and confirm hypeman_network_tc_class_collisions_total and softirq CPU drop

🤖 Generated with Claude Code


Note

Medium Risk
Changes live Linux bridge traffic-control delete and reconciliation logic under tcMu; incorrect sweeps could affect VM upload shaping, though multiple safety bail-outs limit that risk.

Overview
Fixes leaked per-VM HTB classes and basic filters on the bridge, which linearly slow every VM’s egress when they accumulate.

Release and delete paths now resolve the real class via the bridge filter table (meta(rt_iif eq N)) instead of guessing the hash-derived ID after collision probing; they skip deleting when a persisted ID appears owned by another live TAP. Failed tc deletes are logged and counted. Instance-scoped release clears the persisted classid file after a successful delete so stop→delete doesn’t repeat cleanup.

CleanupOrphanedClasses is extended to sweep dangling filters first, then unreferenced classes (live TAP ifindexes, filter flowids, and stored allocation class IDs), holds tcMu to avoid racing async rate-limit setup, and bails if filter output can’t be parsed safely. It runs on every 60s TAP GC tick (even when the TAP pass is skipped), not only at startup.

Adds hypeman_network_tc_cleanup_failures_total and hypeman_network_tc_orphans_swept_total, unit tests for parseBridgeFilters, and scripts/sweep-stale-tc.sh for dry-run / --apply recovery on already-polluted hosts.

Reviewed by Cursor Bugbot for commit 9a7d193. Bugbot is set up for automated code reviews on this repo. Configure here.

Per-VM HTB classes and basic filters leaked on busy hosts, and the
filter list is walked linearly for every VM-egress packet, degrading
all VM network throughput as leaks accumulate.

Root causes fixed:
- Cleanup paths with no persisted class ID fell back to the hash-derived
  ID, deleting the wrong class/filter whenever the original add had been
  collision-probed to a different ID. The class is now resolved from the
  bridge filter table via the TAP's ifindex (the authoritative tap->class
  link); when nothing resolves, cleanup defers to the reconciler instead
  of guessing.
- removeVMClass swallowed every delete error; failures are now logged
  and counted (hypeman_network_tc_cleanup_failures_total).
- CleanupOrphanedClasses only ran at process startup and never swept
  dangling filters. It now runs every TAP GC tick under the tc mutex,
  sweeps filters not anchored to a live TAP, and counts removals
  (hypeman_network_tc_orphans_swept_total). It bails if no rt_iif
  matches parse, so an unexpected tc output format can't trigger a
  sweep of live state.
- The persisted classid is cleared after instance release so the
  routine stop-then-delete double release doesn't re-delete the class.

scripts/sweep-stale-tc.sh applies the same sweep out-of-band (dry-run
by default) to recover already-polluted hosts without a restart.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@hiroTamada hiroTamada marked this pull request as ready for review June 12, 2026 15:33
Comment thread lib/network/bridge_linux.go
@firetiger-agent

Copy link
Copy Markdown

Created a monitoring plan for this PR.

What this PR does: Stops VM network throughput from degrading on busy hosts by fixing stale tc class/filter leaks on the hypeman bridge and running periodic reconciliation every 60s.

Intended effect:

  • hypeman_network_tc_orphans_swept_total: new metric, baseline 0 pre-deploy; confirmed working if it increments on previously-polluted hosts then trends toward 0 as the backlog drains
  • hypeman_network_tc_cleanup_failures_total: new metric, baseline 0 pre-deploy; confirmed healthy if it stays near 0 at steady state (non-zero means a tc delete failed and is now visible for the first time)
  • Hypeman spawn failure rate (kernel_invocation_spawn_total{backend_type="hypeman", success=false}): baseline 9–180/hr; confirmed if it stays in this range post-deploy

Risks:

  • Lock contention on polluted hosts — reconciler holds tcMu while issuing hundreds of tc deletes on hosts with ~1,880 stale entries; signal: kernel_invocation_spawn_total{success=false} spike above 500/hr in the 1–2 min after first GC tick post-deploy; alert if sustained >2h
  • Reconciler disarmed by filter parse mismatch — if tc filter show output format doesn't match expected format, the reconciler bails every tick; signal: hypeman_network_tc_cleanup_failures_total{operation="filter_parse"} incrementing, or WARN log "no rt_iif matches parsed from tc filter output"; alert if non-zero
  • Live class incorrectly swept — if ListAllocations returns stale data, a running VM's class could be deleted, halting its egress; signal: hypeman_network_tc_cleanup_failures_total{operation="filter_list"} non-zero or kernel_invocation_spawn_total{success=false} elevation; alert if >500 failed spawns/hr post-deploy

Status updates will be posted automatically on this PR as monitoring progresses.

View monitor

A persisted classID can go stale (saveClassID failing after a
collision-probed add), in which case deleting by it removes the wrong
class — or worse, one belonging to another live VM if the stale ID
collides. Resolve the class from the TAP's own filter when one exists,
and skip flowid-matched deletes whose anchor is a live interface.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
"bridge", bridgeName, "tap", tapName, "class", fullClassID,
"error", err, "output", strings.TrimSpace(string(output)))
m.recordTCCleanupFailure(ctx, "class_del")
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filter list failure skips safety

Medium Severity

When listBridgeFilters fails in removeVMClass, cleanup still proceeds to delete the HTB class using only the persisted classID. That bypasses the new filter-table checks (including classClaimedByOther) meant to avoid removing another VM’s class after collision probing, reintroducing the wrong-ID delete this change is meant to fix.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 4ea5f81. Configure here.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using high effort and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 9a7d193. Configure here.

}
m.recordTCOrphanSwept(ctx, "filter")
deleted++
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unparsed rt_iif treated as stale

Medium Severity

The CleanupOrphanedClasses and removeVMClass functions might incorrectly delete filters for active VMs. When the rt_iif value from tc filter show output isn't fully parsed, filters for active TAPs are mistakenly identified as orphaned. This can disrupt upload shaping or, in removeVMClass, lead to deleting another VM's filter.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 9a7d193. Configure here.

Comment thread lib/instances/tap_gc.go
return
}
if deleted := m.networkManager.CleanupOrphanedTAPs(ctx, preserve, tapGCMinAge); deleted > 0 {
log.DebugContext(ctx, "TAP GC: no preserve candidates, skipping TAP pass")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TC sweep skipped on list failure

Medium Severity

Periodic bridge tc reconciliation runs inside reconcileTAPs only after ListInstances succeeds. When listing fails, the function returns early and never calls CleanupOrphanedClasses, so stale classes and filters are not swept on that 60s tick despite the PR’s goal of periodic cleanup.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 9a7d193. Configure here.

@sjmiller609 sjmiller609 self-requested a review June 30, 2026 14:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant