Pool Failover: Instant Fallback Loop Between Primary/Backup
Warning — Should be addressed soon
Symptoms
- Miner dashboard or the cgminer API (`stats` or `pools` command) shows rapid alternation between configured pools, with the connection bouncing every 30 seconds to 5 minutes
- Pool-side dashboards on both configured pools show the worker connecting and disconnecting in a tight loop, never sustaining 15+ minutes
- Realized hashrate is 40-70% below nameplate despite chips reporting healthy temperatures and HW% under 2%
- Stratum log lines like `Pool 0 alive`, `Pool 0 not responding`, `Pool 1 alive`, `Pool 1 not responding` repeating in cycles
- Rejected share rate spikes in the seconds following each pool switch (stale work from the previous pool's job context is submitted to the new pool)
- `mining.subscribe` and `mining.authorize` log lines appear repeatedly without the matching long quiet window of healthy hashing
- `Detected new block` notifications come through out of order relative to wall-clock — block hashes alternating between two pools' views
- On Bitaxe / NerdMiner / open-source firmware: `stratumURL` field flips between the two configured URLs every API refresh
- Backup pool — configured `for emergencies` and not expected to take real traffic — now showing the same hashrate share as primary
- Both pools report normal status independently when checked from a different connection (pools aren't down — your path to them is intermittent)
- Total payout this period is materially lower than your hashrate-weighted expectation, with shares scattered across both pool accounts
- Switching the miner to a third unconfigured pool exclusively eliminates the symptoms — confirms failover-policy loop, not a pool-side problem
Step-by-Step Fix
Open the miner's pool configuration UI and add a tertiary pool. Most firmware exposes pool1 / pool2 / pool3 (some up to 8). Pick a third pool with a different operator and a different network path — Ocean, F2Pool, ViaBTC, public-pool.io, solo.ckpool.org, or your self-hosted node, depending on your strategy. The third pool gives the failover state machine somewhere to actually rest. Save, reboot, observe 30 minutes — the loop should immediately quiet down because the miner now has a stable target when both original pools are flapping.
Verify pool order matches your strategy. Pool 1 = your preferred pool (highest weight, lowest latency, best fee). Pool 2 = same payout model as pool 1 (don't mix solo with FPPS unless intentional). Pool 3 = catch-all you'd be okay running on for a week if both top pools went down. Reboot the miner. Confirm the dashboard shows pool 1 as the active connection within 60 seconds of boot.
Hard power-cycle the miner — 30 seconds off at the breaker, then back on. Clears any wedged Stratum task state from the previous loop and confirms the new config takes effect cleanly. Watch the logs for the first 5 minutes after boot — you want to see one clean `mining.subscribe` succeed, then long quiet runs of `mining.notify`, not a rapid sequence of subscribe / authorize / disconnect / reconnect.
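To put a number on "one clean subscribe versus a rapid sequence", a saved log capture can be counted rather than eyeballed. The following is a minimal sketch that assumes your firmware's log lines contain the literal method names `mining.subscribe` and `mining.authorize`, as quoted in this guide; the threshold `max_subscribes=2` is a hypothetical default, not a firmware value.

```python
from collections import Counter

# Substrings taken from the log lines quoted in this guide. The exact log
# format of your firmware is an assumption and may differ.
HANDSHAKE_MARKERS = ("mining.subscribe", "mining.authorize")


def count_handshakes(log_lines):
    """Count how many log lines mention each Stratum handshake method."""
    counts = Counter()
    for line in log_lines:
        for marker in HANDSHAKE_MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


def looks_like_loop(log_lines, max_subscribes=2):
    """Flag a capture that contains more subscribes than a clean boot would."""
    return count_handshakes(log_lines)["mining.subscribe"] > max_subscribes
```

Feed it the lines from the first 5 minutes after boot, for example `looks_like_loop(open("miner.log").read().splitlines())`, where `miner.log` is a placeholder for wherever you saved the capture.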
Verify each configured pool is reachable. From a laptop on the same network: `nc -zv pool-host 3333` for each (substitute correct port). All three should return `succeeded`. If one fails, that pool is wrong for your network path — replace with a different operator or different region. Run the test three times over 15 minutes to catch intermittent pools.
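If you would rather script the repeated check than rerun `nc` by hand, the sketch below does the same TCP-connect test from Python. The function names, the 450-second pause (three probes spread over roughly 15 minutes), and the example host names are illustrative assumptions; substitute your own pool hosts and ports.

```python
import socket
import time


def is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_pools(pools, rounds=3, pause_s=450, sleep=time.sleep):
    """Probe every (host, port) pair `rounds` times; return successes per pool."""
    results = {pool: 0 for pool in pools}
    for i in range(rounds):
        for pool in pools:
            if is_reachable(*pool):
                results[pool] += 1
        if i < rounds - 1:
            sleep(pause_s)  # wait between rounds to catch intermittent pools
    return results
```

A call such as `check_pools([("pool-a.example", 3333), ("pool-b.example", 3333)])` returns a success count per pool; anything below `rounds` marks a pool that dropped at least one probe.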
Disable any unused pool slots. Some firmware retains stale configurations from previous testing; if pool4 through pool8 have leftover URLs, blank them out completely. This ensures the failover state machine isn't probing dead URLs and adding noise to its decision-making.
Set a failback cooldown / minimum dwell time on firmware that exposes it. On DCENT_OS, Braiins OS+, LuxOS, Vnish, NerdNOS: look for `failback_delay`, `min_dwell_time`, or `pool_settle_time`. Set it to 600 seconds (10 minutes). Once the miner is on a fallback pool, it will not fail back to a higher-priority pool until that pool has been continuously available for 10 minutes. This is the single most effective software fix for the loop. Stock Bitmain / MicroBT firmware doesn't expose this; see Tier 3 for the firmware flash.
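The cooldown rule is easier to reason about as a small state machine. The sketch below is not any firmware's actual implementation; `FailbackPolicy` and its methods are hypothetical names, and the choice to fail over to the pool that has been up longest is an assumption of this sketch (it is what lets a stable tertiary pool win when the top two are both flapping).

```python
class FailbackPolicy:
    """Strict-priority pool choice with a failback cooldown (dwell time)."""

    def __init__(self, pool_count, cooldown_s=600):
        self.cooldown_s = cooldown_s
        self.active = 0  # index of the pool in use; 0 is highest priority
        # Time at which each pool's current run of availability began
        # (None while the pool is down).
        self.up_since = [None] * pool_count

    def update(self, now, alive):
        """Feed the current time and per-pool liveness; return the pool to use."""
        for i, ok in enumerate(alive):
            if not ok:
                self.up_since[i] = None  # any outage resets the pool's clock
            elif self.up_since[i] is None:
                self.up_since[i] = now

        # Fail back only to a higher-priority pool that has been
        # continuously available for the full cooldown.
        for i in range(self.active):
            since = self.up_since[i]
            if since is not None and now - since >= self.cooldown_s:
                self.active = i
                return self.active

        # Fail over immediately if the active pool is down.
        if not alive[self.active]:
            candidates = [i for i, ok in enumerate(alive) if ok]
            if candidates:
                # Sketch-level choice: prefer the pool that has been up longest.
                self.active = min(candidates, key=lambda i: (self.up_since[i], i))
        return self.active
```

With a 600-second cooldown, a primary that returns at t=130 is ignored until t=730, and any drop in between restarts its clock, so a flapping primary never pulls the miner back.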
Enable TCP-level keepalives. Stratum has no protocol-level keepalive, so client-side `SO_KEEPALIVE` is the only defence against silent NAT eviction. On firmware that exposes it, set `tcp_keepalive_idle = 60`, `tcp_keepalive_interval = 30`, `tcp_keepalive_count = 5`. The miner now sends a TCP-level probe after 60 seconds of idle, keeping the NAT entry warm, and declares a silent connection dead after about 210 seconds (60 s idle + 5 unanswered probes × 30 s) instead of 300+.
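For anyone running their own Stratum client or proxy, the same three values map onto standard socket options. This is a minimal sketch, assuming a Linux-style TCP stack where `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` are available; the function names are illustrative, and miner firmware may wire these settings differently.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=30, count=5):
    """Turn on TCP keepalive; the per-socket timers are platform-dependent."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE is not defined on every platform, so guard the timer setup.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)


def worst_case_detection_s(idle_s=60, interval_s=30, count=5):
    """Approximate time to declare a silent peer dead: idle wait plus all probes."""
    return idle_s + interval_s * count
```

With the values above, `worst_case_detection_s()` gives 210 seconds: the idle period, then five unanswered probes 30 seconds apart.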
Tune router NAT table. Web UI of router → Advanced → NAT or Connection Tracking. Increase `tcp_established` timeout from default (often 60-600 s) to 7200 s. Increase max NAT entries if your router exposes the setting. If multiple miners share the router, check connection-tracking table size against actual count of miner sessions × 3 pools each. ISP-supplied gateways are often the bottleneck — consider a Mikrotik or OPNsense replacement (see Tier 3).
Switch to weight-based load-balance routing instead of strict priority. cgminer: `--load-balance` plus per-pool `--quota` settings. Braiins OS+ / DCENT_OS / LuxOS: `balance` mode in pool config, with weights (e.g. 70/20/10 for primary/backup/tertiary). This eliminates the loop structurally — every pool is always in use, no failover transition. Trade-off: variance increases slightly, especially if pool 3 has a different fee/payout model. Right answer for chronic-flap environments where uptime matters more than purity.
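To see what a 70/20/10 split means in practice, the sketch below distributes successive work assignments across pools in proportion to their weights using smooth weighted round-robin. This illustrates proportional distribution only; it is not the algorithm cgminer or any named firmware uses, and the pool labels are placeholders.

```python
def weighted_schedule(weights, n):
    """Return n pool names chosen by smooth weighted round-robin."""
    total = sum(weights.values())
    credit = {name: 0 for name in weights}
    picks = []
    for _ in range(n):
        # Every pool earns credit equal to its weight each round...
        for name, w in weights.items():
            credit[name] += w
        # ...and the pool with the most credit gets the next assignment.
        chosen = max(credit, key=credit.get)
        credit[chosen] -= total
        picks.append(chosen)
    return picks
```

Calling `weighted_schedule({"primary": 70, "backup": 20, "tertiary": 10}, 10)` assigns 7 slots to the primary, 2 to the backup, and 1 to the tertiary, interleaved rather than in blocks. Because every pool receives work continuously, there is no failover transition for a loop to form around.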
Verify per-pool worker credentials. If your pool requires worker-specific authorization, ensure each of your configured pools has the right worker name (e.g. `wallet.worker1` for pool 1, possibly a different format for pool 3). A failed `mining.authorize` looks identical to a failed `mining.subscribe` in some firmware logs and contributes to false-positive failovers. Verify by tailing the log for one full pool transition cycle.
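When the firmware log cannot tell the two failures apart, a manual handshake from a laptop can. The sketch below speaks Stratum V1 JSON-RPC (one JSON object per line) and reports which step was rejected. The function names, the `probe/0.1` agent string, and the default password `x` are assumptions for illustration; some pools reply in ways this simple classifier does not cover.

```python
import json
import socket


def subscribe_msg(msg_id=1, agent="probe/0.1"):
    return json.dumps({"id": msg_id, "method": "mining.subscribe", "params": [agent]}) + "\n"


def authorize_msg(worker, password="x", msg_id=2):
    return json.dumps({"id": msg_id, "method": "mining.authorize", "params": [worker, password]}) + "\n"


def classify_reply(line):
    """Return (id, status) for one line received from the pool."""
    reply = json.loads(line)
    if "method" in reply:
        return None, "notification"  # e.g. mining.notify, mining.set_difficulty
    ok = reply.get("error") is None and reply.get("result") not in (None, False)
    return reply.get("id"), "ok" if ok else "rejected"


def probe(host, port, worker, password="x", timeout=10.0):
    """Send subscribe, then authorize, and report the outcome of each step."""
    outcome = {}
    steps = [("mining.subscribe", 1, subscribe_msg()),
             ("mining.authorize", 2, authorize_msg(worker, password))]
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with sock.makefile("r", encoding="utf-8") as reader:
            for name, msg_id, payload in steps:
                sock.sendall(payload.encode("utf-8"))
                outcome[name] = "no reply"
                while True:
                    line = reader.readline()
                    if not line:
                        return outcome  # pool closed the connection
                    got_id, status = classify_reply(line)
                    if got_id == msg_id:
                        outcome[name] = status
                        break
                if outcome[name] != "ok":
                    break  # no point authorizing after a failed subscribe
    return outcome
```

A result of `{"mining.subscribe": "ok", "mining.authorize": "rejected"}` points at the worker name or password for that pool rather than at the network path.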
Antminer only: flash DCENT_OS for full failover-policy controls. Stock Bitmain firmware gives you exactly one failover knob (pool order). DCENT_OS — D-Central's open-source Antminer firmware — exposes the full set: failback cooldown, minimum dwell time, weight-based load balancing, TCP keepalives, per-pool keep-warm probing. Built by Mining Hackers, fully open-source, no licensing. Flash, configure 3 pools with weights 60/30/10 and a 10-minute failback cooldown, reboot, observe 30 minutes. Loop should be gone. Alternatives: Braiins OS+, LuxOS, Vnish — all expose the same controls. Stock Bitmain / MicroBT does not.
Whatsminer / Avalon: use vendor-tool failover settings where they exist. MicroBT btminer firmware exposes a partial set of failover knobs through the BTMiner app and `btminer.conf`. Canaan's AvalonMiner firmware has a similar but more limited config. Both are less flexible than DCENT_OS / Braiins OS+ / LuxOS, but those firmware projects don't currently support Whatsminer / Avalon hardware (DCENT_OS Whatsminer/Avalon support is on D-Central's roadmap, not shipping today). Use what's there: set `pool_keepalive=true` and `pool_failover_minimum_dwell=600` if exposed.
Bitaxe / NerdMiner / open-source: read the upstream issue tracker. ESP-Miner #1618 on GitHub is the canonical thread for failback policy on Bitaxe. NerdNOS handles failover differently and is generally more configurable. If you're seeing a failover loop (not stickiness — see the related Bitaxe Fallback error) on a Bitaxe specifically, your two pools are flapping faster than ESP-Miner's settle window — the structural fix is to add a tertiary or to switch to NerdNOS firmware where supported on your hardware variant.
Run a watchtower script. A small script on a Raspberry Pi, Home Assistant, or any always-on machine: every 60 s, poll each miner's `/api/system/info` (or cgminer API) and snapshot `current_pool`. If the last 5 snapshots show the active pool changing 3 or more times, trigger a `POST /api/system/restart` and log an alert. Cap restart attempts at 3 per hour to prevent restart-loops on top of failover-loops. This is reactive, not preventative, but it catches loops you didn't predict and provides the data for tuning your structural fix.
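A minimal sketch of such a watchtower follows, using only the Python standard library. It assumes the miner serves JSON at `/api/system/info` with a `current_pool` field, as described above; on your firmware the field may be named differently. It counts pool changes between consecutive snapshots rather than distinct pools, because a two- or three-pool configuration can never show four distinct pools in a window; the threshold of 3 changes is an assumption to tune.

```python
import json
import time
import urllib.request
from collections import deque

POLL_S = 60
WINDOW = 5
MAX_RESTARTS_PER_HOUR = 3


def count_pool_changes(snapshots):
    """Number of times the active pool differs between consecutive snapshots."""
    items = list(snapshots)
    return sum(1 for a, b in zip(items, items[1:]) if a != b)


class RestartBudget:
    """Allow at most `limit` restarts in any rolling `period_s` window."""

    def __init__(self, limit=MAX_RESTARTS_PER_HOUR, period_s=3600):
        self.limit, self.period_s = limit, period_s
        self.stamps = deque()

    def allow(self, now):
        while self.stamps and now - self.stamps[0] >= self.period_s:
            self.stamps.popleft()  # forget restarts older than the window
        if len(self.stamps) < self.limit:
            self.stamps.append(now)
            return True
        return False


def current_pool(base_url):
    with urllib.request.urlopen(f"{base_url}/api/system/info", timeout=10) as resp:
        return json.load(resp).get("current_pool")  # field name is an assumption


def restart(base_url):
    req = urllib.request.Request(f"{base_url}/api/system/restart", data=b"", method="POST")
    urllib.request.urlopen(req, timeout=10).close()


def watch(base_url, min_changes=3):
    """Poll one miner forever; restart it when the pool keeps flipping."""
    history, budget = deque(maxlen=WINDOW), RestartBudget()
    while True:
        try:
            history.append(current_pool(base_url))
        except (OSError, ValueError) as exc:
            print("poll failed:", exc)
        if len(history) == WINDOW and count_pool_changes(history) >= min_changes:
            if budget.allow(time.time()):
                print("failover loop detected, restarting", base_url)
                restart(base_url)
                history.clear()
            else:
                print("failover loop detected, restart budget exhausted", base_url)
        time.sleep(POLL_S)
```

Start it with something like `watch("http://192.168.1.50")`, where the address is a placeholder for your miner. Logging each snapshot to a file as well gives you the flap history needed to tune cooldowns and weights.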
Run a stratum proxy in front of all miners (fleet-scale). Braiins Farm Proxy, ckpool's `ckproxy`, or stratum-proxy let you configure failover policy centrally — your miners point to one proxy, the proxy handles all upstream failover with whatever logic you write. Trade-off: extra moving part, single point of failure if the proxy itself goes down. For fleets of 5+ miners this is the right architecture; for a single home miner, Tier 1 + Tier 2 alone is enough. Run the proxy on a small Pi 4 or thin-client mini-PC; configure 3 upstream pools with weights.
When to stop DIY: Tier 1 + Tier 2 + Tier 3 deployed correctly and the loop persists, AND you've proven via single-pool testing that each pool sustains independently. At this point you have a network-path issue (ISP, DNS, NAT) beyond pool config. Open a D-Central support ticket with: 30-minute log capture from the affected miner(s), NAT table size of your router, router model, ISP name, list of pools tested, single-pool test results. We'll triage the upstream cause.
Pool-strategy consultation: if the failover loop revealed your pool selection doesn't match your mining strategy (e.g. solo-to-FPPS mixed without intent), book a Mining Consulting session. We'll structure pool selection, weights, and failover policy against your actual revenue and variance goals — Bitcoin maximalist, Canadian-power-cost-aware, no corporate filler. Typical engagement: 1-2 hours, $150-$400 CAD, output is a written pool-strategy spec you can hand to whoever runs your fleet.
Firmware flash service: if you want DCENT_OS deployed on an Antminer fleet but don't want to flash yourself, D-Central offers firmware flash service alongside ASIC Repair. Drop the controllers off (or ship them in for fleet operators), pick them up flashed and configured. Typical turnaround 5-10 business days. Pricing per controller — check the ASIC Repair service page or contact us for fleet quotes.
When to Seek Professional Repair
If the steps above do not resolve the issue, or if you are not comfortable performing these repairs yourself, professional service is recommended. Attempting advanced repairs without proper equipment can cause further damage.
Still Having Issues?
Our team of Bitcoin Mining Hackers has been repairing ASIC miners since 2016. We have seen it all and fixed it all. Get a professional diagnosis.
