Error Mitigation & Recovery
Overview
HieraChain provides layered, automated recovery mechanisms across three dimensions: network resilience, consensus leader recovery, and state rollback. These operate independently and can be active simultaneously.
6A - Network Recovery
flowchart TB
NET["Network Issue Detected"]
LAT["Collect Latency History"]
ADJ["adjust_timeout()\nRecalculate based on avg + max latency"]
RED["send_with_redundancy()\nSend via N parallel paths simultaneously"]
FIRST["First successful response wins"]
PART["Partition Detected?\navg_latency > 5000ms"]
VC["_initiate_view_change()\nTrigger BFT view change"]
NET --> LAT --> ADJ --> RED --> FIRST
RED --> PART
PART -->|Yes| VC
PART -->|No| FIRST
Strategy: send_with_redundancy() dispatches the same message over N parallel network paths simultaneously. The first successful response wins and remaining in-flight calls are cancelled. This handles intermittent path failures without explicit retry logic.
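The strategy above can be sketched with `asyncio`. This is a minimal illustration, not HieraChain's actual implementation: the `SendPath` type, the function signature, and the `ConnectionError` raised when every path fails are assumptions made for the example.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

# Hypothetical type: a "path" is any coroutine function that sends the
# message and returns the peer's response, raising an exception on failure.
SendPath = Callable[[bytes], Awaitable[bytes]]


async def send_with_redundancy(
    message: bytes, paths: Sequence[SendPath], timeout: float
) -> bytes:
    """Send `message` over every path at once; return the first success."""
    pending = {asyncio.ensure_future(path(message)) for path in paths}
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    try:
        while pending:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            done, pending = await asyncio.wait(
                pending, timeout=remaining, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return task.result()  # first successful response wins
                # A failed path is ignored; the other paths are still racing.
        raise ConnectionError("no path returned a response before the timeout")
    finally:
        # Cancel whatever is still in flight, on success and on failure.
        for task in pending:
            task.cancel()
```

Because failed paths are skipped while the others keep running, a single flaky route costs nothing beyond the duplicated traffic, which is why no explicit retry loop is needed.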
6B - Consensus Recovery (Leader Failure)
sequenceDiagram
autonumber
participant CRE as ConsensusRecoveryEngine
participant VM as BFTViewChangeManager
participant NEW as New Leader
Note over CRE: Leader timeout detected
CRE->>CRE: handle_leader_failure(failed_leader_id, current_view)
CRE->>CRE: Check recovery_attempts < max (default 3)
CRE->>VM: _initiate_view_change(failed_leader, new_view = view + 1)
VM->>VM: Broadcast VIEW-CHANGE to all validators
VM->>NEW: Elect new leader: Validators[new_view % n]
NEW->>NEW: Restart PRE-PREPARE phase
CRE->>CRE: Clear recovery_attempts on success
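The attempt budget and the round-robin election rule `Validators[new_view % n]` from the sequence above can be sketched as follows. `LeaderRecoverySketch` and `mark_recovered()` are hypothetical names used only for illustration, and modelling `recovery_attempts` as a single integer counter is an assumption; the real `ConsensusRecoveryEngine` may track it differently.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LeaderRecoverySketch:
    """Illustrative stand-in for ConsensusRecoveryEngine (not the real class)."""

    validators: List[str]
    max_recovery_attempts: int = 3  # default stated in the diagram
    recovery_attempts: int = 0      # assumed to be a simple counter

    def handle_leader_failure(
        self, failed_leader_id: str, current_view: int
    ) -> Optional[Tuple[int, str]]:
        """Return (new_view, new_leader), or None once the budget is spent."""
        if self.recovery_attempts >= self.max_recovery_attempts:
            # Caller is expected to log a critical error, alert, and halt.
            return None
        self.recovery_attempts += 1
        new_view = current_view + 1
        # Round-robin election: Validators[new_view % n].
        # failed_leader_id would travel in the VIEW-CHANGE broadcast;
        # it is not needed to compute the successor.
        new_leader = self.validators[new_view % len(self.validators)]
        return new_view, new_leader

    def mark_recovered(self) -> None:
        """Clear the counter after a successful view change."""
        self.recovery_attempts = 0
```

With four validators `["v0", "v1", "v2", "v3"]` and `current_view = 0`, the first failure moves to view 1 and elects `v1`. A fourth consecutive failure without an intervening `mark_recovered()` returns `None`.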
6C - State Rollback
flowchart LR
ERR["Critical Error\nor Integrity Failure"]
SNAP["Load Snapshot\n(RollbackManager)"]
JRNL["Replay Journal\n(TransactionJournal)"]
VER["Validate Restored State\n(DataValidator)"]
OK["State Restored"]
ALERT["Alert + Escalate\n(AlertManager)"]
ERR --> SNAP --> JRNL --> VER
VER -->|Valid| OK
VER -->|Invalid| ALERT
Rollback steps:
1. RollbackManager.load_snapshot(): load the most recent consistent snapshot
2. TransactionJournal.replay(): replay committed journal entries since the snapshot
3. DataValidator.validate(): verify the restored state against cryptographic checksums
4. If validation fails: escalation alert sent via Risk Alerts; manual intervention required
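The four steps can be condensed into one function to show how they chain together. This is a sketch under stated assumptions: state is modelled as a flat dictionary, journal entries as `(key, value)` writes, and the checksum as SHA-256 over canonical JSON. None of these choices are confirmed by the source, and `rollback`, `state_checksum`, and `RollbackValidationError` are hypothetical names.

```python
import hashlib
import json
from typing import Any, Dict, Iterable, Tuple

State = Dict[str, Any]


class RollbackValidationError(RuntimeError):
    """Hypothetical error: the restored state failed validation (step 4)."""


def state_checksum(state: State) -> str:
    # Assumed checksum scheme: SHA-256 over JSON with sorted keys.
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def rollback(
    snapshot: State,
    journal: Iterable[Tuple[str, Any]],
    expected_checksum: str,
) -> State:
    state = dict(snapshot)               # step 1: start from the snapshot
    for key, value in journal:           # step 2: replay committed entries
        state[key] = value
    if state_checksum(state) != expected_checksum:   # step 3: validate
        # Step 4: the caller escalates and waits for manual intervention.
        raise RollbackValidationError("restored state does not match checksum")
    return state
```

Copying the snapshot before replay keeps the stored snapshot untouched, so a failed validation leaves it available for a later manual recovery.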
Step-by-Step Breakdown
| Sub-flow | Trigger | Action |
|---|---|---|
| 6A Network | avg_latency > threshold | Adaptive timeout + parallel redundant send |
| 6A Partition | avg_latency > 5000ms | Trigger BFT View Change (BFT Consensus) |
| 6B Leader | leader_timeout | ConsensusRecoveryEngine.handle_leader_failure() → View Change |
| 6B Max retries | recovery_attempts ≥ max | Log critical error, alert, halt consensus |
| 6C Rollback | Integrity failure or critical error | Snapshot → Journal replay → Validate |
Error Handling
| Condition | Behavior |
|---|---|
| All recovery paths exhausted (6B) | Critical alert sent, node halts consensus participation |
| Snapshot not found (6C) | Full rehydration from DB (Chain Rehydration) attempted |
| Journal replay produces invalid state (6C) | Alert escalated via Risk Alerts, manual intervention flagged |
| Network partition heals | Adaptive timeout reduces automatically, normal flow resumes |
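The last row, where the timeout shrinks once a partition heals, follows naturally if the timeout is derived from a rolling window of recent latencies. The sketch below assumes a formula of `avg + max` with a floor at a base timeout; the source only says the timeout is recalculated "based on avg + max latency", so the exact formula, the window size, and the `LatencyTracker` class are assumptions for illustration.

```python
from collections import deque
from statistics import mean

PARTITION_THRESHOLD_MS = 5000.0  # partition threshold stated in this document


class LatencyTracker:
    """Hypothetical helper, not the real NetworkRecoveryManager."""

    def __init__(self, window: int = 20, base_timeout_ms: float = 1000.0):
        self.history = deque(maxlen=window)  # oldest samples fall off
        self.base_timeout_ms = base_timeout_ms

    def record(self, latency_ms: float) -> None:
        self.history.append(latency_ms)

    def adjust_timeout(self) -> float:
        """Assumed formula: avg + max of the window, never below the base."""
        if not self.history:
            return self.base_timeout_ms
        return max(self.base_timeout_ms, mean(self.history) + max(self.history))

    def partition_suspected(self) -> bool:
        return bool(self.history) and mean(self.history) > PARTITION_THRESHOLD_MS
```

Because the window is bounded, slow samples recorded during a partition are eventually pushed out by fresh fast ones, and the computed timeout drops back toward the base value without any explicit reset.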
Key Classes & Methods
| Step | Class / Method | File |
|---|---|---|
| Network adaptive timeout | NetworkRecoveryManager.adjust_timeout() | error_mitigation/recovery_engine.py |
| Redundant send | send_with_redundancy() | error_mitigation/recovery_engine.py |
| Leader failure | ConsensusRecoveryEngine.handle_leader_failure() | error_mitigation/recovery_engine.py |
| View change trigger | BFTViewChangeManager._initiate_view_change() | consensus/bft/consensus.py |
| Snapshot load | RollbackManager.load_snapshot() | error_mitigation/rollback_manager.py |
| Journal replay | TransactionJournal.replay() | error_mitigation/journal.py |
| State validate | DataValidator.validate() | error_mitigation/validator.py |
Related
- BFT Consensus: View Change detail
- Cluster Lockdown: cluster-level recovery
- Chain Rehydration: full chain reload from DB
- Risk Analysis & Alerts: escalation notifications