
Error Mitigation & Recovery

Overview

HieraChain provides layered, automated recovery mechanisms across three dimensions: network resilience, consensus leader recovery, and state rollback. These operate independently and can be active simultaneously.


6A — Network Recovery

flowchart TB
    NET["🌐 Network Issue Detected"]
    LAT["📊 Collect Latency History"]
    ADJ["⏱️ adjust_timeout()\nRecalculate based on avg + max latency"]
    RED["📑 send_with_redundancy()\nSend via N parallel paths simultaneously"]
    FIRST["✅ First successful response wins"]
    PART["⚠️ Partition Detected?\navg_latency > 5000ms"]
    VC["🔄 _initiate_view_change()\nTrigger BFT view change"]

    NET --> LAT --> ADJ --> RED --> FIRST
    RED --> PART
    PART -->|Yes| VC
    PART -->|No| FIRST

Strategy: send_with_redundancy() dispatches the same message over N parallel network paths simultaneously. The first successful response wins and remaining in-flight calls are cancelled. This handles intermittent path failures without explicit retry logic.
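
The strategy above can be sketched with asyncio. This is a minimal illustration, not the HieraChain implementation: the function signature, the `SendFn` path type, and the `ConnectionError` raised when no path succeeds are all assumptions made for the example.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

# Hypothetical path type: an async callable that delivers the message and
# returns the peer's response, raising an exception if the path fails.
SendFn = Callable[[bytes], Awaitable[str]]


async def send_with_redundancy(
    message: bytes, paths: Sequence[SendFn], timeout: float
) -> str:
    """Send `message` over every path at once; return the first success."""
    pending = {asyncio.ensure_future(path(message)) for path in paths}
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    try:
        while pending:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            done, pending = await asyncio.wait(
                pending, timeout=remaining, return_when=asyncio.FIRST_COMPLETED
            )
            # Failed paths are ignored; any successful one ends the wait.
            winners = [task for task in done if task.exception() is None]
            if winners:
                return winners[0].result()
        raise ConnectionError("no redundant path succeeded before the timeout")
    finally:
        # Cancel whatever is still in flight once a winner (or failure) is known.
        for task in pending:
            task.cancel()
```

Because failed paths are simply dropped from the pending set, a single broken route costs nothing as long as one other route answers within the timeout.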


6B — Consensus Recovery (Leader Failure)

sequenceDiagram
    autonumber
    participant CRE as 🔧 ConsensusRecoveryEngine
    participant VM as 🔄 BFTViewChangeManager
    participant NEW as 👑 New Leader

    Note over CRE: Leader timeout detected
    CRE->>CRE: handle_leader_failure(failed_leader_id, current_view)
    CRE->>CRE: Check recovery_attempts < max (default 3)
    CRE->>VM: _initiate_view_change(failed_leader, new_view = view + 1)
    VM->>VM: Broadcast VIEW-CHANGE to all validators
    VM->>NEW: Elect new leader: Validators[new_view % n]
    NEW->>NEW: Restart PRE-PREPARE phase
    CRE->>CRE: Clear recovery_attempts on success
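
The sequence above boils down to a bounded retry counter plus round-robin leader selection. The sketch below illustrates that logic only. The class name `ConsensusRecoverySketch`, the `ConsensusHalted` exception, the `on_recovery_success()` method, and the use of a single integer counter are assumptions, not the actual `ConsensusRecoveryEngine` API; the leader rule `validators[new_view % n]` and the default of 3 attempts come from the diagram.

```python
from dataclasses import dataclass
from typing import List, Tuple


class ConsensusHalted(RuntimeError):
    """Hypothetical error raised once recovery attempts are exhausted."""


@dataclass
class ConsensusRecoverySketch:
    validators: List[str]
    max_recovery_attempts: int = 3  # default taken from the diagram
    recovery_attempts: int = 0

    def handle_leader_failure(
        self, failed_leader_id: str, current_view: int
    ) -> Tuple[int, str]:
        """Return (new_view, new_leader) or raise if retries are exhausted."""
        if self.recovery_attempts >= self.max_recovery_attempts:
            raise ConsensusHalted(
                f"giving up after {self.recovery_attempts} attempts "
                f"(last failed leader: {failed_leader_id})"
            )
        self.recovery_attempts += 1
        new_view = current_view + 1
        # Round-robin election: the view number indexes into the validator list.
        new_leader = self.validators[new_view % len(self.validators)]
        return new_view, new_leader

    def on_recovery_success(self) -> None:
        """Clear the counter once the new leader restarts PRE-PREPARE."""
        self.recovery_attempts = 0
```

With four validators `["v0", "v1", "v2", "v3"]` and `current_view=0`, `handle_leader_failure("v0", 0)` returns `(1, "v1")`. A fourth consecutive failure without an intervening success raises `ConsensusHalted`.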

6C β€” State Rollback

flowchart LR
    ERR["❌ Critical Error\nor Integrity Failure"]
    SNAP["📸 Load Snapshot\n(RollbackManager)"]
    JRNL["📓 Replay Journal\n(TransactionJournal)"]
    VER["🔍 Validate Restored State\n(DataValidator)"]
    OK["✅ State Restored"]
    ALERT["🚨 Alert + Escalate\n(AlertManager)"]

    ERR --> SNAP --> JRNL --> VER
    VER -->|Valid| OK
    VER -->|Invalid| ALERT

Rollback steps:

1. RollbackManager.load_snapshot(): load the most recent consistent snapshot.
2. TransactionJournal.replay(): replay committed journal entries recorded since the snapshot.
3. DataValidator.validate(): verify the restored state against cryptographic checksums.
4. If validation fails, an escalation alert is sent via Risk Alerts and manual intervention is required.
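
The rollback steps can be illustrated end to end with a toy key-value state. This is a sketch under stated assumptions: the state shape, the `(key, value)` journal entry format, the SHA-256-over-JSON checksum, and the names `rollback`, `checksum`, and `RollbackValidationError` are all hypothetical stand-ins for RollbackManager, TransactionJournal, and DataValidator.

```python
import hashlib
import json
from typing import Dict, List, Tuple

State = Dict[str, int]
JournalEntry = Tuple[str, int]  # hypothetical format: (key, new_value)


class RollbackValidationError(RuntimeError):
    """Hypothetical error signalling that the restored state must be escalated."""


def checksum(state: State) -> str:
    """Deterministic SHA-256 digest of the state (keys sorted before hashing)."""
    encoded = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def rollback(
    snapshot: State, journal: List[JournalEntry], expected_checksum: str
) -> State:
    # 1. Load the snapshot (copied so the stored snapshot is never mutated).
    state = dict(snapshot)
    # 2. Replay committed journal entries in order.
    for key, value in journal:
        state[key] = value
    # 3. Validate the restored state against the expected checksum.
    if checksum(state) != expected_checksum:
        # 4. Validation failed: surface the problem instead of returning bad state.
        raise RollbackValidationError("restored state failed checksum validation")
    return state
```

Raising on a checksum mismatch, rather than returning the state, mirrors the "Invalid" branch of the diagram: the caller is forced to alert and escalate instead of continuing with unverified data.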


Step-by-Step Breakdown

| Sub-flow | Trigger | Action |
|---|---|---|
| 6A Network | avg_latency > threshold | Adaptive timeout + parallel redundant send |
| 6A Partition | avg_latency > 5000 ms | Trigger BFT View Change (BFT Consensus) |
| 6B Leader | leader_timeout | ConsensusRecoveryEngine.handle_leader_failure() → View Change |
| 6B Max retries | recovery_attempts ≥ max | Log critical error, alert, halt consensus |
| 6C Rollback | Integrity failure or critical error | Snapshot → Journal replay → Validate |

Error Handling

| Condition | Behavior |
|---|---|
| All recovery paths exhausted (6B) | Critical alert sent, node halts consensus participation |
| Snapshot not found (6C) | Full rehydration from DB (Chain Rehydration) attempted |
| Journal replay produces invalid state (6C) | Alert escalated via Risk Alerts, manual intervention flagged |
| Network partition heals | Adaptive timeout reduces automatically, normal flow resumes |
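
To make the "adaptive timeout reduces automatically" behavior concrete, here is a small sketch of a rolling latency window. The source only says the timeout is recalculated from average and maximum latency and that a partition is suspected above 5000 ms average. The exact formula (mean plus max, floored at a base timeout), the window size, and the class and method names other than `adjust_timeout` are assumptions.

```python
from collections import deque
from statistics import mean

PARTITION_THRESHOLD_MS = 5000.0  # threshold quoted in the 6A flow


class LatencyTracker:
    """Hypothetical helper: rolling window of recent latency samples."""

    def __init__(self, window: int = 5, base_timeout_ms: float = 1000.0) -> None:
        self.samples = deque(maxlen=window)  # old samples fall out automatically
        self.base_timeout_ms = base_timeout_ms

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def adjust_timeout(self) -> float:
        """Assumed formula: mean + max of the window, never below the base."""
        if not self.samples:
            return self.base_timeout_ms
        return max(self.base_timeout_ms, mean(self.samples) + max(self.samples))

    def partition_suspected(self) -> bool:
        return bool(self.samples) and mean(self.samples) > PARTITION_THRESHOLD_MS
```

Because the window is bounded, slow samples recorded during a partition are evicted as fresh, fast samples arrive, so the computed timeout falls back toward the base value without any explicit reset.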

Key Classes & Methods

| Step | Class / Method | File |
|---|---|---|
| Network adaptive timeout | NetworkRecoveryManager.adjust_timeout() | error_mitigation/recovery_engine.py |
| Redundant send | send_with_redundancy() | error_mitigation/recovery_engine.py |
| Leader failure | ConsensusRecoveryEngine.handle_leader_failure() | error_mitigation/recovery_engine.py |
| View change trigger | BFTViewChangeManager._initiate_view_change() | consensus/bft/consensus.py |
| Snapshot load | RollbackManager.load_snapshot() | error_mitigation/rollback_manager.py |
| Journal replay | TransactionJournal.replay() | error_mitigation/journal.py |
| State validate | DataValidator.validate() | error_mitigation/validator.py |