Reliability Guide
Purpose
Describes practices for keeping the system running stably and for making recovery from failures straightforward.
Related Components
- Journal/Recovery: hierachain/error_mitigation/{journal.py, recovery_engine.py, rollback_manager.py}
- Cross-level Sync: hierachain/cluster/{cross_level_sync.py, state_sync_manager.py} (if applicable)
Patterns
- Durable Journal: write each change to the journal before applying it.
- Safe Rollback: state can always be returned to a known safe point.
- Automatic Recovery: predefined handling for standard failure scenarios such as connection loss and database errors.
- Idempotency + Retry with backoff: actions can be repeated without duplicating their effects.
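
The Durable Journal and Safe Rollback patterns can be sketched as follows. This is a minimal illustration, not the API of hierachain/error_mitigation/journal.py or rollback_manager.py: the names `Journal`, `apply_change`, and `rollback`, the JSON-lines file format, and the plain dict used as state are all assumptions made for the example.

```python
import json
import os
import tempfile


class Journal:
    """Hypothetical append-only journal: one JSON record per line."""

    def __init__(self, path):
        self.path = path

    def append(self, entry):
        # Persist the record and force it to disk before returning,
        # so the intent survives a crash that happens mid-change.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def entries(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def apply_change(journal, state, key, new_value):
    """Journal the change (including the previous value) first, then apply it."""
    journal.append({
        "key": key,
        "existed": key in state,
        "old": state.get(key),
        "new": new_value,
    })
    state[key] = new_value


def rollback(journal, state, safe_point):
    """Undo journaled changes in reverse order back to entry index `safe_point`."""
    for entry in reversed(journal.entries()[safe_point:]):
        if entry["existed"]:
            state[entry["key"]] = entry["old"]
        else:
            state.pop(entry["key"], None)


if __name__ == "__main__":
    journal = Journal(os.path.join(tempfile.mkdtemp(), "journal.log"))
    state = {"balance": 100}
    apply_change(journal, state, "balance", 80)
    apply_change(journal, state, "owner", "alice")
    rollback(journal, state, safe_point=0)
    print(state)
```

Storing the previous value in each record is what makes the rollback possible without a separate snapshot. A fuller version would presumably also truncate or mark the rolled-back entries so that a later recovery pass does not replay them.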
Implementation Recommendations
- Journal all important state-changing operations.
- Set reasonable retry limits and timeouts, and make sure retried operations carry idempotency keys.
- Use metrics and alerts to detect abnormal retry loops.
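
As an illustration of the retry recommendations above, here is a minimal sketch of bounded retry with exponential backoff, an idempotency-key guard, and a hook for counting retries. All names (`TransientError`, `retry_with_backoff`, `IdempotentExecutor`, `on_retry`) and the default limits are hypothetical and not taken from the hierachain code; the in-memory result store is an assumption made to keep the example self-contained, and a real deployment would need a durable one.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type for failures worth retrying (e.g. connection loss)."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, on_retry=None):
    """Call `operation` until it succeeds or `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # limit reached: surface the failure instead of looping
            if on_retry is not None:
                on_retry(attempt)  # hook for a retry counter or alert
            # Exponential backoff capped at max_delay, with full jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


class IdempotentExecutor:
    """Remembers results by idempotency key so a repeated request has no second effect."""

    def __init__(self):
        self._results = {}  # assumption: in-memory store, lost on restart

    def execute(self, key, action):
        if key in self._results:
            return self._results[key]  # already applied: return the stored result
        result = action()
        self._results[key] = result
        return result


if __name__ == "__main__":
    executor = IdempotentExecutor()
    ledger = []
    retry_count = []

    def debit():
        # Same key on every attempt, so a retry after an ambiguous failure is safe.
        return executor.execute("tx-1", lambda: ledger.append("debit"))

    retry_with_backoff(debit, on_retry=retry_count.append)
    retry_with_backoff(debit, on_retry=retry_count.append)
    print(ledger, len(retry_count))
```

The `on_retry` callback is where a metrics counter could be incremented. Alerting on that counter, rather than on individual failures, is one way to spot abnormal retry loops.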