Monitoring Module (`hierachain/monitoring/*`)

Overview

The Monitoring module provides end-to-end observability for the HieraChain system. In addition to traditional infrastructure metrics (CPU, RAM, disk), it tracks blockchain-specific indicators such as event throughput, block closing time, and BFT consensus success rate.

Main Components

Performance Monitor

File: performance_monitor.py
- Collects real-time metrics from the system and HieraChain processes.
- Supports custom metrics via callback functions.
- Computes a Health Score for instant system health assessment.

Alert System

File: alert_system.py
- Manages the alert lifecycle, from detection and notification through acknowledgment and resolution.
- Supports multiple notification channels: Email (SMTP/TLS) and Webhooks.
- Escalates automatically when alerts are not handled.

Anomaly Detector

File: alert_system.py
- Detects abnormal behavior using a Z-score algorithm.
- Analyzes historical data in sliding windows to compute the mean and standard deviation.
- Helps detect DDoS attacks and bottleneck congestion early.
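
The sliding-window Z-score idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not HieraChain's actual implementation: the class name `ZScoreDetector`, its parameters, and the default threshold of 3.0 are all hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class ZScoreDetector:
    """Hypothetical sliding-window Z-score detector (illustrative only)."""

    def __init__(self, window_size: int = 30, threshold: float = 3.0):
        # deque(maxlen=...) keeps only the most recent samples.
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        """Return True if `value` deviates strongly from the current window."""
        is_anomaly = False
        if len(self.window) >= 2:
            mu = mean(self.window)
            sigma = pstdev(self.window)
            # A zero standard deviation means a flat history; skip the test
            # rather than divide by zero.
            if sigma > 0:
                is_anomaly = abs(value - mu) / sigma > self.threshold
        # The sample is added to the window whether or not it was flagged.
        self.window.append(value)
        return is_anomaly


detector = ZScoreDetector(window_size=10, threshold=3.0)
baseline = [100, 102, 99, 101, 98, 100, 103, 97, 100, 101]
flags = [detector.observe(v) for v in baseline]  # steady traffic
spike = detector.observe(500)                    # sudden burst
```

With steady values around 100, no sample is flagged, while the jump to 500 is many standard deviations from the window mean and is reported as an anomaly. A production detector might choose to exclude flagged samples from the window so that a sustained attack does not shift the baseline.
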

Blockchain Metrics

File: performance_metrics.py
- Throughput: Number of events processed per second (EPS).
- Latency: Average time for an event to be validated and included in a closed block.
- Consensus Health: Ratio of successful consensus rounds to total rounds, plus convergence time.
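
To make the three metrics concrete, here is a small sketch of how they could be computed from raw observations. The `EventRecord` type and the function names are assumptions made for illustration; they are not taken from `performance_metrics.py`.

```python
from dataclasses import dataclass


@dataclass
class EventRecord:
    """Hypothetical record of one event's timestamps, in seconds."""
    submitted_at: float  # when the event entered the system
    closed_at: float     # when the block containing it was closed


def throughput_eps(events: list[EventRecord], window_seconds: float) -> float:
    """Events processed per second over the observation window."""
    return len(events) / window_seconds


def average_latency(events: list[EventRecord]) -> float:
    """Mean time from submission to block close; 0.0 if there are no events."""
    if not events:
        return 0.0
    return sum(e.closed_at - e.submitted_at for e in events) / len(events)


def consensus_success_ratio(successful_rounds: int, total_rounds: int) -> float:
    """Fraction of consensus rounds that succeeded; 0.0 if none were run."""
    return successful_rounds / total_rounds if total_rounds else 0.0


events = [
    EventRecord(0.0, 0.5),
    EventRecord(1.0, 1.5),
    EventRecord(2.0, 3.0),
    EventRecord(3.0, 4.0),
]
eps = throughput_eps(events, window_seconds=2.0)
latency = average_latency(events)
ratio = consensus_success_ratio(successful_rounds=9, total_rounds=10)
```
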

Monitoring and Alert Workflow

The system operates in a continuous loop to ensure high availability:

graph LR
    subgraph "Data Collection"
        A[System Metrics]
        B[Blockchain Metrics]
        C[Custom Callbacks]
    end

    subgraph "Processing Engine"
        D[Performance Monitor]
        E[Anomaly Detector]
    end

    subgraph "Response Layer"
        F[Health Report]
        G[Alert Manager]
    end

    A & B & C --> D
    D --> E
    E --> G
    D --> F
    G --> H[Email/Webhook Notification]

System Health Score

HieraChain computes an overall health score (0-100) based on weighted alert thresholds:

Status	Score	Meaning
Excellent	90 - 100	System operating perfectly, no alerts.
Good	70 - 89	Stable operation, possibly a few minor alerts.
Poor	< 70	Performance noticeably affected, needs review.
Critical	N/A	At least one metric at Critical Alert level.
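
The thresholds in the table can be expressed as a small classification function. The sketch below is illustrative only: `classify_health` and `penalized_score` are hypothetical names, and the per-severity penalty weights are assumptions, since the way alerts are weighted is not specified here.

```python
# Assumed penalty per active alert level; HieraChain's real weights may differ.
ASSUMED_PENALTIES = {"warning": 5.0, "error": 15.0}


def penalized_score(active_alerts: list[str]) -> float:
    """Start from 100 and subtract an assumed penalty for each active alert."""
    penalty = sum(ASSUMED_PENALTIES.get(level, 0.0) for level in active_alerts)
    return max(0.0, 100.0 - penalty)


def classify_health(score: float, has_critical_alert: bool = False) -> str:
    """Map a 0-100 score to the status labels used in the table."""
    # A critical alert overrides the numeric score entirely.
    if has_critical_alert:
        return "Critical"
    if score >= 90:
        return "Excellent"
    if score >= 70:
        return "Good"
    return "Poor"


score = penalized_score(["warning", "warning", "error"])
status = classify_health(score)
```

Under these assumed weights, two warnings and one error subtract 25 points, so the example falls in the "Good" band.
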

Deployment Example

1. Start Performance Monitoring

from hierachain.monitoring import PerformanceMonitor

monitor = PerformanceMonitor(config={"collection_interval": 10.0})
monitor.start_monitoring()

# Get instant health report
health_score, status = monitor.get_health_score()
print(f"System Health: {status} ({health_score}/100)")

2. Define Alert Rules

from hierachain.monitoring.alert_system import AlertRule, AlertSeverity, AlertCategory

rule = AlertRule(
    rule_id="TPS_DROP",
    name="Sharp throughput drop",
    description="Event throughput dropped below minimum threshold",
    category=AlertCategory.PERFORMANCE,
    metric_name="event_throughput",
    condition="less_than",
    threshold=10.0,
    severity=AlertSeverity.CRITICAL,
    escalation_time=600  # Escalate after 10 minutes if not handled
)
# `alert_manager` is assumed to be an alert manager instance created elsewhere
alert_manager.add_alert_rule(rule)

Notifications and Escalation

When an alert is created but not acknowledged within the specified escalation time, the system:

- Automatically raises the severity level (e.g., from WARNING to CRITICAL).
- Sends additional notifications to emergency recipient lists via Email/Webhook.
- Records detailed logs in the audit system for post-incident investigation.
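
The escalation check can be sketched as a function that runs periodically over open alerts. Everything in this example is a hypothetical model for illustration: the `Severity` enum, the `Alert` dataclass, and `escalate_if_overdue` are not the classes in `alert_system.py`, and a one-step severity bump is an assumed policy.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity ladder, ordered from lowest to highest."""
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass
class Alert:
    rule_id: str
    severity: Severity
    created_at: float       # seconds since some reference time
    escalation_time: float  # seconds allowed before escalation
    acknowledged: bool = False
    escalated: bool = False


def escalate_if_overdue(alert: Alert, now: float) -> bool:
    """Raise severity by one level if the alert is overdue and unacknowledged.

    Returns True when an escalation happened, so the caller can send the
    extra notifications and write the audit log entry.
    """
    overdue = now - alert.created_at >= alert.escalation_time
    if alert.acknowledged or alert.escalated or not overdue:
        return False
    # Cap at CRITICAL so the top level cannot be exceeded.
    alert.severity = Severity(min(alert.severity + 1, Severity.CRITICAL))
    alert.escalated = True
    return True


alert = Alert("TPS_DROP", Severity.WARNING, created_at=0.0, escalation_time=600)
too_early = escalate_if_overdue(alert, now=300.0)
on_time = escalate_if_overdue(alert, now=600.0)
repeat = escalate_if_overdue(alert, now=900.0)
```

Returning a boolean keeps the timing logic separate from side effects such as sending email or calling a webhook, which makes the check easy to test in isolation.
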
