Aegis Sentinel is a high-reliability infrastructure monitoring and auto-healing engine designed to ensure business continuity and infrastructure resilience in enterprise environments. This system implements advanced anomaly detection algorithms and automated recovery mechanisms to prevent catastrophic downtime in digital services.
The platform is built to demonstrate advanced engineering practice, and it is aligned with EB-2 NIW requirements for showing exceptional ability in systems engineering and infrastructure automation.
Aegis Sentinel employs a modular, service-oriented architecture with three core components:
graph LR
A[Metrics Collection] --> B[ML Detection]
B --> C{Anomaly?}
C -- No --> A
C -- Yes --> D[Recovery Engine]
D --> E[Self-Healing Action]
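The loop in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: `run_pipeline`, `threshold_detector`, and `restart_action` are hypothetical stand-ins for the Metrics Collection, ML Detection, and Recovery Engine stages, and a fixed 90.0 threshold replaces the ML detector for brevity.

```python
from typing import Callable, Dict, List

Metrics = Dict[str, float]

def run_pipeline(
    samples: List[Metrics],
    is_anomalous: Callable[[Metrics], bool],
    recover: Callable[[Metrics], str],
) -> List[str]:
    """Pass each metrics sample through detection; on anomaly, run recovery."""
    actions: List[str] = []
    for sample in samples:                    # Metrics Collection (pre-collected here)
        if is_anomalous(sample):              # ML Detection -> "Anomaly?" branch
            actions.append(recover(sample))   # Recovery Engine -> Self-Healing Action
        # "No" branch: fall through and take the next sample
    return actions

# Hypothetical stand-ins for the real detector and healer.
def threshold_detector(sample: Metrics) -> bool:
    return sample["memory_usage"] > 90.0

def restart_action(sample: Metrics) -> str:
    return f"restart (memory_usage={sample['memory_usage']})"

samples = [{"memory_usage": 41.0}, {"memory_usage": 92.5}, {"memory_usage": 55.2}]
actions = run_pipeline(samples, threshold_detector, restart_action)
print(actions)
```

Passing the detector and the recovery step in as callables keeps the loop itself independent of any particular detection model or healing action.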
In today’s digital economy, infrastructure reliability is critical for maintaining essential services. Aegis Sentinel addresses the national interest by:
| Metric | Target | National Interest Benefit |
| :--- | :--- | :--- |
| Detection Latency | < 2.0s | Real-time response to critical failures |
| False Positive Rate | < 5% | Minimizes operational disruption |
| Recovery Success | > 95% | Ensures high availability for essential services |
The system’s ability to detect and remediate issues before they escalate to service outages directly contributes to the stability and reliability of digital services that are essential to modern society.
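As a worked illustration of how the three targets in the table could be checked, the sketch below computes each metric from raw counts. Only the thresholds come from the table; the `RunStats` fields, the `meets_targets` function, and the sample numbers are made up for the example. It also assumes the false positive rate is measured as the share of raised alerts that turned out to be false, which is one possible definition.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class RunStats:
    detection_latencies_s: List[float]  # seconds from fault to detection, per incident
    alerts_raised: int                  # total anomaly alerts
    false_alerts: int                   # alerts that were not real incidents
    recoveries_attempted: int
    recoveries_succeeded: int

def meets_targets(stats: RunStats) -> Dict[str, bool]:
    """Compare measured values against the targets listed in the table."""
    worst_latency = max(stats.detection_latencies_s)
    # Assumption: false positive rate = false alerts / all alerts, in percent.
    false_positive_rate = 100.0 * stats.false_alerts / stats.alerts_raised
    recovery_success = 100.0 * stats.recoveries_succeeded / stats.recoveries_attempted
    return {
        "detection_latency": worst_latency < 2.0,       # target: < 2.0s
        "false_positive_rate": false_positive_rate < 5.0,  # target: < 5%
        "recovery_success": recovery_success > 95.0,    # target: > 95%
    }

# Illustrative numbers only, not measured results.
example = RunStats(
    detection_latencies_s=[0.8, 1.4, 1.9],
    alerts_raised=50,
    false_alerts=2,
    recoveries_attempted=48,
    recoveries_succeeded=46,
)
report = meets_targets(example)
print(report)
```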
git clone https://github.com/PkLavc/aegis-sentinel.git
cd aegis-sentinel
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
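import asyncio
from src.main import AegisSentinel

async def main():
    # Initialize the system with default configuration
    sentinel = AegisSentinel()
    # Start monitoring
    await sentinel.start()
    # The system will automatically detect anomalies and trigger recovery
    # Monitor runs in the background and logs all activities

# `await` is only valid inside a coroutine, so run main() through asyncio
if __name__ == "__main__":
    asyncio.run(main())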
import asyncio
from src.main import AegisSentinel
from src.monitor import MonitoringConfig
from src.detector import DetectionConfig
from src.healer import RecoveryConfig

# Custom configuration
monitoring_config = MonitoringConfig(
    collection_interval=5.0,  # Check every 5 seconds
    api_endpoints=["https://api.example.com/health"],
    enable_network_monitoring=True,
    enable_disk_monitoring=True,
)
detection_config = DetectionConfig(
    isolation_contamination=0.1,
    statistical_threshold_multiplier=3.0,
    min_samples_for_detection=50,
)
recovery_config = RecoveryConfig(
    enable_docker_recovery=True,
    enable_cache_recovery=True,
    max_concurrent_actions=3,
)

async def main():
    # Initialize with custom configuration
    sentinel = AegisSentinel(
        monitoring_config=monitoring_config,
        detection_config=detection_config,
        recovery_config=recovery_config,
    )
    # Start monitoring
    await sentinel.start()
    # Get service status
    status = sentinel.get_status()
    print(f"Service running: {status['service_status']}")
    print(f"Anomalies detected: {status['anomalies_detected']}")
    print(f"Recovery success rate: {status['success_rate']:.1f}%")

# `await` is only valid inside a coroutine, so run main() through asyncio
if __name__ == "__main__":
    asyncio.run(main())
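To make `statistical_threshold_multiplier` and `min_samples_for_detection` concrete, here is a minimal sketch of a standard-deviation threshold check using only the standard library. It assumes the multiplier is applied to the sample standard deviation around the mean of recent values. The function name `is_statistical_anomaly` and this exact rule are assumptions for illustration, and the project's detector may work differently.

```python
import statistics
from typing import Sequence

def is_statistical_anomaly(
    history: Sequence[float],
    value: float,
    threshold_multiplier: float = 3.0,
    min_samples: int = 50,
) -> bool:
    """Flag `value` if it lies more than k standard deviations from the mean."""
    if len(history) < min_samples:
        return False  # not enough data to judge yet
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean  # flat history: any change counts as a deviation
    return abs(value - mean) > threshold_multiplier * stdev

# Synthetic history: 60 readings cycling through 50, 51, 52, 53, 54.
history = [50.0 + (i % 5) for i in range(60)]
typical = is_statistical_anomaly(history, 52.0)
spike = is_statistical_anomaly(history, 92.5)
too_early = is_statistical_anomaly(history[:10], 92.5)
print(typical, spike, too_early)
```

The `min_samples` guard shows why a warm-up period matters: with only ten readings, even a large spike is not flagged, because the mean and standard deviation are not yet trustworthy.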
import asyncio
from src.main import aegis_sentinel_context

async def main():
    # Use context manager for automatic cleanup
    async with aegis_sentinel_context() as sentinel:
        print("Aegis Sentinel is running...")
        # Monitor for 60 seconds
        await asyncio.sleep(60)
        # Get current status
        status = sentinel.get_status()
        print(f"Uptime: {status['uptime_seconds']} seconds")

if __name__ == "__main__":
    asyncio.run(main())
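The pattern behind a helper like `aegis_sentinel_context` can be sketched with `contextlib.asynccontextmanager`. The `DemoService` class and `service_context` function below are hypothetical stand-ins, not the project's code. They only show how start-up and cleanup can be tied to an `async with` block so that shutdown runs even when the body raises.

```python
import asyncio
from contextlib import asynccontextmanager

class DemoService:
    """Hypothetical service with explicit start/stop steps."""

    def __init__(self) -> None:
        self.running = False

    async def start(self) -> None:
        self.running = True

    async def stop(self) -> None:
        self.running = False

@asynccontextmanager
async def service_context():
    service = DemoService()
    await service.start()
    try:
        yield service          # body of the `async with` block runs here
    finally:
        await service.stop()   # runs on normal exit and on exceptions

async def demo() -> tuple:
    async with service_context() as service:
        inside = service.running
    # After the block exits, cleanup has already run.
    return inside, service.running

result = asyncio.run(demo())
print(result)
```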
- collection_interval: Time between metric collections (default: 5.0 seconds)
- api_endpoints: List of API endpoints to monitor for latency and availability
- enable_network_monitoring: Enable network metrics collection
- enable_disk_monitoring: Enable disk usage monitoring
- isolation_contamination: Expected proportion of anomalies in the data (default: 0.1)
- isolation_n_estimators: Number of trees in Isolation Forest (default: 100)
- statistical_threshold_multiplier: Standard deviation multiplier for statistical detection (default: 3.0)
- min_samples_for_detection: Minimum samples required before detection starts (default: 50)
- enable_docker_recovery: Enable Docker container restart recovery
- enable_cache_recovery: Enable cache flushing recovery
- enable_service_recovery: Enable system service restart recovery
- max_concurrent_actions: Maximum concurrent recovery actions (default: 3)
- action_timeout: Default timeout for recovery actions (default: 120.0 seconds)

Run the test suite:
# Using the simple test runner
python run_tests.py
# Using pytest (if available)
pytest tests/
# Run specific test modules
pytest tests/test_monitor.py
pytest tests/test_detector.py
Aegis Sentinel uses structured JSON logging for enterprise auditability:
{
  "timestamp": "2026-02-18T10:15:30Z",
  "level": "WARNING",
  "event": "anomaly_detected",
  "metric": "memory_usage",
  "value": 92.5,
  "action": "docker_container_restart",
  "target": "api_gateway_v1"
}
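A record in this shape can be produced with the standard `logging` module and a small custom formatter. The sketch below is one possible approach, not the project's logger: the `JsonFormatter` class, the `aegis_demo` logger name, and the use of `extra={"fields": ...}` to carry structured fields are assumptions made for illustration.

```python
import io
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Merge structured fields passed via `extra`, if any were supplied.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

stream = io.StringIO()  # captured in memory here; a real setup might use a file handler
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("aegis_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

logger.warning(
    "anomaly_detected",
    extra={"fields": {
        "metric": "memory_usage",
        "value": 92.5,
        "action": "docker_container_restart",
        "target": "api_gateway_v1",
    }},
)

entry = json.loads(stream.getvalue())
print(entry["level"], entry["event"], entry["metric"])
```

Emitting one JSON object per line keeps the log easy to parse with standard tooling, since each line can be decoded on its own.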
import logging
# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
# Log files are automatically created as 'aegis_sentinel.log'
Log levels:
- INFO: General operation status and recovery actions
- WARNING: Anomaly detection events
- ERROR: System errors and failed recovery actions
- DEBUG: Detailed metric collection and detection information

This project follows strict professional standards. All contributions must include:
MIT License - see LICENSE file for details.
Patrick - Computer Engineer

To view other projects and portfolio details, visit: https://pklavc.github.io/projects.html
This project is part of a professional portfolio demonstrating advanced systems engineering capabilities for high-availability infrastructure.