Monitoring & Observability
AxonFlow provides built-in monitoring to help you understand system health, performance, and usage patterns. This guide covers the monitoring capabilities available in the Community edition.
Overview
AxonFlow exposes metrics and health endpoints that integrate with standard monitoring tools:
- Health Endpoints - Check service status
- Prometheus Metrics - Detailed performance data
- Structured Logs - Request and error tracking
Health Endpoints
Agent Health
curl https://YOUR_AGENT_ENDPOINT/health
Response:
{
  "status": "healthy",
  "version": "1.0.0",
  "uptime_seconds": 86400,
  "checks": {
    "database": "ok",
    "redis": "ok"
  }
}
Orchestrator Health
curl https://YOUR_AGENT_ENDPOINT/orchestrator/health
Response:
{
  "status": "healthy",
  "components": {
    "llm_router": true,
    "planning_engine": true
  }
}
Health Check Integration
Use these endpoints for:
- Load balancer health checks - ALB/NLB target health
- Kubernetes probes - Liveness and readiness
- Uptime monitoring - External monitoring services
Example: Kubernetes Probe Configuration
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
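For external uptime monitoring, the same endpoint can be polled by a small script or cron job. A minimal sketch in shell, assuming the response shape shown above and that jq is installed; the endpoint URL is a placeholder:
#!/usr/bin/env bash
# Minimal uptime probe: exits non-zero unless /health reports "healthy".
# AGENT_URL is a placeholder; point it at your deployment.
set -euo pipefail

AGENT_URL="${AGENT_URL:-https://YOUR_AGENT_ENDPOINT}"

status=$(curl -fsS --max-time 5 "${AGENT_URL}/health" | jq -r '.status')

if [ "$status" != "healthy" ]; then
  echo "Agent unhealthy: status=${status}" >&2
  exit 1
fi
echo "Agent healthy"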
Prometheus Metrics
AxonFlow exposes Prometheus-compatible metrics at the /metrics endpoint.
Enabling Metrics
Metrics are enabled by default. Access them at:
curl https://YOUR_AGENT_ENDPOINT/metrics
Key Metrics
Request Metrics
| Metric | Type | Description |
|---|---|---|
| axonflow_requests_total | Counter | Total requests processed |
| axonflow_request_duration_seconds | Histogram | Request latency distribution |
| axonflow_requests_in_flight | Gauge | Currently processing requests |
Policy Metrics
| Metric | Type | Description |
|---|---|---|
| axonflow_policy_evaluations_total | Counter | Policy evaluations performed |
| axonflow_policy_evaluation_duration_seconds | Histogram | Policy evaluation latency |
| axonflow_policy_decisions | Counter | Decisions by result (allow/deny) |
System Metrics
| Metric | Type | Description |
|---|---|---|
| axonflow_database_connections | Gauge | Active database connections |
| axonflow_goroutines | Gauge | Active goroutines |
| axonflow_memory_bytes | Gauge | Memory usage |
Prometheus Configuration
Add AxonFlow to your Prometheus scrape config:
scrape_configs:
  - job_name: 'axonflow-agent'
    static_configs:
      - targets: ['YOUR_AGENT_HOST:8080']
    metrics_path: /metrics
    scrape_interval: 15s

  - job_name: 'axonflow-orchestrator'
    static_configs:
      - targets: ['YOUR_ORCHESTRATOR_HOST:8081']
    metrics_path: /metrics
    scrape_interval: 15s
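If you run Prometheus through the Prometheus Operator on Kubernetes, the equivalent scrape can be declared as a ServiceMonitor. The sketch below makes assumptions about your labels and namespaces; adjust them to match your deployment:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: axonflow-agent
  namespace: monitoring          # assumption: namespace your Prometheus Operator watches
spec:
  selector:
    matchLabels:
      app: axonflow-agent        # assumption: label on the agent Service
  namespaceSelector:
    matchNames:
      - axonflow                 # assumption: namespace of the agent Service
  endpoints:
    - port: http                 # assumption: name of the Service port exposing 8080
      path: /metrics
      interval: 15s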
Logging
AxonFlow outputs structured JSON logs for easy parsing and analysis.
Log Format
{
  "level": "info",
  "timestamp": "2025-11-26T10:30:00Z",
  "message": "Request processed",
  "request_id": "req-abc123",
  "duration_ms": 8,
  "user": "user@example.com",
  "action": "mcp:salesforce:query",
  "decision": "allow"
}
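Because each line is a single JSON object, the logs can be sliced with standard tooling. A couple of jq sketches, assuming logs captured to a local file (agent.log is a placeholder name):
# Denied requests with their latency (field names from the example above)
jq -c 'select(.decision == "deny") | {timestamp, request_id, user, action, duration_ms}' agent.log

# Count error-level entries in the capture
jq -s '[.[] | select(.level == "error")] | length' agent.log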
Log Levels
| Level | Description |
|---|---|
| debug | Detailed debugging information |
| info | Normal operational messages |
| warn | Warning conditions |
| error | Error conditions |
Configuring Log Level
Set via environment variable:
export LOG_LEVEL=info # debug, info, warn, error
CloudWatch Logs (AWS)
When deployed on AWS, logs are automatically sent to CloudWatch:
Log Groups:
/ecs/{STACK_NAME}/agent
/ecs/{STACK_NAME}/orchestrator
/ecs/{STACK_NAME}/customer-portal
View logs:
aws logs tail /ecs/YOUR_STACK/agent --follow --region YOUR_REGION
Search for errors:
aws logs filter-log-events \
  --log-group-name /ecs/YOUR_STACK/agent \
  --filter-pattern "ERROR" \
  --start-time $(date -d '1 hour ago' +%s000)
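Since the log lines are structured JSON, CloudWatch Logs Insights can also query individual fields directly. A sample query, assuming the field names from the log format above:
fields @timestamp, request_id, user, action, duration_ms
| filter decision = "deny" or level = "error"
| sort @timestamp desc
| limit 50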
Basic Alerting
Recommended Alerts
Set up alerts for these conditions:
| Condition | Threshold | Severity |
|---|---|---|
| Service unhealthy | Health check fails for 1 min | Critical |
| High error rate | > 1% of requests | Warning |
| High latency | P99 > 100ms | Warning |
| Database connection errors | Any | Critical |
Prometheus Alerting Rules
groups:
  - name: axonflow
    rules:
      - alert: AxonFlowDown
        expr: up{job="axonflow-agent"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "AxonFlow Agent is down"

      - alert: HighErrorRate
        expr: rate(axonflow_requests_total{status="error"}[5m]) / rate(axonflow_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"

      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(axonflow_request_duration_seconds_bucket[5m])) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency exceeds 100ms"
Docker Compose Monitoring Stack
For local development and testing, use this Docker Compose setup:
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=7d'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  grafana-data:
prometheus.yml:
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'axonflow-agent'
    static_configs:
      - targets: ['host.docker.internal:8080']
  - job_name: 'axonflow-orchestrator'
    static_configs:
      - targets: ['host.docker.internal:8081']
Start the stack:
docker-compose up -d
Access:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin)
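To skip adding the data source by hand, Grafana can be provisioned with Prometheus at startup. Mount a file like the following into /etc/grafana/provisioning/datasources/ in the grafana service (the prometheus hostname resolves on the Compose network):
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true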
Useful Queries
Request Rate
rate(axonflow_requests_total[5m])
Error Rate
rate(axonflow_requests_total{status="error"}[5m]) / rate(axonflow_requests_total[5m])
P99 Latency
histogram_quantile(0.99, rate(axonflow_request_duration_seconds_bucket[5m]))
Policy Evaluation Time
histogram_quantile(0.95, rate(axonflow_policy_evaluation_duration_seconds_bucket[5m]))
Active Connections
axonflow_database_connections
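If dashboards or alerts evaluate these expressions frequently, Prometheus recording rules can precompute them at each evaluation interval. A sketch, using the conventional level:metric:operation naming:
groups:
  - name: axonflow-recording
    rules:
      - record: job:axonflow_request_error_ratio:rate5m
        expr: rate(axonflow_requests_total{status="error"}[5m]) / rate(axonflow_requests_total[5m])
      - record: job:axonflow_request_duration_seconds:p99_5m
        expr: histogram_quantile(0.99, rate(axonflow_request_duration_seconds_bucket[5m]))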
Best Practices
1. Monitor Key Indicators
Focus on these metrics:
- Availability - Health check success rate
- Latency - P50, P95, P99 response times
- Error rate - Percentage of failed requests
- Throughput - Requests per second
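For the availability indicator, scrape success is a reasonable first proxy; a sketch using the up series that Prometheus records for every target:
# Fraction of successful scrapes of the agent over the past day
avg_over_time(up{job="axonflow-agent"}[1d])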
2. Set Appropriate Thresholds
Start with conservative thresholds and tune based on baseline:
- Measure normal operation for 1-2 weeks
- Set warning thresholds at 2x normal
- Set critical thresholds at 5x normal
3. Include Context in Alerts
Alert messages should include:
- What is happening
- What the impact is
- Link to runbook or dashboard
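With the Prometheus rules above, that context lives in the annotations block. An extended version of the AxonFlowDown alert; the runbook and dashboard URLs are placeholders:
- alert: AxonFlowDown
  expr: up{job="axonflow-agent"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "AxonFlow Agent is down"
    description: "No successful scrape of {{ $labels.instance }} for over 1 minute; requests routed through this agent are likely failing."
    runbook_url: "https://wiki.example.com/runbooks/axonflow-down"   # placeholder
    dashboard: "https://grafana.example.com/d/axonflow-overview"     # placeholder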
4. Test Your Monitoring
Periodically verify:
- Alerts fire correctly
- Dashboards show accurate data
- Log aggregation is working
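Alert rules can also be exercised offline with promtool rather than waiting for a real outage. A unit-test sketch for the AxonFlowDown rule, assuming the rules above are saved as alerts.yml:
# tests.yml -- run with: promtool test rules tests.yml
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="axonflow-agent", instance="agent:8080"}'
        values: '0 0 0 0 0'
    alert_rule_test:
      - eval_time: 3m
        alertname: AxonFlowDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: axonflow-agent
              instance: agent:8080
            exp_annotations:
              summary: "AxonFlow Agent is down"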
Next Steps
- Token Usage & Cost Tracking - Monitor LLM token usage and costs
- Deployment Guide - Deploy with monitoring enabled
- Troubleshooting - Debug common issues
- Architecture Overview - Understand system components
Enterprise deployments include pre-configured Grafana dashboards (Executive Summary, Cost Analytics), advanced alerting, and comprehensive audit logging with export capabilities. Contact sales for details.