Monitoring & Observability

AxonFlow provides built-in monitoring to help you understand system health, performance, and usage patterns. This guide covers the monitoring capabilities available in the Community edition.

Overview

AxonFlow exposes metrics and health endpoints that integrate with standard monitoring tools:

  • Health Endpoints - Check service status
  • Prometheus Metrics - Detailed performance data
  • Structured Logs - Request and error tracking

Health Endpoints

Agent Health

curl https://YOUR_AGENT_ENDPOINT/health

Response:

{
  "status": "healthy",
  "version": "1.0.0",
  "uptime_seconds": 86400,
  "checks": {
    "database": "ok",
    "redis": "ok"
  }
}
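
For a quick scripted check (a minimal sketch, assuming jq is installed and the response shape shown above), you can gate on the status field:

STATUS=$(curl -sf https://YOUR_AGENT_ENDPOINT/health | jq -r '.status')
if [ "$STATUS" != "healthy" ]; then
  echo "AxonFlow agent unhealthy: ${STATUS:-no response}" >&2
  exit 1
fi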

Orchestrator Health

curl https://YOUR_AGENT_ENDPOINT/orchestrator/health

Response:

{
  "status": "healthy",
  "components": {
    "llm_router": true,
    "planning_engine": true
  }
}

Health Check Integration

Use these endpoints for:

  • Load balancer health checks - ALB/NLB target health
  • Kubernetes probes - Liveness and readiness
  • Uptime monitoring - External monitoring services

Example: Kubernetes Probe Configuration

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

Prometheus Metrics

AxonFlow exposes Prometheus-compatible metrics at the /metrics endpoint.

Enabling Metrics

Metrics are enabled by default. Access them at:

curl https://YOUR_AGENT_ENDPOINT/metrics
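
To quickly confirm that the AxonFlow metric families described below are being exported, filter the scrape output (a small sketch, assuming a shell with grep available):

curl -s https://YOUR_AGENT_ENDPOINT/metrics | grep '^axonflow_'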

Key Metrics

Request Metrics

Metric                               Type       Description
axonflow_requests_total              Counter    Total requests processed
axonflow_request_duration_seconds    Histogram  Request latency distribution
axonflow_requests_in_flight          Gauge      Currently processing requests

Policy Metrics

Metric                                         Type       Description
axonflow_policy_evaluations_total              Counter    Policy evaluations performed
axonflow_policy_evaluation_duration_seconds    Histogram  Policy evaluation latency
axonflow_policy_decisions                      Counter    Decisions by result (allow/deny)

System Metrics

Metric                          Type   Description
axonflow_database_connections   Gauge  Active database connections
axonflow_goroutines             Gauge  Active goroutines
axonflow_memory_bytes           Gauge  Memory usage

Prometheus Configuration

Add AxonFlow to your Prometheus scrape config:

scrape_configs:
  - job_name: 'axonflow-agent'
    static_configs:
      - targets: ['YOUR_AGENT_HOST:8080']
    metrics_path: /metrics
    scrape_interval: 15s

  - job_name: 'axonflow-orchestrator'
    static_configs:
      - targets: ['YOUR_ORCHESTRATOR_HOST:8081']
    metrics_path: /metrics
    scrape_interval: 15s
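
After reloading Prometheus, you can verify that both jobs are being scraped through the Prometheus targets API (a sketch, assuming Prometheus is reachable on localhost:9090 and jq is installed):

curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | {job: .labels.job, health: .health}'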

Logging

AxonFlow outputs structured JSON logs for easy parsing and analysis.

Log Format

{
  "level": "info",
  "timestamp": "2025-11-26T10:30:00Z",
  "message": "Request processed",
  "request_id": "req-abc123",
  "duration_ms": 8,
  "user": "user@example.com",
  "action": "mcp:salesforce:query",
  "decision": "allow"
}
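
Because every line is a JSON object, the logs are easy to slice with jq. For example (a sketch, assuming the field names shown above and that agent.log is wherever your deployment writes the agent's stdout), list denied requests:

jq -c 'select(.decision == "deny") | {timestamp, request_id, user, action}' agent.log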

Log Levels

Level   Description
debug   Detailed debugging information
info    Normal operational messages
warn    Warning conditions
error   Error conditions

Configuring Log Level

Set via environment variable:

export LOG_LEVEL=info  # debug, info, warn, error

CloudWatch Logs (AWS)

When deployed on AWS, logs are automatically sent to CloudWatch:

Log Groups:
/ecs/{STACK_NAME}/agent
/ecs/{STACK_NAME}/orchestrator
/ecs/{STACK_NAME}/customer-portal

View logs:

aws logs tail /ecs/YOUR_STACK/agent --follow --region YOUR_REGION

Search for errors:

aws logs filter-log-events \
  --log-group-name /ecs/YOUR_STACK/agent \
  --filter-pattern "ERROR" \
  --start-time $(date -d '1 hour ago' +%s000)
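
For more structured searches, CloudWatch Logs Insights can query the JSON fields directly. A sketch (the field names come from the log format above; the one-hour window is illustrative):

aws logs start-query \
  --log-group-name /ecs/YOUR_STACK/agent \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, request_id, action, decision | filter decision = "deny" | sort @timestamp desc | limit 20'

The command returns a query ID; fetch the results with aws logs get-query-results --query-id <ID>.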

Basic Alerting

Set up alerts for these conditions:

Condition                    Threshold                      Severity
Service unhealthy            Health check fails for 1 min   Critical
High error rate              > 1% of requests               Warning
High latency                 P99 > 100ms                    Warning
Database connection errors   Any                            Critical

Prometheus Alerting Rules

groups:
  - name: axonflow
    rules:
      - alert: AxonFlowDown
        expr: up{job="axonflow-agent"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "AxonFlow Agent is down"

      - alert: HighErrorRate
        expr: rate(axonflow_requests_total{status="error"}[5m]) / rate(axonflow_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"

      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(axonflow_request_duration_seconds_bucket[5m])) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency exceeds 100ms"

Docker Compose Monitoring Stack

For local development and testing, use this Docker Compose setup:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=7d'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  grafana-data:

prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'axonflow-agent'
    static_configs:
      - targets: ['host.docker.internal:8080']

  - job_name: 'axonflow-orchestrator'
    static_configs:
      - targets: ['host.docker.internal:8081']
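
Optionally, Grafana can pick up the Prometheus container automatically if you mount a data source provisioning file (a sketch; the file name grafana-datasources.yml is arbitrary, and the URL uses the prometheus service name from the Compose file above):

# grafana-datasources.yml, mounted into the grafana container
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

Add a matching volume entry to the grafana service, for example ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml; otherwise add the data source by hand in the Grafana UI.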

Start the stack:

docker-compose up -d

Access:

  • Prometheus - http://localhost:9090
  • Grafana - http://localhost:3000 (log in with admin / admin)
Useful Queries

Request Rate

rate(axonflow_requests_total[5m])

Error Rate

rate(axonflow_requests_total{status="error"}[5m]) / rate(axonflow_requests_total[5m])

P99 Latency

histogram_quantile(0.99, rate(axonflow_request_duration_seconds_bucket[5m]))

Policy Evaluation Time

histogram_quantile(0.95, rate(axonflow_policy_evaluation_duration_seconds_bucket[5m]))

Active Connections

axonflow_database_connections
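
Any of these expressions can also be run outside the Prometheus UI via its HTTP API, for example against the local Compose stack above (a sketch, assuming jq is installed):

curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(axonflow_requests_total[5m])' | jq '.data.result'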

Best Practices

1. Monitor Key Indicators

Focus on these metrics:

  • Availability - Health check success rate
  • Latency - P50, P95, P99 response times
  • Error rate - Percentage of failed requests
  • Throughput - Requests per second
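
These indicators map directly onto the PromQL queries shown earlier. If you want them precomputed for dashboards and alerts, a hedged Prometheus recording-rules sketch (the rule names follow the level:metric:operation convention and are illustrative, not shipped with AxonFlow):

groups:
  - name: axonflow-slis
    interval: 30s
    rules:
      - record: axonflow:request_error_ratio:rate5m
        expr: rate(axonflow_requests_total{status="error"}[5m]) / rate(axonflow_requests_total[5m])
      - record: axonflow:request_duration_seconds:p99_5m
        expr: histogram_quantile(0.99, rate(axonflow_request_duration_seconds_bucket[5m]))
      - record: axonflow:requests:rate5m
        expr: rate(axonflow_requests_total[5m])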

2. Set Appropriate Thresholds

Start with conservative thresholds and tune them against your measured baseline:

  • Measure normal operation for 1-2 weeks
  • Set warning thresholds at 2x normal
  • Set critical thresholds at 5x normal

3. Include Context in Alerts

Alert messages should include:

  • What is happening
  • What the impact is
  • Link to runbook or dashboard

4. Test Your Monitoring

Periodically verify:

  • Alerts fire correctly
  • Dashboards show accurate data
  • Log aggregation is working

Enterprise Monitoring

Enterprise deployments include pre-configured Grafana dashboards (Executive Summary, Cost Analytics), advanced alerting, and comprehensive audit logging with export capabilities. Contact sales for details.