Monitoring & Observability
AxonFlow provides built-in monitoring to help you understand system health, performance, and usage patterns. This guide covers the monitoring capabilities available in the Community edition.
Overview
AxonFlow exposes metrics and health endpoints that integrate with standard monitoring tools:
- Health Endpoints - Check service status
- Prometheus Metrics - Detailed performance data
- Structured Logs - Request and error tracking
Health Endpoints
Agent Health
curl https://YOUR_AGENT_ENDPOINT/health
Response:
{
  "status": "healthy",
  "version": "1.0.0",
  "uptime_seconds": 86400,
  "checks": {
    "database": "ok",
    "redis": "ok"
  }
}
Orchestrator Health
curl https://YOUR_AGENT_ENDPOINT/orchestrator/health
Response:
{
  "status": "healthy",
  "components": {
    "llm_router": true,
    "planning_engine": true
  }
}
Health Check Integration
Use these endpoints for:
- Load balancer health checks - ALB/NLB target health
- Kubernetes probes - Liveness and readiness
- Uptime monitoring - External monitoring services
Example: Kubernetes Probe Configuration
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
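For external uptime monitoring, the same endpoint can be polled by a small script or cron job. A minimal sketch in shell, assuming the response shape shown above and that jq is installed; the endpoint URL is a placeholder:
#!/usr/bin/env bash
# Minimal uptime probe: exits non-zero unless /health reports "healthy".
# AGENT_URL is a placeholder; point it at your deployment.
set -euo pipefail

AGENT_URL="${AGENT_URL:-https://YOUR_AGENT_ENDPOINT}"

status=$(curl -fsS --max-time 5 "${AGENT_URL}/health" | jq -r '.status')

if [ "$status" != "healthy" ]; then
  echo "Agent unhealthy: status=${status}" >&2
  exit 1
fi
echo "Agent healthy"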
Prometheus Metrics
AxonFlow exposes Prometheus-compatible metrics at the /metrics endpoint.
Enabling Metrics
Metrics are enabled by default. Access them at:
curl https://YOUR_AGENT_ENDPOINT/metrics
Key Metrics
Request Metrics
| Metric | Type | Description |
|---|---|---|
| axonflow_requests_total | Counter | Total requests processed |
| axonflow_request_duration_seconds | Histogram | Request latency distribution |
| axonflow_requests_in_flight | Gauge | Currently processing requests |
Policy Metrics
| Metric | Type | Description |
|---|---|---|
| axonflow_policy_evaluations_total | Counter | Policy evaluations performed |
| axonflow_policy_evaluation_duration_seconds | Histogram | Policy evaluation latency |
| axonflow_policy_decisions | Counter | Decisions by result (allow/deny) |
System Metrics
| Metric | Type | Description |
|---|---|---|
| axonflow_database_connections | Gauge | Active database connections |
| axonflow_goroutines | Gauge | Active goroutines |
| axonflow_memory_bytes | Gauge | Memory usage |
Prometheus Configuration
Add AxonFlow to your Prometheus scrape config:
scrape_configs:
  - job_name: 'axonflow-agent'
    static_configs:
      - targets: ['YOUR_AGENT_HOST:8080']
    metrics_path: /metrics
    scrape_interval: 15s

  - job_name: 'axonflow-orchestrator'
    static_configs:
      - targets: ['YOUR_ORCHESTRATOR_HOST:8081']
    metrics_path: /metrics
    scrape_interval: 15s
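If you run Prometheus through the Prometheus Operator on Kubernetes, the equivalent scrape can be declared as a ServiceMonitor. The sketch below makes assumptions about your labels and namespaces; adjust them to match your deployment:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: axonflow-agent
  namespace: monitoring          # assumption: namespace your Prometheus Operator watches
spec:
  selector:
    matchLabels:
      app: axonflow-agent        # assumption: label on the agent Service
  namespaceSelector:
    matchNames:
      - axonflow                 # assumption: namespace of the agent Service
  endpoints:
    - port: http                 # assumption: name of the Service port exposing 8080
      path: /metrics
      interval: 15s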
Logging
AxonFlow outputs structured JSON logs for easy parsing and analysis.
Log Format
{
  "level": "info",
  "timestamp": "2025-11-26T10:30:00Z",
  "message": "Request processed",
  "request_id": "req-abc123",
  "duration_ms": 8,
  "user": "user@example.com",
  "action": "mcp:salesforce:query",
  "decision": "allow"
}
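Because each line is a single JSON object, the logs can be sliced with standard tooling. A couple of jq sketches, assuming logs captured to a local file (agent.log is a placeholder name):
# Denied requests with their latency (field names from the example above)
jq -c 'select(.decision == "deny") | {timestamp, request_id, user, action, duration_ms}' agent.log

# Count error-level entries in the capture
jq -s '[.[] | select(.level == "error")] | length' agent.log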
Log Levels
| Level | Description |
|---|---|
| debug | Detailed debugging information |
| info | Normal operational messages |
| warn | Warning conditions |
| error | Error conditions |
Configuring Log Level
Set via environment variable:
export LOG_LEVEL=info # debug, info, warn, error
CloudWatch Logs (AWS)
When deployed on AWS, logs are automatically sent to CloudWatch:
Log Groups:
/ecs/{STACK_NAME}/agent
/ecs/{STACK_NAME}/orchestrator
/ecs/{STACK_NAME}/customer-portal
View logs:
aws logs tail /ecs/YOUR_STACK/agent --follow --region YOUR_REGION
Search for errors:
aws logs filter-log-events \
  --log-group-name /ecs/YOUR_STACK/agent \
  --filter-pattern "ERROR" \
  --start-time $(date -d '1 hour ago' +%s000)
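Since the log lines are structured JSON, CloudWatch Logs Insights can also query individual fields directly. A sample query, assuming the field names from the log format above:
fields @timestamp, request_id, user, action, duration_ms
| filter decision = "deny" or level = "error"
| sort @timestamp desc
| limit 50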
Basic Alerting
Recommended Alerts
Set up alerts for these conditions:
| Condition | Threshold | Severity |
|---|---|---|
| Service unhealthy | Health check fails for 1 min | Critical |
| High error rate | > 1% of requests | Warning |
| High latency | P99 > 100ms | Warning |
| Database connection errors | Any | Critical |
Prometheus Alerting Rules
groups:
  - name: axonflow
    rules:
      - alert: AxonFlowDown
        expr: up{job="axonflow-agent"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "AxonFlow Agent is down"

      - alert: HighErrorRate
        expr: rate(axonflow_requests_total{status="error"}[5m]) / rate(axonflow_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"

      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(axonflow_request_duration_seconds_bucket[5m])) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency exceeds 100ms"
Docker Compose Monitoring Stack
For local development and testing, use this Docker Compose setup:
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=7d'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  grafana-data:
prometheus.yml:
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'axonflow-agent'
    static_configs:
      - targets: ['host.docker.internal:8080']
  - job_name: 'axonflow-orchestrator'
    static_configs:
      - targets: ['host.docker.internal:8081']
Start the stack:
docker-compose up -d
Access:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin)
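To skip adding the data source by hand, Grafana can be provisioned with Prometheus at startup. Mount a file like the following into /etc/grafana/provisioning/datasources/ in the grafana service (the prometheus hostname resolves on the Compose network):
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true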
Useful Queries
Request Rate
rate(axonflow_requests_total[5m])
Error Rate
rate(axonflow_requests_total{status="error"}[5m]) / rate(axonflow_requests_total[5m])
P99 Latency
histogram_quantile(0.99, rate(axonflow_request_duration_seconds_bucket[5m]))
Policy Evaluation Time
histogram_quantile(0.95, rate(axonflow_policy_evaluation_duration_seconds_bucket[5m]))
Active Connections
axonflow_database_connections
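If dashboards or alerts evaluate these expressions frequently, Prometheus recording rules can precompute them at each evaluation interval. A sketch, using the conventional level:metric:operation naming:
groups:
  - name: axonflow-recording
    rules:
      - record: job:axonflow_request_error_ratio:rate5m
        expr: rate(axonflow_requests_total{status="error"}[5m]) / rate(axonflow_requests_total[5m])
      - record: job:axonflow_request_duration_seconds:p99_5m
        expr: histogram_quantile(0.99, rate(axonflow_request_duration_seconds_bucket[5m]))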
Best Practices
1. Monitor Key Indicators
Focus on these metrics:
- Availability - Health check success rate
- Latency - P50, P95, P99 response times
- Error rate - Percentage of failed requests
- Throughput - Requests per second
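For the availability indicator, scrape success is a reasonable first proxy; a sketch using the up series that Prometheus records for every target:
# Fraction of successful scrapes of the agent over the past day
avg_over_time(up{job="axonflow-agent"}[1d])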
2. Set Appropriate Thresholds
Start with conservative thresholds and tune based on baseline:
- Measure normal operation for 1-2 weeks
- Set warning thresholds at 2x normal
- Set critical thresholds at 5x normal
3. Include Context in Alerts
Alert messages should include:
- What is happening
- What the impact is
- Link to runbook or dashboard
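With the Prometheus rules above, that context lives in the annotations block. An extended version of the AxonFlowDown alert; the runbook and dashboard URLs are placeholders:
- alert: AxonFlowDown
  expr: up{job="axonflow-agent"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "AxonFlow Agent is down"
    description: "No successful scrape of {{ $labels.instance }} for over 1 minute; requests routed through this agent are likely failing."
    runbook_url: "https://wiki.example.com/runbooks/axonflow-down"   # placeholder
    dashboard: "https://grafana.example.com/d/axonflow-overview"     # placeholder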
4. Test Your Monitoring
Periodically verify:
- Alerts fire correctly
- Dashboards show accurate data
- Log aggregation is working
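Alert rules can also be exercised offline with promtool rather than waiting for a real outage. A unit-test sketch for the AxonFlowDown rule, assuming the rules above are saved as alerts.yml:
# tests.yml -- run with: promtool test rules tests.yml
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="axonflow-agent", instance="agent:8080"}'
        values: '0 0 0 0 0'
    alert_rule_test:
      - eval_time: 3m
        alertname: AxonFlowDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: axonflow-agent
              instance: agent:8080
            exp_annotations:
              summary: "AxonFlow Agent is down"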
Next Steps
- Token Usage & Cost Tracking - Monitor LLM token usage and costs
- Deployment Guide - Deploy with monitoring enabled
- Troubleshooting - Debug common issues
- Architecture Overview - Understand system components
Enterprise deployments include pre-configured Grafana dashboards (Executive Summary, Cost Analytics), advanced alerting, and comprehensive audit logging with export capabilities. Contact sales for details.