# Monitoring & Status Guide

LogWisp provides comprehensive monitoring capabilities through status endpoints, operational logs, and metrics.

## Status Endpoints

### Stream Status

Each stream exposes its own status endpoint:

```bash
# Standalone mode
curl http://localhost:8080/status

# Router mode
curl http://localhost:8080/streamname/status
```

Example response:

```json
{
  "service": "LogWisp",
  "version": "1.0.0",
  "server": {
    "type": "http",
    "port": 8080,
    "active_clients": 5,
    "buffer_size": 1000,
    "uptime_seconds": 3600,
    "mode": {
      "standalone": true,
      "router": false
    }
  },
  "monitor": {
    "active_watchers": 3,
    "total_entries": 152341,
    "dropped_entries": 12,
    "start_time": "2024-01-20T10:00:00Z",
    "last_entry_time": "2024-01-20T11:00:00Z"
  },
  "filters": {
    "filter_count": 2,
    "total_processed": 152341,
    "total_passed": 48234,
    "filters": [
      {
        "type": "include",
        "logic": "or",
        "pattern_count": 3,
        "total_processed": 152341,
        "total_matched": 48234,
        "total_dropped": 0
      }
    ]
  },
  "features": {
    "heartbeat": {
      "enabled": true,
      "interval": 30,
      "format": "comment"
    },
    "rate_limit": {
      "enabled": true,
      "total_requests": 8234,
      "blocked_requests": 89,
      "active_ips": 12,
      "total_connections": 5
    }
  }
}
```

### Global Status (Router Mode)

In router mode, a global status endpoint provides aggregated information:

```bash
curl http://localhost:8080/status
```

## Key Metrics

### Monitor Metrics

Track file watching performance:

| Metric | Description | Healthy Range |
|--------|-------------|---------------|
| `active_watchers` | Number of files being watched | 1-1000 |
| `total_entries` | Total log entries processed | Increasing |
| `dropped_entries` | Entries dropped due to buffer full | < 1% of total |
| `entries_per_second` | Current processing rate | Varies |

### Connection Metrics

Monitor client connections:

| Metric | Description | Warning Signs |
|--------|-------------|---------------|
| `active_clients` | Current SSE connections | Near limit |
| `tcp_connections` | Current TCP connections | Near limit |
| `total_connections` | All active connections | > 80% of max |

### Filter Metrics

Understand filtering effectiveness:

| Metric | Description | Optimization |
|--------|-------------|--------------|
| `total_processed` | Entries checked | - |
| `total_passed` | Entries that passed | Very low = too restrictive |
| `total_dropped` | Entries filtered out | Very high = review patterns |

### Rate Limit Metrics

Track rate limiting impact:

| Metric | Description | Action Needed |
|--------|-------------|---------------|
| `blocked_requests` | Rejected requests | High = increase limits |
| `active_ips` | Unique clients | High = scale out |
| `blocked_percentage` | Rejection rate | > 10% = review |
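To spot-check these metrics from a shell, you can pull them out of the status response with `jq`. A minimal sketch, assuming `jq` is installed and the field names match the example response above; note that `blocked_percentage` does not appear in the example response, so it is derived here from `blocked_requests` and `total_requests`:

```bash
#!/bin/bash
# metrics_snapshot.sh - one-shot summary of the key metrics tables above
STATUS=$(curl -s http://localhost:8080/status)

# Monitor metrics (field names taken from the example status response;
# "// 0" guards against missing fields)
WATCHERS=$(echo "$STATUS" | jq -r '.monitor.active_watchers // 0')
ENTRIES=$(echo "$STATUS"  | jq -r '.monitor.total_entries // 0')
DROPPED=$(echo "$STATUS"  | jq -r '.monitor.dropped_entries // 0')

# Rate limit metrics
REQUESTS=$(echo "$STATUS" | jq -r '.features.rate_limit.total_requests // 0')
BLOCKED=$(echo "$STATUS"  | jq -r '.features.rate_limit.blocked_requests // 0')

# Derived percentages (integer math; avoid division by zero)
DROP_PCT=0
[ "$ENTRIES" -gt 0 ] && DROP_PCT=$(( DROPPED * 100 / ENTRIES ))
BLOCK_PCT=0
[ "$REQUESTS" -gt 0 ] && BLOCK_PCT=$(( BLOCKED * 100 / REQUESTS ))

echo "watchers=$WATCHERS entries=$ENTRIES dropped=$DROPPED (${DROP_PCT}%)"
echo "requests=$REQUESTS blocked=$BLOCKED (${BLOCK_PCT}%)"
```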
## Operational Logging

### Log Levels

Configure LogWisp's operational logging:

```toml
[logging]
output = "both"  # file and stderr
level = "info"   # info for production
```

Log levels and their use:

- **DEBUG**: Detailed internal operations
- **INFO**: Normal operations, connections
- **WARN**: Recoverable issues
- **ERROR**: Errors requiring attention

### Important Log Messages

#### Startup Messages

```
LogWisp starting version=1.0.0 config_file=/etc/logwisp.toml
Stream registered with router stream=app
TCP endpoint configured transport=system port=9090
HTTP endpoints configured transport=app stream_url=http://localhost:8080/stream
```

#### Connection Events

```
HTTP client connected remote_addr=192.168.1.100:54231 active_clients=6
HTTP client disconnected remote_addr=192.168.1.100:54231 active_clients=5
TCP connection opened remote_addr=192.168.1.100:54232 active_connections=3
```

#### Error Conditions

```
Failed to open file for checking path=/var/log/app.log error=permission denied
Scanner error while reading file path=/var/log/huge.log error=token too long
Request rate limited ip=192.168.1.100
Connection limit exceeded ip=192.168.1.100 connections=5 limit=5
```

#### Performance Warnings

```
Dropped log entry - subscriber buffer full
Dropped entry for slow client remote_addr=192.168.1.100
Check interval too small: 5ms (min: 10ms)
```
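These warnings and errors are easiest to catch as they happen. A minimal sketch for following LogWisp's operational output and surfacing WARN/ERROR lines; the systemd unit name `logwisp` is an assumption, so adjust it (or the log file path) to match your deployment:

```bash
# Follow LogWisp's operational log and surface WARN/ERROR lines.
# Assumes a systemd unit named "logwisp" (deployment-specific);
# if logging to a file instead, substitute: tail -F /var/log/logwisp.log
journalctl -u logwisp -f | grep -E --line-buffered 'WARN|ERROR'
```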
## Health Checks

### Basic Health Check

Simple up/down check:

```bash
#!/bin/bash
# health_check.sh
STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/status)
if [ "$STATUS" -eq 200 ]; then
    echo "LogWisp is healthy"
    exit 0
else
    echo "LogWisp is unhealthy (status: $STATUS)"
    exit 1
fi
```

### Advanced Health Check

Check specific conditions:

```bash
#!/bin/bash
# advanced_health_check.sh
RESPONSE=$(curl -s http://localhost:8080/status)

# Check if processing logs ("// 0" guards against a missing field)
ENTRIES=$(echo "$RESPONSE" | jq -r '.monitor.total_entries // 0')
if [ "$ENTRIES" -eq 0 ]; then
    echo "WARNING: No log entries processed"
    exit 1
fi

# Check dropped entries (ENTRIES is non-zero here, checked above)
DROPPED=$(echo "$RESPONSE" | jq -r '.monitor.dropped_entries // 0')
DROP_PERCENT=$(( DROPPED * 100 / ENTRIES ))
if [ "$DROP_PERCENT" -gt 5 ]; then
    echo "WARNING: High drop rate: ${DROP_PERCENT}%"
    exit 1
fi

# Check connections
CONNECTIONS=$(echo "$RESPONSE" | jq -r '.server.active_clients')
echo "OK: Processing logs, $CONNECTIONS active clients"
exit 0
```

### Container Health Check

Docker/Kubernetes configuration:

```dockerfile
# Dockerfile
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD curl -f http://localhost:8080/status || exit 1
```

```yaml
# Kubernetes
livenessProbe:
  httpGet:
    path: /status
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /status
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
```

## Monitoring Integration

### Prometheus Metrics

Export metrics in Prometheus format:

```bash
#!/bin/bash
# prometheus_exporter.sh
while true; do
    STATUS=$(curl -s http://localhost:8080/status)

    # Extract metrics
    CLIENTS=$(echo "$STATUS" | jq -r '.server.active_clients // 0')
    ENTRIES=$(echo "$STATUS" | jq -r '.monitor.total_entries // 0')
    DROPPED=$(echo "$STATUS" | jq -r '.monitor.dropped_entries // 0')
    # blocked_requests backs the rate-limit panel in the Grafana section below
    BLOCKED=$(echo "$STATUS" | jq -r '.features.rate_limit.blocked_requests // 0')

    # Output Prometheus format
    cat << EOF
# HELP logwisp_active_clients Number of active streaming clients
# TYPE logwisp_active_clients gauge
logwisp_active_clients $CLIENTS
# HELP logwisp_total_entries Total log entries processed
# TYPE logwisp_total_entries counter
logwisp_total_entries $ENTRIES
# HELP logwisp_dropped_entries Total log entries dropped
# TYPE logwisp_dropped_entries counter
logwisp_dropped_entries $DROPPED
# HELP logwisp_blocked_requests Total requests blocked by rate limiting
# TYPE logwisp_blocked_requests counter
logwisp_blocked_requests $BLOCKED
EOF

    sleep 60
done
```

### Grafana Dashboard

Key panels for Grafana:

1. **Active Connections**
   - Query: `logwisp_active_clients`
   - Visualization: Graph
   - Alert: > 80% of max
2. **Log Processing Rate**
   - Query: `rate(logwisp_total_entries[5m])`
   - Visualization: Graph
   - Alert: < 1 entry/min
3. **Drop Rate**
   - Query: `rate(logwisp_dropped_entries[5m]) / rate(logwisp_total_entries[5m])`
   - Visualization: Gauge
   - Alert: > 5%
4. **Rate Limit Rejections**
   - Query: `rate(logwisp_blocked_requests[5m])`
   - Visualization: Graph
   - Alert: > 10/min

### Datadog Integration

Send custom metrics:

```bash
#!/bin/bash
# datadog_metrics.sh
while true; do
    STATUS=$(curl -s http://localhost:8080/status)

    # Send metrics to the local DogStatsD agent over UDP
    echo "$STATUS" | jq -r '
        "logwisp.connections:\(.server.active_clients)|g",
        "logwisp.entries:\(.monitor.total_entries)|c",
        "logwisp.dropped:\(.monitor.dropped_entries)|c"
    ' | while read -r metric; do
        echo "$metric" | nc -u -w1 localhost 8125
    done

    sleep 60
done
```

## Performance Monitoring

### CPU Usage

Monitor CPU usage by component:

```bash
# Check process CPU
top -p $(pgrep logwisp) -b -n 1

# Profile CPU usage
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
```

Common CPU consumers:

- File watching (increase `check_interval_ms` to check less often)
- Regex filtering (simplify patterns)
- JSON encoding (reduce clients)

### Memory Usage

Track memory consumption:

```bash
# Check process memory
ps aux | grep logwisp

# Detailed memory stats
cat /proc/$(pgrep logwisp)/status | grep -E "Vm(RSS|Size)"
```

Memory optimization:

- Reduce buffer sizes
- Limit connections
- Simplify filters

### Network Bandwidth

Monitor streaming bandwidth:

```bash
# Network statistics
netstat -i
iftop -i eth0 -f "port 8080"

# Connection count
ss -tan | grep :8080 | wc -l
```
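Throughput can also be gauged from LogWisp's own counters. The Monitor Metrics table lists `entries_per_second`, which does not appear in the example status response; if your build does not expose it directly, a rough rate can be estimated by sampling `total_entries` twice. A minimal sketch, assuming the field names from the example response earlier in this guide:

```bash
#!/bin/bash
# entry_rate.sh - estimate entries/sec from two samples of total_entries
INTERVAL=10
FIRST=$(curl -s http://localhost:8080/status | jq -r '.monitor.total_entries // 0')
sleep "$INTERVAL"
SECOND=$(curl -s http://localhost:8080/status | jq -r '.monitor.total_entries // 0')

# Integer division: rates below 1 entry/sec round down to 0
echo "entries/sec: $(( (SECOND - FIRST) / INTERVAL ))"
```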
## Alerting

### Basic Alerts

Essential alerts to configure:

| Alert | Condition | Severity |
|-------|-----------|----------|
| Service Down | Status endpoint fails | Critical |
| High Drop Rate | > 10% entries dropped | Warning |
| No Log Activity | 0 entries/min for 5 min | Warning |
| Connection Limit | > 90% of max connections | Warning |
| Rate Limit High | > 20% requests blocked | Warning |

### Alert Script

Example monitoring script:

```bash
#!/bin/bash
# monitor_alerts.sh

check_alert() {
    local name=$1
    local condition=$2
    local message=$3

    if eval "$condition"; then
        echo "ALERT: $name - $message"
        # Send to alerting system
        # curl -X POST https://alerts.example.com/...
    fi
}

while true; do
    STATUS=$(curl -s http://localhost:8080/status)

    if [ -z "$STATUS" ]; then
        check_alert "SERVICE_DOWN" "true" "LogWisp not responding"
        sleep 60
        continue
    fi

    # Extract metrics ("// 0" guards against missing fields)
    DROPPED=$(echo "$STATUS" | jq -r '.monitor.dropped_entries // 0')
    TOTAL=$(echo "$STATUS" | jq -r '.monitor.total_entries // 0')
    CLIENTS=$(echo "$STATUS" | jq -r '.server.active_clients // 0')

    # Guard against division by zero before any entries are processed
    [ "$TOTAL" -eq 0 ] && TOTAL=1

    # Check conditions
    check_alert "HIGH_DROP_RATE" \
        "[ $((DROPPED * 100 / TOTAL)) -gt 10 ]" \
        "Drop rate above 10%"

    check_alert "HIGH_CONNECTIONS" \
        "[ $CLIENTS -gt 90 ]" \
        "Near connection limit: $CLIENTS/100"

    sleep 60
done
```

## Troubleshooting with Monitoring

### No Logs Appearing

Check monitor stats:

```bash
curl -s http://localhost:8080/status | jq '.monitor'
```

Look for:

- `active_watchers` = 0 (no files found)
- `total_entries` not increasing (files not updating)

### High CPU Usage

Enable debug logging:

```bash
logwisp --log-level debug --log-output stderr
```

Watch for:

- Frequent "checkFile" messages (increase `check_interval_ms`)
- Many filter operations (optimize patterns)

### Memory Growth

Monitor over time:

```bash
while true; do
    ps aux | grep logwisp | grep -v grep
    curl -s http://localhost:8080/status | jq '.server.active_clients'
    sleep 10
done
```

### Connection Issues

Check connection stats:

```bash
# Current connections
curl -s http://localhost:8080/status | jq '.server'

# Rate limit stats
curl -s http://localhost:8080/status | jq '.features.rate_limit'
```

## Best Practices

1. **Regular Monitoring**: Check status endpoints every 30-60 seconds
2. **Set Alerts**: Configure alerts for critical conditions
3. **Log Rotation**: Rotate LogWisp's own logs to prevent disk fill
4. **Baseline Metrics**: Establish normal ranges for your environment
5. **Capacity Planning**: Monitor trends for scaling decisions
6. **Test Monitoring**: Verify alerts work before issues occur

## See Also

- [Performance Tuning](performance.md) - Optimization guide
- [Troubleshooting](troubleshooting.md) - Common issues
- [Configuration Guide](configuration.md) - Monitoring configuration
- [Integration Examples](integrations.md) - Monitoring system integration