# Monitoring & Status Guide

LogWisp reports its runtime state through status endpoints, operational logs, and metrics.

## Status Endpoints

### Stream Status

Each stream exposes its own status endpoint:

```bash
# Standalone mode
curl http://localhost:8080/status

# Router mode
curl http://localhost:8080/streamname/status
```

Example response:
```json
{
  "service": "LogWisp",
  "version": "1.0.0",
  "server": {
    "type": "http",
    "port": 8080,
    "active_clients": 5,
    "buffer_size": 1000,
    "uptime_seconds": 3600,
    "mode": {
      "standalone": true,
      "router": false
    }
  },
  "monitor": {
    "active_watchers": 3,
    "total_entries": 152341,
    "dropped_entries": 12,
    "start_time": "2024-01-20T10:00:00Z",
    "last_entry_time": "2024-01-20T11:00:00Z"
  },
  "filters": {
    "filter_count": 2,
    "total_processed": 152341,
    "total_passed": 48234,
    "filters": [
      {
        "type": "include",
        "logic": "or",
        "pattern_count": 3,
        "total_processed": 152341,
        "total_matched": 48234,
        "total_dropped": 0
      }
    ]
  },
  "features": {
    "heartbeat": {
      "enabled": true,
      "interval": 30,
      "format": "comment"
    },
    "rate_limit": {
      "enabled": true,
      "total_requests": 8234,
      "blocked_requests": 89,
      "active_ips": 12,
      "total_connections": 5
    }
  }
}
```

### Global Status (Router Mode)

In router mode, a global status endpoint provides aggregated information:

```bash
curl http://localhost:8080/status
```

## Key Metrics

### Monitor Metrics

Track file watching performance:

| Metric | Description | Healthy Range |
|--------|-------------|---------------|
| `active_watchers` | Number of files being watched | 1-1000 |
| `total_entries` | Total log entries processed | Increasing |
| `dropped_entries` | Entries dropped due to buffer full | < 1% of total |
| `entries_per_second` | Current processing rate | Varies |

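The example status response shown earlier does not include `entries_per_second`, so a processing rate may have to be derived from two samples of `total_entries`. The following is a minimal sketch under that assumption; `entries_rate` is a hypothetical helper name, not part of LogWisp.

```bash
# Sketch: derive an entries-per-second rate from two total_entries samples.
# entries_rate is a hypothetical helper, not a LogWisp command.

# entries_rate PREV_TOTAL CURR_TOTAL INTERVAL_SECONDS
# Prints the rate with two decimals. If the counter went backwards
# (for example after a restart), the current value is used as the delta.
entries_rate() {
    local prev=$1 curr=$2 interval=$3
    awk -v p="$prev" -v c="$curr" -v s="$interval" 'BEGIN {
        if (s <= 0) { print "0.00"; exit }
        d = (c >= p) ? c - p : c
        printf "%.2f\n", d / s
    }'
}
```

To use it, read `.monitor.total_entries` twice (for instance 60 seconds apart) and pass both values together with the interval, as in `entries_rate "$PREV" "$CURR" 60`.
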
### Connection Metrics

Monitor client connections:

| Metric | Description | Warning Signs |
|--------|-------------|---------------|
| `active_clients` | Current SSE connections | Near limit |
| `tcp_connections` | Current TCP connections | Near limit |
| `total_connections` | All active connections | > 80% of max |

### Filter Metrics

Understand filtering effectiveness:

| Metric | Description | Optimization |
|--------|-------------|--------------|
| `total_processed` | Entries checked | - |
| `total_passed` | Entries that passed | Very low = too restrictive |
| `total_dropped` | Entries filtered out | Very high = review patterns |

### Rate Limit Metrics

Track rate limiting impact:

| Metric | Description | Action Needed |
|--------|-------------|---------------|
| `blocked_requests` | Rejected requests | High = increase limits |
| `active_ips` | Unique clients | High = scale out |
| `blocked_percentage` | Rejection rate | > 10% = review |

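The example status response exposes `blocked_requests` and `total_requests` but no `blocked_percentage` field, so the rejection rate may need to be computed by the caller. A minimal sketch, assuming those two counters are available; `percent` is a hypothetical helper name.

```bash
# Sketch: compute a percentage from two counters taken from the status JSON.
# percent is a hypothetical helper, not a LogWisp command.

# percent PART TOTAL
# Prints PART as a percentage of TOTAL with two decimals; prints 0.00 when
# TOTAL is zero so callers never divide by zero.
percent() {
    awk -v part="$1" -v total="$2" 'BEGIN {
        if (total <= 0) { print "0.00"; exit }
        printf "%.2f\n", part * 100 / total
    }'
}
```

With the numbers from the example response, `percent 89 8234` prints `1.08`, well under the 10% review threshold. The same helper works for the filter pass rate: `percent 48234 152341` prints `31.66`.
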
## Operational Logging

### Log Levels

Configure LogWisp's operational logging:

```toml
[logging]
output = "both"  # file and stderr
level = "info"   # info for production
```

Log levels and their use:
- **DEBUG**: Detailed internal operations
- **INFO**: Normal operations, connections
- **WARN**: Recoverable issues
- **ERROR**: Errors requiring attention

### Important Log Messages

#### Startup Messages
```
LogWisp starting version=1.0.0 config_file=/etc/logwisp.toml
Stream registered with router stream=app
TCP endpoint configured transport=system port=9090
HTTP endpoints configured transport=app stream_url=http://localhost:8080/stream
```

#### Connection Events
```
HTTP client connected remote_addr=192.168.1.100:54231 active_clients=6
HTTP client disconnected remote_addr=192.168.1.100:54231 active_clients=5
TCP connection opened remote_addr=192.168.1.100:54232 active_connections=3
```

#### Error Conditions
```
Failed to open file for checking path=/var/log/app.log error=permission denied
Scanner error while reading file path=/var/log/huge.log error=token too long
Request rate limited ip=192.168.1.100
Connection limit exceeded ip=192.168.1.100 connections=5 limit=5
```

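Because these messages carry `key=value` fields, they can be summarized with standard text tools. The sketch below counts rate-limit events per client; it assumes log lines contain the message text and `ip=` field exactly as shown above (any timestamp or level prefix is ignored), and `rate_limited_ips` is a hypothetical helper name.

```bash
# Sketch: summarize "Request rate limited" events per client IP.
# rate_limited_ips is a hypothetical helper, not a LogWisp command.

# rate_limited_ips
# Reads operational log lines on stdin and prints "<count> <ip>" for every
# client that triggered "Request rate limited", busiest first.
rate_limited_ips() {
    grep 'Request rate limited' \
        | sed -n 's/.*ip=\([^ ]*\).*/\1/p' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}
```

Feed it the operational log on stdin, for example `rate_limited_ips < logwisp.log`, where the file name is a placeholder for wherever your `[logging]` configuration writes.
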
#### Performance Warnings
```
Dropped log entry - subscriber buffer full
Dropped entry for slow client remote_addr=192.168.1.100
Check interval too small: 5ms (min: 10ms)
```

## Health Checks

### Basic Health Check

Simple up/down check:

```bash
#!/bin/bash
# health_check.sh

# --max-time keeps the check from hanging if LogWisp stops responding
STATUS=$(curl -s -o /dev/null --max-time 5 -w "%{http_code}" http://localhost:8080/status)

if [ "$STATUS" -eq 200 ]; then
    echo "LogWisp is healthy"
    exit 0
else
    echo "LogWisp is unhealthy (status: $STATUS)"
    exit 1
fi
```

### Advanced Health Check

Check specific conditions:

```bash
#!/bin/bash
# advanced_health_check.sh

# Fail early if the endpoint is unreachable or returns an HTTP error
if ! RESPONSE=$(curl -sf --max-time 5 http://localhost:8080/status); then
    echo "CRITICAL: Status endpoint not responding"
    exit 1
fi

# Check if processing logs ("// 0" substitutes 0 for a missing field)
TOTAL=$(echo "$RESPONSE" | jq -r '.monitor.total_entries // 0')
if [ "$TOTAL" -eq 0 ]; then
    echo "WARNING: No log entries processed"
    exit 1
fi

# Check dropped entries (integer percentage; TOTAL is non-zero here)
DROPPED=$(echo "$RESPONSE" | jq -r '.monitor.dropped_entries // 0')
DROP_PERCENT=$(( DROPPED * 100 / TOTAL ))

if [ "$DROP_PERCENT" -gt 5 ]; then
    echo "WARNING: High drop rate: ${DROP_PERCENT}%"
    exit 1
fi

# Check connections
CONNECTIONS=$(echo "$RESPONSE" | jq -r '.server.active_clients')
echo "OK: Processing logs, $CONNECTIONS active clients"
exit 0
```

### Container Health Check

Docker/Kubernetes configuration:

```dockerfile
# Dockerfile
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
    CMD curl -f http://localhost:8080/status || exit 1
```

```yaml
# Kubernetes
livenessProbe:
  httpGet:
    path: /status
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30

readinessProbe:
  httpGet:
    path: /status
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
```

## Monitoring Integration

### Prometheus Metrics

Export metrics in Prometheus format:

```bash
#!/bin/bash
# prometheus_exporter.sh

while true; do
    STATUS=$(curl -s http://localhost:8080/status)

    # Extract metrics
    CLIENTS=$(echo "$STATUS" | jq -r '.server.active_clients')
    ENTRIES=$(echo "$STATUS" | jq -r '.monitor.total_entries')
    DROPPED=$(echo "$STATUS" | jq -r '.monitor.dropped_entries')
    BLOCKED=$(echo "$STATUS" | jq -r '.features.rate_limit.blocked_requests')

    # Output Prometheus format
    cat << EOF
# HELP logwisp_active_clients Number of active streaming clients
# TYPE logwisp_active_clients gauge
logwisp_active_clients $CLIENTS

# HELP logwisp_total_entries Total log entries processed
# TYPE logwisp_total_entries counter
logwisp_total_entries $ENTRIES

# HELP logwisp_dropped_entries Total log entries dropped
# TYPE logwisp_dropped_entries counter
logwisp_dropped_entries $DROPPED

# HELP logwisp_blocked_requests Total requests rejected by rate limiting
# TYPE logwisp_blocked_requests counter
logwisp_blocked_requests $BLOCKED
EOF

    sleep 60
done
```

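The exporter script prints to standard output, which Prometheus cannot scrape directly. One common option is node_exporter's textfile collector, which reads `*.prom` files from the directory given by `--collector.textfile.directory`. The sketch below publishes exposition text to such a file atomically; the helper name `write_prom_file` and any target path are assumptions for illustration.

```bash
# Sketch: publish Prometheus exposition text to a file for a textfile
# collector. write_prom_file is a hypothetical helper name.

# write_prom_file TARGET
# Reads exposition text on stdin and publishes it to TARGET. The text is
# written to a temporary file in the same directory and then renamed, so a
# scraper never sees a half-written file.
write_prom_file() {
    local target=$1 tmp
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    # mktemp creates the file with mode 0600; make it readable by the collector
    if cat > "$tmp" && chmod 644 "$tmp" && mv -f "$tmp" "$target"; then
        return 0
    fi
    rm -f "$tmp"
    return 1
}
```

In the exporter loop, the heredoc output would be piped into the helper instead of printed, with the target set to a file inside the collector directory you configured.
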
### Grafana Dashboard

Key panels for Grafana:

1. **Active Connections**
   - Query: `logwisp_active_clients`
   - Visualization: Graph
   - Alert: > 80% of max

2. **Log Processing Rate**
   - Query: `rate(logwisp_total_entries[5m])`
   - Visualization: Graph
   - Alert: < 1 entry/min (`rate()` returns a per-second value, so roughly `< 0.017`)

3. **Drop Rate**
   - Query: `rate(logwisp_dropped_entries[5m]) / rate(logwisp_total_entries[5m])`
   - Visualization: Gauge
   - Alert: > 5% (the query returns a ratio, so `> 0.05`)

4. **Rate Limit Rejections**
   - Query: `rate(logwisp_blocked_requests[5m])`
   - Visualization: Graph
   - Alert: > 10/min (roughly `> 0.17` per second)

### Datadog Integration

Send custom metrics:

```bash
#!/bin/bash
# datadog_metrics.sh

while true; do
    STATUS=$(curl -s http://localhost:8080/status)

    # Send metrics to Datadog (DogStatsD over UDP).
    # The totals are cumulative, so they are sent as gauges (|g); a StatsD
    # counter (|c) would add the full total again on every iteration.
    echo "$STATUS" | jq -r '
        "logwisp.connections:\(.server.active_clients)|g",
        "logwisp.entries:\(.monitor.total_entries)|g",
        "logwisp.dropped:\(.monitor.dropped_entries)|g"
    ' | while read -r metric; do
        echo "$metric" | nc -u -w1 localhost 8125
    done

    sleep 60
done
```

## Performance Monitoring

### CPU Usage

Monitor CPU usage by component:

```bash
# Check process CPU (-o picks the oldest matching process if several exist)
top -b -n 1 -p "$(pgrep -o logwisp)"

# Profile CPU usage (quote the URL so the shell does not interpret "?")
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"
```

Common CPU consumers:
- File watching (increase `check_interval_ms` to poll less often)
- Regex filtering (simplify patterns)
- JSON encoding (reduce clients)

### Memory Usage

Track memory consumption:

```bash
# Check process memory (the [l] keeps grep from matching itself)
ps aux | grep '[l]ogwisp'

# Detailed memory stats
grep -E "Vm(RSS|Size)" "/proc/$(pgrep -o logwisp)/status"
```

Memory optimization:
- Reduce buffer sizes
- Limit connections
- Simplify filters

### Network Bandwidth

Monitor streaming bandwidth:

```bash
# Network statistics
netstat -i
iftop -i eth0 -f "port 8080"

# Connection count
ss -tan | grep -c ':8080'
```

## Alerting

### Basic Alerts

Essential alerts to configure:

| Alert | Condition | Severity |
|-------|-----------|----------|
| Service Down | Status endpoint fails | Critical |
| High Drop Rate | > 10% entries dropped | Warning |
| No Log Activity | 0 entries/min for 5 min | Warning |
| Connection Limit | > 90% of max connections | Warning |
| Rate Limit High | > 20% requests blocked | Warning |

### Alert Script

Example monitoring script:

```bash
#!/bin/bash
# monitor_alerts.sh

MAX_CLIENTS=100  # assumed connection limit; match your configuration

check_alert() {
    local name=$1
    local condition=$2
    local message=$3

    if eval "$condition"; then
        echo "ALERT: $name - $message"
        # Send to alerting system
        # curl -X POST https://alerts.example.com/...
    fi
}

while true; do
    STATUS=$(curl -s http://localhost:8080/status)

    if [ -z "$STATUS" ]; then
        check_alert "SERVICE_DOWN" "true" "LogWisp not responding"
        sleep 60
        continue
    fi

    # Extract metrics ("// 0" substitutes 0 for a missing field)
    DROPPED=$(echo "$STATUS" | jq -r '.monitor.dropped_entries // 0')
    TOTAL=$(echo "$STATUS" | jq -r '.monitor.total_entries // 0')
    CLIENTS=$(echo "$STATUS" | jq -r '.server.active_clients // 0')

    # Check conditions (skip the drop-rate check until entries exist,
    # which avoids dividing by zero)
    if [ "$TOTAL" -gt 0 ]; then
        check_alert "HIGH_DROP_RATE" \
            "[ $((DROPPED * 100 / TOTAL)) -gt 10 ]" \
            "Drop rate above 10%"
    fi

    check_alert "HIGH_CONNECTIONS" \
        "[ $CLIENTS -gt $((MAX_CLIENTS * 90 / 100)) ]" \
        "Near connection limit: $CLIENTS/$MAX_CLIENTS"

    sleep 60
done
```

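The "No Log Activity" alert from the table needs state across iterations, because a single unchanged sample is not yet an outage. A minimal sketch of that bookkeeping, assuming one sample of `total_entries` per loop iteration; `next_stall_count` and `is_stalled` are hypothetical helper names.

```bash
# Sketch: track how many consecutive samples showed no new log entries.
# Both helpers are hypothetical names, not part of LogWisp.

# next_stall_count PREV_TOTAL CURR_TOTAL STALL_COUNT
# Prints the updated number of consecutive samples without new entries:
# 0 as soon as total_entries changes, STALL_COUNT + 1 otherwise.
next_stall_count() {
    local prev=$1 curr=$2 count=$3
    if [ "$curr" -ne "$prev" ]; then
        echo 0
    else
        echo $(( count + 1 ))
    fi
}

# is_stalled STALL_COUNT THRESHOLD
# Succeeds once the stall has lasted THRESHOLD consecutive samples.
is_stalled() {
    [ "$1" -ge "$2" ]
}
```

With a 60-second sampling interval, a threshold of 5 corresponds to the "0 entries/min for 5 min" condition: keep the previous total and the stall count in variables between iterations, and raise the alert when `is_stalled "$STALLS" 5` succeeds.
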
## Troubleshooting with Monitoring

### No Logs Appearing

Check monitor stats:
```bash
curl -s http://localhost:8080/status | jq '.monitor'
```

Look for:
- `active_watchers` = 0 (no files found)
- `total_entries` not increasing (files not updating)

### High CPU Usage

Enable debug logging:
```bash
logwisp --log-level debug --log-output stderr
```

Watch for:
- Frequent "checkFile" messages (increase `check_interval_ms` so files are polled less often)
- Many filter operations (optimize patterns)

### Memory Growth

Monitor over time:
```bash
while true; do
    ps aux | grep '[l]ogwisp'
    curl -s http://localhost:8080/status | jq '.server.active_clients'
    sleep 10
done
```

### Connection Issues

Check connection stats:
```bash
# Current connections
curl -s http://localhost:8080/status | jq '.server'

# Rate limit stats
curl -s http://localhost:8080/status | jq '.features.rate_limit'
```

## Best Practices

1. **Regular Monitoring**: Check status endpoints every 30-60 seconds
2. **Set Alerts**: Configure alerts for critical conditions
3. **Log Rotation**: Rotate LogWisp's own logs to prevent disk fill
4. **Baseline Metrics**: Establish normal ranges for your environment
5. **Capacity Planning**: Monitor trends for scaling decisions
6. **Test Monitoring**: Verify alerts work before issues occur

## See Also

- [Performance Tuning](performance.md) - Optimization guide
- [Troubleshooting](troubleshooting.md) - Common issues
- [Configuration Guide](configuration.md) - Monitoring configuration
- [Integration Examples](integrations.md) - Monitoring system integration