Troubleshooting
Solutions for the most common issues encountered when running RosFit, from connection failures and error codes to performance optimization.
Connection issues
Robot not appearing in the dashboard
Symptoms: Device is registered but shows provisioning state and never transitions to online.
Checklist:
- Verify the bridge is running on the robot:
rosfit connect --log-level debug
- Check MQTT connectivity from the robot:
mosquitto_pub -h <broker_host> -p 1883 -u <device_id> -P <mqtt_token> -t "test" -m "hello"
- Verify that the device_id in rosfit-bridge.yaml matches the registered device ID exactly.
- Check the broker logs for authentication failures:
docker compose logs emqx | grep "<device_id>"
- Confirm network connectivity between the robot and the broker host. Firewalls must allow TCP on port 1883 (or 8883 for TLS).
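When mosquitto_pub or telnet is not installed on the robot, the reachability check in the last step can be approximated with a few lines of Python. This is a minimal sketch using only the standard library; the function name broker_reachable is illustrative and not part of RosFit. It tests only that the TCP port accepts connections and says nothing about MQTT authentication.

```python
import socket


def broker_reachable(host: str, port: int = 1883, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Call it with the broker host from your bridge config, using port 8883 when TLS is enabled. A False result points to a firewall, routing, or DNS problem rather than a credentials problem.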
ESP32 MQTT connection failing
Symptoms: ESP32 connects to Wi-Fi but fails to establish MQTT connection.
Common causes and fixes:
| Cause | Fix |
|---|---|
| Wrong broker IP/port | Verify with ping and telnet <host> 1883 from another device |
| Token expired | Generate a new token: rosfit devices add --name ... or POST /auth/devices/token |
| TLS without CA cert | Flash the CA certificate to SPIFFS/LittleFS and load it in the MQTT client |
| Max packet size exceeded | Reduce payload size or increase EMQX mqtt.max_packet_size |
| Client ID collision | Ensure each ESP32 uses a unique client ID (use device_id) |
micro-ROS agent not discovering device
Symptoms: The micro-ROS agent is running but the ESP32 node does not appear in ros2 node list.
Fixes:
- Check serial connection:
ros2 run micro_ros_agent micro_ros_agent serial --dev /dev/ttyUSB0 -v6
- Verify that the baud rate matches between the ESP32 firmware and the agent (default 115200).
- Check the DDS domain ID: the agent and all other ROS 2 nodes must use the same ROS_DOMAIN_ID.
- Reset the ESP32 after the agent starts: the handshake must occur during the agent's discovery window.
High latency on telemetry
Symptoms: Dashboard telemetry lags several seconds behind real-time.
Diagnosis:
# Check broker message backlog
docker compose exec emqx emqx_ctl metrics | grep messages
Fixes:
| Symptom | Cause | Fix |
|---|---|---|
| Consistent 2–5s delay | QoS 2 overhead | Downgrade to QoS 0 or 1 for telemetry |
| Increasing lag over time | Consumer can't keep up | Scale the message handler or reduce throttle_hz |
| Spiky latency | Network congestion | Check bandwidth; use throttle_hz to cap publish rate |
| High latency on one topic | Large message size (e.g. point clouds) | Reduce resolution or lower throttle_hz for that topic |
WebSocket connection dropped
Symptoms: Dashboard loses real-time updates and shows "Reconnecting..." banner.
Fixes:
- Check proxy configuration — nginx/Caddy must be configured for WebSocket upgrade:
location /ws {
proxy_pass http://api:8000/ws;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_read_timeout 86400;
}
- Increase idle timeout: load balancers and proxies often close idle WebSocket connections after 60s. Set proxy_read_timeout to 86400 (24h) or implement ping/pong frames.
- Check API memory: a large number of simultaneous WebSocket connections can exhaust memory. Monitor with docker stats.
Common errors
MQTT_CONNECTION_REFUSED
Error: MQTT_CONNECTION_REFUSED (rc=5) or Connection refused: not authorized
Cause: Invalid credentials or the device is not registered.
Fix:
- Verify the MQTT token:
rosfit devices info <device_id>
- Re-generate credentials:
POST /auth/devices/token
- Check that the EMQX auth backend is running:
docker compose logs emqx | grep auth
ROS_BRIDGE_TIMEOUT
Error: ROS_BRIDGE_TIMEOUT: No message received on /odom for 30s
Cause: The ROS 2 node publishing on the configured topic is not running, or the topic name is wrong.
Fix:
- Verify the topic exists:
ros2 topic list | grep odom
- Check the message type matches:
ros2 topic info /odom
- Verify QoS compatibility between the publisher and the bridge subscriber
DEVICE_SHADOW_CONFLICT
Error: DEVICE_SHADOW_CONFLICT: Version mismatch (expected 42, got 41)
Cause: Two writers tried to update the shadow simultaneously, creating a version conflict.
Fix: This is a transient error. The bridge automatically retries with the latest version. If it persists:
- Read the current shadow:
GET /shadows/{device_id}
- Retry the update with the correct version number
- Reduce the number of concurrent shadow writers
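The read-then-retry steps above follow the usual optimistic-concurrency pattern. The sketch below illustrates it with a hypothetical in-memory ShadowStore standing in for the shadow service; the class, method names, and retry count are assumptions for illustration, not the RosFit API.

```python
from dataclasses import dataclass, field


class VersionConflict(Exception):
    """Raised when an update carries a stale version number."""


@dataclass
class ShadowStore:
    """Hypothetical in-memory stand-in for the shadow service."""
    version: int = 0
    state: dict = field(default_factory=dict)

    def read(self):
        # Return the current version together with a copy of the state.
        return self.version, dict(self.state)

    def update(self, expected_version: int, patch: dict) -> int:
        # Reject the write if another writer has bumped the version since the read.
        if expected_version != self.version:
            raise VersionConflict(
                f"expected {self.version}, got {expected_version}"
            )
        self.state.update(patch)
        self.version += 1
        return self.version


def update_with_retry(store, patch: dict, max_attempts: int = 3) -> int:
    """Re-read the latest version before every attempt; return the new version."""
    for _ in range(max_attempts):
        version, _state = store.read()
        try:
            return store.update(version, patch)
        except VersionConflict:
            continue  # someone else wrote first; read again and retry
    raise VersionConflict(f"gave up after {max_attempts} attempts")
```

Bounding the number of attempts matters: if the conflict persists after a few retries, there are probably too many concurrent writers, and retrying harder only adds load.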
CERTIFICATE_EXPIRED
Error: CERTIFICATE_EXPIRED: TLS handshake failed — certificate has expired
Cause: The device or broker TLS certificate has passed its expiry date.
Fix:
- Check certificate expiry:
openssl x509 -in cert.crt -noout -dates
- Regenerate certificates:
rosfit certs generate --cn <device_id> --days 730
- Provision the new certificate:
rosfit certs provision <device_id> --deploy
- For the broker certificate, update and restart EMQX
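To catch expiry before it causes handshake failures, the notAfter line printed by the openssl command above can be turned into a days-remaining figure. This is a minimal sketch using the standard library's ssl.cert_time_to_seconds; the function name days_until_expiry and the idea of alerting on a threshold are assumptions, not RosFit features.

```python
import ssl


def days_until_expiry(not_after_line: str, now_epoch: float) -> float:
    """Days left before the notAfter date printed by `openssl x509 -noout -dates`.

    Accepts a line such as 'notAfter=Jan  5 09:34:43 2018 GMT'.
    A negative result means the certificate has already expired.
    """
    line = not_after_line.strip()
    # Drop the 'notAfter=' label if present; keep the bare date otherwise.
    cert_time = line.partition("=")[2] if "=" in line else line
    expires_epoch = ssl.cert_time_to_seconds(cert_time)
    return (expires_epoch - now_epoch) / 86400
```

Pass time.time() as now_epoch in real use. Running this from a scheduled job and warning when the result drops below a chosen threshold (for example 30 days) leaves time to regenerate and provision certificates before devices disconnect.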
RATE_LIMIT_EXCEEDED
Error: RATE_LIMIT_EXCEEDED: 429 Too Many Requests
Cause: The device or client is sending requests faster than the configured rate limit.
Fix:
- Add throttle_hz to high-frequency topics in the bridge config
- Increase the rate limit in API settings (default: 100 req/s per device, 1000 req/s per user)
- Use WebSocket for high-frequency data instead of REST polling
Performance tuning
MQTT keep-alive
The default keep-alive interval is 60 seconds. For battery-powered devices, increase this to reduce wake-ups. For latency-sensitive robots, decrease it for faster offline detection.
| Use Case | Keep-Alive | Offline Detection |
|---|---|---|
| Real-time robot | 15s | ~22s (1.5× keep-alive) |
| Default | 60s | ~90s |
| Battery-powered sensor | 300s | ~450s |
Configure in the bridge:
connection:
keep_alive_sec: 15
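The offline-detection column in the table is the keep-alive multiplied by 1.5, which is the grace period the MQTT specification gives a broker before it drops a silent client. The sketch below reproduces that arithmetic and inverts it to pick a keep-alive for a target detection time; both function names are illustrative.

```python
def offline_detection_sec(keep_alive_sec: float, factor: float = 1.5) -> float:
    """Approximate time until the broker treats a silent client as offline."""
    return keep_alive_sec * factor


def keep_alive_for_target(detection_sec: float, factor: float = 1.5) -> int:
    """Largest whole-second keep-alive that still detects an offline
    device within detection_sec."""
    return int(detection_sec // factor)
```

For example, offline_detection_sec(15) gives 22.5, matching the first table row, and a requirement to detect offline robots within 30 seconds corresponds to keep_alive_for_target(30), which is 20.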
Telemetry rate optimization
Publishing telemetry too frequently wastes bandwidth and storage; publishing too infrequently misses events. Use throttle_hz to find the right balance:
| Topic | Recommended Rate | Rationale |
|---|---|---|
| /odom | 5–10 Hz | Smooth position tracking |
| /scan | 1–5 Hz | LiDAR scans are large |
| /battery_state | 0.1–1 Hz | Slow-changing metric |
| /diagnostics | 0.1–0.5 Hz | Periodic health check |
| /camera/compressed | 1–5 Hz | Bandwidth-limited |
| /cmd_vel | 10–20 Hz | Responsive control |
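To make the effect of a rate cap concrete, the sketch below shows one common way to throttle a stream: forward a message only if at least 1/rate_hz seconds have passed since the last forwarded one. This illustrates the general technique; how the bridge implements throttle_hz internally is not documented here, and the Throttle class is hypothetical.

```python
from typing import Optional


class Throttle:
    """Forward at most rate_hz messages per second; drop the rest."""

    def __init__(self, rate_hz: float):
        if rate_hz <= 0:
            raise ValueError("rate_hz must be positive")
        self.min_interval = 1.0 / rate_hz
        self.last_sent: Optional[float] = None

    def allow(self, now: float) -> bool:
        """Return True if a message arriving at time `now` (seconds) should be forwarded."""
        if self.last_sent is None or now - self.last_sent >= self.min_interval:
            self.last_sent = now
            return True
        return False
```

With this approach, a sensor publishing at 8 Hz behind Throttle(4) has every other message forwarded, halving bandwidth and storage for that topic. Dropped messages are lost rather than queued, which suits telemetry where only the latest value matters.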
Database retention
TimescaleDB stores raw telemetry indefinitely by default. Configure a retention policy to automatically drop old data:
SELECT add_retention_policy('telemetry', INTERVAL '90 days');
For high-frequency data, use continuous aggregates to pre-compute rollups:
CREATE MATERIALIZED VIEW telemetry_hourly
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 hour', timestamp) AS bucket,
device_id,
avg((data->>'battery_percent')::float) AS avg_battery,
min((data->>'battery_percent')::float) AS min_battery,
max((data->>'battery_percent')::float) AS max_battery
FROM telemetry
GROUP BY bucket, device_id;
WebSocket connection pooling
The default API configuration allows 1000 concurrent WebSocket connections. For large dashboards with many simultaneous users, increase this in the API settings:
ROSFIT_WS_MAX_CONNECTIONS=5000
ROSFIT_WS_PING_INTERVAL=30
ROSFIT_WS_PING_TIMEOUT=10
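The ping settings interact with any proxy idle timeout in front of the API. The sketch below assumes ROSFIT_WS_PING_INTERVAL is the number of seconds between server pings and ROSFIT_WS_PING_TIMEOUT is the number of seconds to wait for a pong; those semantics are an assumption based on the variable names, and the function is illustrative.

```python
def ws_timing(ping_interval_s: float, ping_timeout_s: float,
              proxy_idle_timeout_s: float) -> dict:
    """Summarise how ping settings relate to a proxy's idle timeout."""
    return {
        # Pings count as traffic, so they must arrive more often than
        # the proxy's idle cutoff to keep the connection open.
        "keeps_proxy_alive": ping_interval_s < proxy_idle_timeout_s,
        # A dead peer is noticed after one full interval plus the pong wait.
        "worst_case_dead_detection_s": ping_interval_s + ping_timeout_s,
    }
```

Under these assumptions, the values above (30 and 10) behind a proxy with a 60-second idle timeout keep the connection alive and detect a dead client within about 40 seconds.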
QoS level tuning
| QoS | Guarantee | Overhead | Best For |
|---|---|---|---|
| 0 | At most once | Lowest | High-frequency telemetry (scan, camera) |
| 1 | At least once | Medium | Important telemetry (battery, odom), commands |
| 2 | Exactly once | Highest | Critical commands (emergency stop, OTA) |
Use QoS 0 for data you can afford to lose (sensor streams), QoS 1 for data that should arrive but can tolerate duplicates, and QoS 2 only for operations that must execute exactly once.