Troubleshooting

Solutions for the most common issues encountered when running RosFit, from connection failures and error codes to performance optimization.

Connection issues

Robot not appearing in the dashboard

Symptoms: The device is registered but stays in the provisioning state and never transitions to online.

Checklist:

Verify the bridge is running on the robot:

rosfit connect --log-level debug

Check MQTT connectivity from the robot:

mosquitto_pub -h <broker_host> -p 1883 -u <device_id> -P <mqtt_token> -t "test" -m "hello"

Verify the device_id in rosfit-bridge.yaml matches the registered device ID exactly.
Check the broker logs for authentication failures:

docker compose logs emqx | grep "<device_id>"

Confirm network connectivity between the robot and the broker host. Firewalls must allow TCP on port 1883 (or 8883 for TLS).
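
The network check above can be scripted. The following is a minimal sketch using only Python's standard library; `broker_reachable` is a hypothetical helper name, not part of RosFit. It confirms only that a TCP connection opens on the given port, not that MQTT authentication succeeds.

```python
import socket


def broker_reachable(host: str, port: int = 1883, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and applies the timeout to the connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Run it from the robot with the broker host and port 1883 (or 8883 for TLS). A `False` result points at a firewall, routing, or DNS problem rather than at credentials.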

ESP32 MQTT connection failing

Symptoms: ESP32 connects to Wi-Fi but fails to establish MQTT connection.

Common causes and fixes:

Cause	Fix
Wrong broker IP/port	Verify with `ping` and `telnet <host> 1883` from another device
Token expired	Generate a new token: `rosfit devices add --name ...` or `POST /auth/devices/token`
TLS without CA cert	Flash the CA certificate to SPIFFS/LittleFS and load it in the MQTT client
Max packet size exceeded	Reduce payload size or increase EMQX `mqtt.max_packet_size`
Client ID collision	Ensure each ESP32 uses a unique client ID (use `device_id`)

micro-ROS agent not discovering device

Symptoms: The micro-ROS agent is running but the ESP32 node does not appear in ros2 node list.

Fixes:

Check serial connection:

ros2 run micro_ros_agent micro_ros_agent serial --dev /dev/ttyUSB0 -v6

Verify baud rate matches between the ESP32 firmware and agent (default 115200).
Check DDS domain ID — both agent and other ROS 2 nodes must use the same ROS_DOMAIN_ID.
Reset the ESP32 after the agent starts — the handshake must occur during the agent's discovery window.

High latency on telemetry

Symptoms: Dashboard telemetry lags several seconds behind real-time.

Diagnosis:

# Check broker message backlog
docker compose exec emqx emqx_ctl metrics | grep messages

Fixes:

Symptom	Cause	Fix
Consistent 2–5s delay	QoS 2 overhead	Downgrade to QoS 0 or 1 for telemetry
Increasing lag over time	Consumer can't keep up	Scale the message handler or reduce `throttle_hz`
Spiky latency	Network congestion	Check bandwidth; use `throttle_hz` to cap publish rate
High latency on one topic	Large message size (e.g. point clouds)	Reduce resolution or lower `throttle_hz` (a longer interval between messages)
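
To make the effect of `throttle_hz` concrete, here is a minimal sketch of a drop-based rate cap. It illustrates the general technique only and is not RosFit's implementation; the `Throttle` class and its method names are assumptions made for this example.

```python
from typing import Optional


class Throttle:
    """Forward at most `throttle_hz` messages per second; drop the rest."""

    def __init__(self, throttle_hz: float) -> None:
        if throttle_hz <= 0:
            raise ValueError("throttle_hz must be positive")
        # Minimum spacing, in seconds, between two forwarded messages.
        self.min_interval = 1.0 / throttle_hz
        self._last_sent: Optional[float] = None

    def allow(self, now: float) -> bool:
        """Return True if a message arriving at time `now` (seconds) should be forwarded."""
        if self._last_sent is None or now - self._last_sent >= self.min_interval:
            self._last_sent = now
            return True
        return False
```

With `Throttle(2.0)`, a source publishing every 0.125 s has one message in four forwarded. Lowering `throttle_hz` widens `min_interval`, which is why it helps a consumer that cannot keep up.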

WebSocket connection dropped

Symptoms: Dashboard loses real-time updates and shows "Reconnecting..." banner.

Fixes:

Check proxy configuration — nginx/Caddy must be configured for WebSocket upgrade:

location /ws {
    proxy_pass http://api:8000/ws;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 86400;
}

Increase idle timeout — load balancers and proxies often close idle WebSocket connections after 60s. Set proxy_read_timeout to 86400 (24h) or implement ping/pong frames.
Check API memory — a large number of simultaneous WebSocket connections can exhaust memory. Monitor with docker stats.

Common errors

MQTT_CONNECTION_REFUSED

Error: MQTT_CONNECTION_REFUSED (rc=5) or Connection refused: not authorized

Cause: Invalid credentials or the device is not registered.

Fix:

Verify the MQTT token: rosfit devices info <device_id>
Re-generate credentials: POST /auth/devices/token
Check that the EMQX auth backend is running: docker compose logs emqx | grep auth
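
When reading client logs, it helps to know what the other return codes mean. The table below lists the CONNACK return codes defined by MQTT 3.1.1 as a small lookup sketch; `explain_connack` is a hypothetical helper name. MQTT 5 clients report different reason codes, so this mapping is an assumption that holds only for 3.1.1 connections.

```python
# CONNACK return codes as defined by the MQTT 3.1.1 specification.
CONNACK_CODES = {
    0: "Connection accepted",
    1: "Refused: unacceptable protocol version",
    2: "Refused: identifier rejected",
    3: "Refused: server unavailable",
    4: "Refused: bad user name or password",
    5: "Refused: not authorized",
}


def explain_connack(rc: int) -> str:
    """Translate a CONNACK return code into a readable message."""
    return CONNACK_CODES.get(rc, f"Unknown return code {rc}")
```

A code of 4 or 5 points at credentials or authorization, while 3 suggests the broker itself is unavailable.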

ROS_BRIDGE_TIMEOUT

Error: ROS_BRIDGE_TIMEOUT: No message received on /odom for 30s

Cause: The ROS 2 node publishing on the configured topic is not running, or the topic name is wrong.

Fix:

Verify the topic exists: ros2 topic list | grep odom
Check the message type matches: ros2 topic info /odom
Verify QoS compatibility between the publisher and the bridge subscriber
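
For the QoS check, `ros2 topic info /odom --verbose` prints the QoS profile of each publisher and subscriber. The sketch below encodes the two ROS 2 compatibility rules that most often cause silent non-delivery. It covers reliability and durability only (deadline, liveliness, and lease duration also have rules), and `qos_compatible` is a hypothetical helper, not a ROS 2 or RosFit API.

```python
def qos_compatible(
    pub_reliability: str,
    sub_reliability: str,
    pub_durability: str = "volatile",
    sub_durability: str = "volatile",
) -> bool:
    """Return True if a publisher and subscriber can match on these two policies."""
    # A subscriber that requests reliable delivery cannot match a best-effort publisher.
    reliability_ok = not (
        pub_reliability == "best_effort" and sub_reliability == "reliable"
    )
    # A subscriber that requests transient-local history cannot match a volatile publisher.
    durability_ok = not (
        pub_durability == "volatile" and sub_durability == "transient_local"
    )
    return reliability_ok and durability_ok
```

Sensor drivers often publish with best-effort reliability, so a subscriber requesting reliable delivery would receive nothing even though the topic appears in `ros2 topic list`.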

DEVICE_SHADOW_CONFLICT

Error: DEVICE_SHADOW_CONFLICT: Version mismatch (expected 42, got 41)

Cause: Two writers tried to update the shadow simultaneously, creating a version conflict.

Fix: This is a transient error. The bridge automatically retries with the latest version. If it persists:

Read the current shadow: GET /shadows/{device_id}
Retry the update with the correct version number
Reduce the number of concurrent shadow writers
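
The read-then-retry steps above follow the usual optimistic concurrency pattern. Here is a minimal sketch against an in-memory stand-in; `InMemoryShadowStore`, `VersionConflict`, and `update_with_retry` are all hypothetical names, and the real shadow API's request and response shapes are not assumed here.

```python
class VersionConflict(Exception):
    """Raised when a write carries a stale version number."""


class InMemoryShadowStore:
    """Stand-in for a shadow service that rejects writes with a stale version."""

    def __init__(self) -> None:
        self.version = 0
        self.state: dict = {}

    def read(self):
        # Return a copy so callers cannot mutate the stored state directly.
        return self.version, dict(self.state)

    def write(self, expected_version: int, patch: dict) -> int:
        if expected_version != self.version:
            raise VersionConflict(
                f"expected {self.version}, got {expected_version}"
            )
        self.state.update(patch)
        self.version += 1
        return self.version


def update_with_retry(store, patch: dict, max_attempts: int = 3) -> int:
    """Re-read the current version before each attempt; give up after max_attempts."""
    for _ in range(max_attempts):
        version, _state = store.read()
        try:
            return store.write(version, patch)
        except VersionConflict:
            continue  # another writer won the race; read again and retry
    raise VersionConflict(f"gave up after {max_attempts} attempts")
```

Bounding the number of attempts keeps a busy shadow from causing an endless retry loop; if the limit is hit regularly, reducing the number of concurrent writers is the more durable fix.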

CERTIFICATE_EXPIRED

Error: CERTIFICATE_EXPIRED: TLS handshake failed — certificate has expired

Cause: The device or broker TLS certificate has passed its expiry date.

Fix:

Check certificate expiry: openssl x509 -in cert.crt -noout -dates
Regenerate certificates: rosfit certs generate --cn <device_id> --days 730
Provision the new certificate: rosfit certs provision <device_id> --deploy
For the broker certificate, update and restart EMQX
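
To catch expiry before it causes an outage, the output of the `openssl x509 ... -dates` command above can be checked on a schedule. This is a minimal sketch using Python's standard `ssl` module; `parse_not_after` and `days_until_expiry` are hypothetical helper names, and the sketch assumes the usual `notAfter=<date> GMT` line in the openssl output.

```python
import ssl
import time
from typing import Optional


def parse_not_after(openssl_dates_output: str) -> str:
    """Extract the notAfter value from `openssl x509 -noout -dates` output."""
    for line in openssl_dates_output.splitlines():
        if line.startswith("notAfter="):
            return line[len("notAfter="):].strip()
    raise ValueError("no notAfter line found")


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return days remaining until the certificate expires (negative if expired)."""
    # cert_time_to_seconds parses dates such as "Jan  5 09:34:43 2018 GMT".
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400.0
```

A result below some threshold (for example 30 days, an arbitrary choice) would be the signal to regenerate and provision a new certificate.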

RATE_LIMIT_EXCEEDED

Error: RATE_LIMIT_EXCEEDED: 429 Too Many Requests

Cause: The device or client is sending requests faster than the configured rate limit.

Fix:

Add throttle_hz to high-frequency topics in the bridge config
Increase the rate limit in API settings (default: 100 req/s per device, 1000 req/s per user)
Use WebSocket for high-frequency data instead of REST polling

Performance tuning

MQTT keep-alive

The default keep-alive interval is 60 seconds. For battery-powered devices, increase this to reduce wake-ups. For latency-sensitive robots, decrease it for faster offline detection.

Use Case	Keep-Alive	Offline Detection
Real-time robot	15s	~22s (1.5× keep-alive)
Default	60s	~90s
Battery-powered sensor	300s	~450s

Configure in the bridge:

connection:
  keep_alive_sec: 15
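
The offline-detection figures in the table follow from the MQTT rule that a broker may drop a client after one and a half keep-alive intervals without traffic. A small sketch of that arithmetic, with `offline_detection_sec` as a hypothetical helper name:

```python
def offline_detection_sec(keep_alive_sec: float) -> float:
    """Approximate worst-case time before the broker marks a silent client offline."""
    # MQTT brokers wait 1.5 times the keep-alive interval before disconnecting.
    return 1.5 * keep_alive_sec
```

For the three rows above this gives 22.5 s, 90 s, and 450 s respectively.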

Telemetry rate optimization

Publishing telemetry too frequently wastes bandwidth and storage; publishing too infrequently misses events. Use throttle_hz to find the right balance:

Topic	Recommended Rate	Rationale
`/odom`	5–10 Hz	Smooth position tracking
`/scan`	1–5 Hz	LiDAR scans are large
`/battery_state`	0.1–1 Hz	Slow-changing metric
`/diagnostics`	0.1–0.5 Hz	Periodic health check
`/camera/compressed`	1–5 Hz	Bandwidth-limited
`/cmd_vel`	10–20 Hz	Responsive control
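
To see why large topics get lower rates, it can help to estimate the bandwidth each one consumes. The sketch below is simple arithmetic; `bandwidth_kbit_per_s` is a hypothetical helper, and any message size you pass in is your own measurement or assumption, not a RosFit figure.

```python
def bandwidth_kbit_per_s(rate_hz: float, message_bytes: int) -> float:
    """Estimate payload bandwidth for one topic, ignoring protocol overhead."""
    # bytes per second, converted to bits, then to kilobits
    return rate_hz * message_bytes * 8 / 1000
```

For example, a hypothetical 30,000-byte scan message published at 5 Hz needs about 1,200 kbit/s, while a 200-byte message at the same rate needs 8 kbit/s.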

Database retention

TimescaleDB stores raw telemetry indefinitely by default. Configure a retention policy to automatically drop old data:

SELECT add_retention_policy('telemetry', INTERVAL '90 days');

For high-frequency data, use continuous aggregates to pre-compute rollups:

CREATE MATERIALIZED VIEW telemetry_hourly
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 hour', timestamp) AS bucket,
  device_id,
  avg((data->>'battery_percent')::float) AS avg_battery,
  min((data->>'battery_percent')::float) AS min_battery,
  max((data->>'battery_percent')::float) AS max_battery
FROM telemetry
GROUP BY bucket, device_id;

WebSocket connection pooling

The default API configuration allows 1000 concurrent WebSocket connections. For large dashboards with many simultaneous users, increase this in the API settings:

ROSFIT_WS_MAX_CONNECTIONS=5000
ROSFIT_WS_PING_INTERVAL=30
ROSFIT_WS_PING_TIMEOUT=10

QoS level tuning

QoS	Guarantee	Overhead	Best For
0	At most once	Lowest	High-frequency telemetry (scan, camera)
1	At least once	Medium	Important telemetry (battery, odom), commands
2	Exactly once	Highest	Critical commands (emergency stop, OTA)

Use QoS 0 for data you can afford to lose (sensor streams), QoS 1 for data that should arrive but can tolerate duplicates, and QoS 2 only for operations that must execute exactly once.
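
The rule of thumb above can be written as a small decision function. This is a sketch of the selection logic only; `choose_qos` is a hypothetical helper and does not correspond to a RosFit setting.

```python
def choose_qos(can_lose: bool, tolerates_duplicates: bool) -> int:
    """Pick an MQTT QoS level from two properties of the data."""
    if can_lose:
        return 0  # at most once: cheapest, suits sensor streams
    if tolerates_duplicates:
        return 1  # at least once: delivery guaranteed, duplicates possible
    return 2  # exactly once: highest overhead, for must-not-repeat operations
```

A LiDAR stream would map to `choose_qos(True, True)`, a battery reading to `choose_qos(False, True)`, and an emergency stop to `choose_qos(False, False)`.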

Frequently asked questions