Troubleshooting

Solutions for the most common issues encountered when running RosFit, from connection failures and error codes to performance optimization.

Connection issues

Robot not appearing in the dashboard

Symptoms: The device is registered but stays in the provisioning state and never transitions to online.

Checklist:

  1. Verify the bridge is running on the robot:

     rosfit connect --log-level debug

  2. Check MQTT connectivity from the robot:

     mosquitto_pub -h <broker_host> -p 1883 -u <device_id> -P <mqtt_token> -t "test" -m "hello"

  3. Verify the device_id in rosfit-bridge.yaml matches the registered device ID exactly.

  4. Check the broker logs for authentication failures (substitute your own device ID for bot-01):

     docker compose logs emqx | grep "bot-01"

  5. Confirm network connectivity between the robot and the broker host. Firewalls must allow TCP on port 1883 (or 8883 for TLS).
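
The last step of the checklist can also be scripted when telnet or the mosquitto clients are not installed on the robot. The following is a minimal sketch using only the Python standard library. The function name can_connect is illustrative and not part of RosFit.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Calling can_connect("<broker_host>", 1883) and can_connect("<broker_host>", 8883) from the robot shows which of the two ports is reachable. A False result only indicates that the TCP handshake failed. It does not distinguish a firewall drop from a broker that is not listening.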

ESP32 MQTT connection failing

Symptoms: ESP32 connects to Wi-Fi but fails to establish MQTT connection.

Common causes and fixes:

| Cause | Fix |
|---|---|
| Wrong broker IP/port | Verify with ping and telnet <host> 1883 from another device |
| Token expired | Generate a new token: rosfit devices add --name ... or POST /auth/devices/token |
| TLS without CA cert | Flash the CA certificate to SPIFFS/LittleFS and load it in the MQTT client |
| Max packet size exceeded | Reduce payload size or increase EMQX mqtt.max_packet_size |
| Client ID collision | Ensure each ESP32 uses a unique client ID (use device_id) |

micro-ROS agent not discovering device

Symptoms: The micro-ROS agent is running but the ESP32 node does not appear in ros2 node list.

Fixes:

  1. Check serial connection:

     ros2 run micro_ros_agent micro_ros_agent serial --dev /dev/ttyUSB0 -v6

  2. Verify baud rate matches between the ESP32 firmware and agent (default 115200).

  3. Check DDS domain ID: both agent and other ROS 2 nodes must use the same ROS_DOMAIN_ID.

  4. Reset the ESP32 after the agent starts: the handshake must occur during the agent's discovery window.

High latency on telemetry

Symptoms: Dashboard telemetry lags several seconds behind real-time.

Diagnosis:

# Check broker message backlog
docker compose exec emqx emqx_ctl metrics | grep messages

Fixes:

| Symptom | Cause | Fix |
|---|---|---|
| Consistent 2–5s delay | QoS 2 overhead | Downgrade to QoS 0 or 1 for telemetry |
| Increasing lag over time | Consumer can't keep up | Scale the message handler or reduce throttle_hz |
| Spiky latency | Network congestion | Check bandwidth; use throttle_hz to cap publish rate |
| High latency on one topic | Large message size (e.g. point clouds) | Reduce resolution or lower throttle_hz for that topic |

WebSocket connection dropped

Symptoms: Dashboard loses real-time updates and shows "Reconnecting..." banner.

Fixes:

  1. Check proxy configuration. nginx/Caddy must be configured for the WebSocket upgrade:

     location /ws {
         proxy_pass http://api:8000/ws;
         proxy_http_version 1.1;
         proxy_set_header Upgrade $http_upgrade;
         proxy_set_header Connection "upgrade";
         proxy_read_timeout 86400;
     }

  2. Increase idle timeout. Load balancers and proxies often close idle WebSocket connections after 60s. Set proxy_read_timeout to 86400 (24h) or implement ping/pong frames.

  3. Check API memory. A large number of simultaneous WebSocket connections can exhaust memory. Monitor with docker stats.

Common errors

MQTT_CONNECTION_REFUSED

Error: MQTT_CONNECTION_REFUSED (rc=5) or Connection refused: not authorized

Cause: Invalid credentials or the device is not registered.

Fix:

  1. Verify the MQTT token: rosfit devices info <device_id>
  2. Re-generate credentials: POST /auth/devices/token
  3. Check that the EMQX auth backend is running: docker compose logs emqx | grep auth

ROS_BRIDGE_TIMEOUT

Error: ROS_BRIDGE_TIMEOUT: No message received on /odom for 30s

Cause: The ROS 2 node publishing on the configured topic is not running, or the topic name is wrong.

Fix:

  1. Verify the topic exists: ros2 topic list | grep odom
  2. Check the message type matches: ros2 topic info /odom
  3. Verify QoS compatibility between the publisher and the bridge subscriber

DEVICE_SHADOW_CONFLICT

Error: DEVICE_SHADOW_CONFLICT: Version mismatch (expected 42, got 41)

Cause: Two writers tried to update the shadow simultaneously, creating a version conflict.

Fix: This is a transient error. The bridge automatically retries with the latest version. If it persists:

  1. Read the current shadow: GET /shadows/{device_id}
  2. Retry the update with the correct version number
  3. Reduce the number of concurrent shadow writers
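
The read-then-retry steps above follow the usual optimistic-concurrency pattern. The sketch below illustrates the pattern against an in-memory stand-in, not the real RosFit API: InMemoryShadowStore, VersionConflict, and update_shadow are hypothetical names, and a real client would replace read and write with calls to the shadow endpoints.

```python
class VersionConflict(Exception):
    """Raised when the submitted version is not the store's current version."""


class InMemoryShadowStore:
    """Hypothetical stand-in for reading and writing a device shadow."""

    def __init__(self):
        self.version = 0
        self.state = {}

    def read(self):
        # Return the current version together with a copy of the state.
        return self.version, dict(self.state)

    def write(self, expected_version, new_state):
        # Reject the write if another writer has bumped the version.
        if expected_version != self.version:
            raise VersionConflict(
                f"expected {self.version}, got {expected_version}"
            )
        self.state = dict(new_state)
        self.version += 1
        return self.version


def update_shadow(store, changes, max_attempts=5):
    """Read the latest version, apply changes, and retry on conflict."""
    for _ in range(max_attempts):
        version, state = store.read()
        state.update(changes)
        try:
            return store.write(version, state)
        except VersionConflict:
            continue  # another writer won the race; re-read and try again
    raise RuntimeError(f"shadow update failed after {max_attempts} attempts")
```

Bounding the number of attempts keeps a persistently contended shadow from looping forever, which is the case where reducing the number of concurrent writers is the better fix.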

CERTIFICATE_EXPIRED

Error: CERTIFICATE_EXPIRED: TLS handshake failed — certificate has expired

Cause: The device or broker TLS certificate has passed its expiry date.

Fix:

  1. Check certificate expiry: openssl x509 -in cert.crt -noout -dates
  2. Regenerate certificates: rosfit certs generate --cn <device_id> --days 730
  3. Provision the new certificate: rosfit certs provision <device_id> --deploy
  4. For the broker certificate, update and restart EMQX
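
To catch expiry before the handshake starts failing, the output of the openssl command in step 1 can be checked from a script. This is a minimal sketch that assumes the standard notAfter=... GMT line format printed by openssl x509 -noout -dates and an English locale for month names. The function name days_until_expiry is illustrative.

```python
from datetime import datetime, timezone


def days_until_expiry(openssl_dates_output: str, now: datetime) -> int:
    """Return whole days until the notAfter date (negative if expired)."""
    for line in openssl_dates_output.splitlines():
        if line.startswith("notAfter="):
            raw = line.split("=", 1)[1].strip()
            # openssl prints e.g. "Jun  1 12:00:00 2026 GMT"; drop the zone
            # suffix and attach UTC explicitly instead of relying on %Z.
            if raw.endswith(" GMT"):
                raw = raw[:-4]
            expires = datetime.strptime(raw, "%b %d %H:%M:%S %Y")
            expires = expires.replace(tzinfo=timezone.utc)
            return (expires - now).days
    raise ValueError("no notAfter line found")
```

Pass the captured command output together with datetime.now(timezone.utc). A small or negative result means the certificate should be regenerated and provisioned as described above.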

RATE_LIMIT_EXCEEDED

Error: RATE_LIMIT_EXCEEDED: 429 Too Many Requests

Cause: The device or client is sending requests faster than the configured rate limit.

Fix:

  1. Add throttle_hz to high-frequency topics in the bridge config
  2. Increase the rate limit in API settings (default: 100 req/s per device, 1000 req/s per user)
  3. Use WebSocket for high-frequency data instead of REST polling

Performance tuning

MQTT keep-alive

The default keep-alive interval is 60 seconds. For battery-powered devices, increase this to reduce wake-ups. For latency-sensitive robots, decrease it for faster offline detection.

| Use Case | Keep-Alive | Offline Detection |
|---|---|---|
| Real-time robot | 15s | ~22s (1.5× keep-alive) |
| Default | 60s | ~90s |
| Battery-powered sensor | 300s | ~450s |

Configure in the bridge:

connection:
  keep_alive_sec: 15
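
The offline-detection column in the table above is 1.5 times the keep-alive interval, which is the grace period the MQTT specification gives a broker before it drops a silent client. The sketch below works the arithmetic in both directions. The function names are illustrative and not part of RosFit.

```python
import math


def offline_detection_seconds(keep_alive_sec: float) -> float:
    """Worst-case time before the broker treats a silent client as offline."""
    return keep_alive_sec * 1.5


def keep_alive_for_target(target_detection_sec: float) -> int:
    """Largest whole-second keep-alive that meets a detection-time target."""
    # Keep-alive must be at least 1 second to remain enabled.
    return max(1, math.floor(target_detection_sec / 1.5))
```

For example, a robot that must be flagged offline within 30 seconds would use the value returned by keep_alive_for_target(30) as keep_alive_sec.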

Telemetry rate optimization

Publishing telemetry too frequently wastes bandwidth and storage, while publishing too infrequently misses events. Use throttle_hz to find the right balance:

| Topic | Recommended Rate | Rationale |
|---|---|---|
| /odom | 5–10 Hz | Smooth position tracking |
| /scan | 1–5 Hz | LiDAR scans are large |
| /battery_state | 0.1–1 Hz | Slow-changing metric |
| /diagnostics | 0.1–0.5 Hz | Periodic health check |
| /camera/compressed | 1–5 Hz | Bandwidth-limited |
| /cmd_vel | 10–20 Hz | Responsive control |
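
To make the effect of a rate cap concrete, here is a minimal sketch of a drop-based throttle. It assumes throttle_hz behaves as a simple maximum forward rate. The bridge's actual implementation is not documented here and may differ, and the Throttle class is hypothetical.

```python
class Throttle:
    """Pass at most `hz` messages per second and drop the rest."""

    def __init__(self, hz: float):
        self.min_interval = 1.0 / hz  # minimum spacing between forwarded messages
        self.last_sent = None         # timestamp of the last forwarded message

    def allow(self, now: float) -> bool:
        """Return True if a message arriving at time `now` should be forwarded."""
        if self.last_sent is None or now - self.last_sent >= self.min_interval:
            self.last_sent = now
            return True
        return False
```

In a live publisher, `now` would come from time.monotonic(). With this scheme a source publishing faster than the cap has its excess messages dropped rather than queued, which is why throttling reduces lag instead of only delaying it.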

Database retention

TimescaleDB stores raw telemetry indefinitely by default. Configure a retention policy to automatically drop old data:

SELECT add_retention_policy('telemetry', INTERVAL '90 days');

For high-frequency data, use continuous aggregates to pre-compute rollups:

CREATE MATERIALIZED VIEW telemetry_hourly
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 hour', timestamp) AS bucket,
  device_id,
  avg((data->>'battery_percent')::float) AS avg_battery,
  min((data->>'battery_percent')::float) AS min_battery,
  max((data->>'battery_percent')::float) AS max_battery
FROM telemetry
GROUP BY bucket, device_id;

WebSocket connection pooling

The default API configuration allows 1000 concurrent WebSocket connections. For large dashboards with many simultaneous users, increase this in the API settings:

ROSFIT_WS_MAX_CONNECTIONS=5000
ROSFIT_WS_PING_INTERVAL=30
ROSFIT_WS_PING_TIMEOUT=10

QoS level tuning

| QoS | Guarantee | Overhead | Best For |
|---|---|---|---|
| 0 | At most once | Lowest | High-frequency telemetry (scan, camera) |
| 1 | At least once | Medium | Important telemetry (battery, odom), commands |
| 2 | Exactly once | Highest | Critical commands (emergency stop, OTA) |

Use QoS 0 for data you can afford to lose (sensor streams), QoS 1 for data that should arrive but can tolerate duplicates, and QoS 2 only for operations that must execute exactly once.
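
The rule of thumb above reduces to two questions about each message stream. The following sketch encodes it as a small helper. The function name choose_qos and its parameters are illustrative only.

```python
def choose_qos(can_lose: bool, can_duplicate: bool) -> int:
    """Pick an MQTT QoS level from loss and duplicate tolerance."""
    if can_lose:
        return 0  # at most once: sensor streams such as scan or camera
    if can_duplicate:
        return 1  # at least once: must arrive, duplicates are harmless
    return 2      # exactly once: must arrive and must not repeat
```

A LiDAR stream would map to choose_qos(True, True), battery telemetry to choose_qos(False, True), and an emergency stop to choose_qos(False, False).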