serve-robotics/reward-function

calculate_reward.py — MCAP Reward Function Calculator

Reads a protobuf-encoded MCAP robotics log file and computes a per-timestep composite reward for sidewalk navigation:

r = r_centering + r_heading + r_speed + r_obstacle + r_jerk + r_acc + r_collision

Usage

# Activate the virtual environment
source .venv/bin/activate

# Basic run (prints summary to stdout)
python calculate_reward.py path/to/file.mcap

# Write per-timestep results to CSV
python calculate_reward.py path/to/file.mcap --csv rewards_output.csv

Dependencies

  • Python 3.8+
  • numpy — vector math (dot products, norms, array operations)
  • mcap-protobuf-support — reads and deserializes protobuf messages from .mcap files

Install into the virtual environment:

python -m venv .venv
source .venv/bin/activate
pip install numpy mcap mcap-protobuf-support

MCAP Topics Consumed

The script reads six topics from the MCAP file. All other topics are ignored.

| Topic | Protobuf Type | Publish Rate | Purpose |
|---|---|---|---|
| /cognition/sta_boundary | pmx.STABoundary | ~3 Hz | Sidewalk centerline, left boundary, and right boundary as polylines (~12 points each, ~1.5 m spacing, ~16 m lookahead) |
| /odom | pmx.Odometry | ~20 Hz | Robot pose (position + quaternion orientation) and twist (linear + angular velocity). Primary timeline: one reward is computed per odom message |
| /speed_limit | pmx.msgs.speed_governor.SpeedLimit | Sporadic | Target maximum speed (max_speed field, in m/s) |
| /move_serve/planner_state | pmx.msgs.move_serve.PlannerState | ~10 Hz | Obstacle distances via plan_metrics sub-message (distance to closest stationary and dynamic obstacles) |
| /move_serve/imu_jerk_filtered | pmx.msgs.std_msgs.FloatValueStamped | ~10 Hz | Filtered jerk scalar (rate of change of acceleration, from IMU) |
| /move_serve/proximity/state | pmx.msgs.proximity.State | ~10 Hz | Collision state via in_collision boolean field |

Time Synchronization

Topics publish at different rates. The script uses /odom as the primary clock and iterates over every odom message. For each odom timestamp, it performs a binary search on each of the other five topic lists to find the message closest in time. This is handled by the find_closest() function, which runs in O(log n) per lookup.

All messages are stored as TimedValue dataclass instances that pair a float timestamp (seconds since Unix epoch) with the decoded protobuf message. Lists are sorted by timestamp after loading, which is a prerequisite for the binary search.
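The nearest-in-time lookup described above can be sketched with the standard-library bisect module. This is a minimal illustration, not the script's actual code: the TimedValue field names (t, value) and the find_closest() signature taking a precomputed timestamp list are assumptions.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class TimedValue:
    t: float    # seconds since Unix epoch (field name is an assumption)
    value: Any  # decoded protobuf message, or any payload for this sketch


def find_closest(times: List[float], items: List[TimedValue], t: float) -> Optional[TimedValue]:
    """Return the item whose timestamp is nearest to t.

    `times` must be the sorted timestamps of `items`, built once per topic,
    so that each lookup stays O(log n).
    """
    if not items:
        return None
    i = bisect_left(times, t)  # index of the first timestamp >= t
    if i == 0:
        return items[0]
    if i == len(items):
        return items[-1]
    before, after = items[i - 1], items[i]
    # Ties go to the earlier message.
    return before if (t - before.t) <= (after.t - t) else after


# Usage: sort once after loading, then query per odom timestamp.
msgs = sorted([TimedValue(2.0, "c"), TimedValue(0.0, "a"), TimedValue(1.0, "b")], key=lambda m: m.t)
stamps = [m.t for m in msgs]
nearest = find_closest(stamps, msgs, 1.6)
```

Building the timestamp list once per topic matters: rebuilding it inside every call would cost O(n) and cancel the benefit of the binary search.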


Protobuf Timestamp Conversion

Protobuf timestamps have two integer fields: seconds (Unix epoch) and nanos (0–999999999). The proto_ts() helper combines them into a single float for arithmetic:

float_time = seconds + nanos * 1e-9
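As a small sketch of that helper, assuming only that the timestamp object exposes seconds and nanos attributes (a SimpleNamespace stands in for the protobuf message here):

```python
from types import SimpleNamespace


def proto_ts(ts) -> float:
    """Combine a protobuf-style timestamp into float seconds since the Unix epoch."""
    return ts.seconds + ts.nanos * 1e-9


# Stand-in for a decoded protobuf timestamp (illustrative values only).
stamp = SimpleNamespace(seconds=1_700_000_000, nanos=250_000_000)
t = proto_ts(stamp)
```

One consequence of using a single float: at present-day epoch values a 64-bit float resolves time to roughly a quarter of a microsecond, which is ample for topics published at 3 to 20 Hz.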

Robot State Extraction (from /odom)

At each timestep, three quantities are extracted from the odom message:

  • Position (robot_xy) — 2D ground-plane position from pose.translation.x and pose.translation.y (metres, global frame). The z-component is ignored since all reward geometry is 2D.
  • Yaw (robot_yaw) — heading angle in radians (−π to +π) extracted from the pose quaternion (pose.rotation) using the standard quaternion-to-Euler conversion via quat_to_yaw(). The formula isolates yaw from the quaternion's (x, y, z, w) components using atan2(2(wz + xy), 1 - 2(y² + z²)).
  • Speed — ground-plane speed magnitude from sqrt(twist.linear.x² + twist.linear.y²) (m/s).
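The yaw and speed extraction above can be sketched in a few lines. The function names follow the text (quat_to_yaw); ground_speed is a hypothetical helper name, and the arguments are plain floats rather than protobuf fields.

```python
import math


def quat_to_yaw(x: float, y: float, z: float, w: float) -> float:
    """Yaw in radians, in [-pi, pi], from a unit quaternion (x, y, z, w)."""
    return math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))


def ground_speed(vx: float, vy: float) -> float:
    """Ground-plane speed magnitude in m/s from the twist's linear x and y."""
    return math.hypot(vx, vy)


# A pure 90-degree rotation about z has z = sin(pi/4), w = cos(pi/4).
yaw = quat_to_yaw(0.0, 0.0, math.sin(math.pi / 4), math.cos(math.pi / 4))
```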

Acceleration Calculation

There is no direct acceleration topic in the MCAP file. Instead, acceleration is derived from odom speed using finite differences:

acceleration = (speed_current - speed_previous) / (t_current - t_previous)

The first timestep defaults to 0.0 acceleration since there is no previous sample. A guard ensures the time delta is non-zero (> 1e-6 s) to avoid division by zero.
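A minimal sketch of this finite-difference step over whole lists of samples. The function name is hypothetical, and falling back to 0.0 when the guard rejects a time delta is an assumption; the text above only says the guard exists.

```python
from typing import List


def finite_difference_acceleration(times: List[float], speeds: List[float],
                                   min_dt: float = 1e-6) -> List[float]:
    """Backward-difference acceleration (m/s^2) for each speed sample."""
    acc = []
    for i in range(len(speeds)):
        if i == 0:
            acc.append(0.0)  # no previous sample at the first timestep
            continue
        dt = times[i] - times[i - 1]
        if dt > min_dt:
            acc.append((speeds[i] - speeds[i - 1]) / dt)
        else:
            acc.append(0.0)  # assumption: treat a near-zero dt as zero acceleration
    return acc


# Four odom samples; the last two share a timestamp and trigger the guard.
a = finite_difference_acceleration([0.0, 0.05, 0.10, 0.10], [0.0, 0.1, 0.1, 0.5])
```

Because the difference is taken over one odom period (about 50 ms at ~20 Hz), noise in the speed estimate is amplified, which is worth keeping in mind when tuning the acceleration penalty.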


Reward Components

r_centering — Sidewalk Centering Reward

Range: [0.0, 1.0] Sources: /cognition/sta_boundary + /odom

Rewards the robot for staying near the center of the sidewalk. The calculation:

  1. Project the robot's position onto the centerline, left boundary, and right boundary polylines using closest_point_on_polyline() to get dist_to_center, dist_to_left, and dist_to_right.
  2. Estimate the sidewalk half-width as (dist_to_left + dist_to_right) / 2.
  3. Compute ratio = dist_to_center / half_width, clamped to [0, 1].
  4. Return 1.0 - ratio.

The result is 1.0 when perfectly centered and falls linearly to 0.0 at the boundary edge.
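The four steps above can be sketched as follows, taking the three distances as already computed. The function name and the zero-width fallback are assumptions, not taken from the script.

```python
def centering_reward(dist_to_center: float, dist_to_left: float,
                     dist_to_right: float, eps: float = 1e-6) -> float:
    """1.0 on the centerline, falling linearly to 0.0 at the sidewalk edge."""
    half_width = 0.5 * (dist_to_left + dist_to_right)
    if half_width < eps:
        return 0.0  # assumption: degenerate sidewalk width yields no reward
    ratio = min(max(dist_to_center / half_width, 0.0), 1.0)
    return 1.0 - ratio


# Robot 0.5 m off the centerline of a 2 m wide sidewalk.
r = centering_reward(dist_to_center=0.5, dist_to_left=0.5, dist_to_right=1.5)
```

Note that dist_to_left + dist_to_right equals the full sidewalk width only while the robot is between the two boundaries, so the half-width estimate grows if the robot strays outside them.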

r_heading — Heading Alignment Reward

Range: [−1.0, +1.0] Sources: /cognition/sta_boundary + /odom

Rewards the robot for heading in the direction of intended travel. The intended direction is defined as the vector from the robot's position to a point 3 metres ahead along the centerline polyline (obtained from polyline_direction_at()). This 3m lookahead smooths out sharp segment-to-segment direction changes and better represents where the robot should be aiming. The robot's heading is converted from yaw to a 2D unit vector [cos(yaw), sin(yaw)].

The reward is the dot product of the two unit vectors, which equals cos(angle_between_them):

  • +1.0 — perfectly aligned with the intended direction
  • 0.0 — perpendicular to the intended direction
  • −1.0 — facing directly backwards
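A short sketch of the dot-product computation, written with plain tuples instead of numpy arrays to stay dependency-free. The function name and the fallback for a zero-length direction are assumptions.

```python
import math
from typing import Tuple


def heading_reward(robot_yaw: float, intended_dir: Tuple[float, float]) -> float:
    """Cosine of the angle between the robot heading and the intended direction."""
    dx, dy = intended_dir
    norm = math.hypot(dx, dy)
    if norm < 1e-9:
        return 0.0  # assumption: no usable direction yields a neutral reward
    dot = (math.cos(robot_yaw) * dx + math.sin(robot_yaw) * dy) / norm
    return max(-1.0, min(1.0, dot))  # clamp away floating-point overshoot


aligned = heading_reward(0.0, (1.0, 0.0))
sideways = heading_reward(math.pi / 2, (1.0, 0.0))
reversed_ = heading_reward(math.pi, (1.0, 0.0))
```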

r_speed — Speed Matching Reward

Range: [0.0, 1.0] Sources: /odom + /speed_limit

Rewards the robot for travelling at the target max speed. The formula:

r_speed = max(0.0, 1.0 - |current_speed - max_speed| / max_speed)

This gives 1.0 when speed equals max_speed, and falls off linearly in both directions (too slow or too fast). The reward reaches 0.0 at a standstill and at 2× max_speed; beyond 2× max_speed it is clamped at 0.0 so it never goes negative.

If max_speed is ~0, the reward is 1.0 only if the robot is also stopped.
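The formula and its near-zero special case can be sketched as below. The threshold eps and the 0.0 returned when the limit is ~0 but the robot is still moving are assumptions; the text only states when the reward is 1.0 in that case.

```python
def speed_reward(current_speed: float, max_speed: float, eps: float = 1e-3) -> float:
    """1.0 at the target speed, falling linearly to 0.0 at 0 and at 2x the target."""
    if max_speed < eps:
        # Limit is effectively zero: reward only a stopped robot.
        return 1.0 if abs(current_speed) < eps else 0.0  # 0.0 branch is an assumption
    return max(0.0, 1.0 - abs(current_speed - max_speed) / max_speed)


half_speed = speed_reward(0.75, 1.5)
```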

r_obstacle — Obstacle Proximity Penalty

Range: [−5.0, 0.0] Sources: /move_serve/planner_state

A large negative reward for being too close to obstacles. Uses the planner's pre-computed distances to the nearest stationary obstacle (walls, poles, curbs) and dynamic obstacle (pedestrians, cars, bikes). The minimum of the two is used.

Three zones with tunable thresholds:

| Zone | Distance | Penalty |
|---|---|---|
| Safe | ≥ 1.0 m | 0.0 (no penalty) |
| Transition | 0.2 m – 1.0 m | Linear interpolation from −5.0 to 0.0 |
| Critical | ≤ 0.2 m | −5.0 (full penalty) |

If the planner reports no obstacles (distance fields are 0 or absent), the penalty is 0.0.
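A sketch of the three-zone penalty using the default thresholds. The function name is hypothetical, and handling each distance field independently (using the other field when one is 0 or absent) is an assumption about how the missing-data rule is applied.

```python
from typing import Optional

SAFE_DIST = 1.0      # m, no penalty at or above this distance
CRITICAL_DIST = 0.2  # m, full penalty at or below this distance
MAX_PENALTY = -5.0


def obstacle_penalty(stationary_dist: Optional[float], dynamic_dist: Optional[float]) -> float:
    """Penalty in [-5.0, 0.0] from the nearer of the two reported obstacles."""
    # A distance of 0 or None is read as "no obstacle reported" (per-field: assumption).
    valid = [d for d in (stationary_dist, dynamic_dist) if d is not None and d > 0.0]
    if not valid:
        return 0.0
    d = min(valid)
    if d >= SAFE_DIST:
        return 0.0
    if d <= CRITICAL_DIST:
        return MAX_PENALTY
    # Transition zone: 0.0 at SAFE_DIST, MAX_PENALTY at CRITICAL_DIST.
    return MAX_PENALTY * (SAFE_DIST - d) / (SAFE_DIST - CRITICAL_DIST)


mid_zone = obstacle_penalty(0.6, 2.0)
```

Reading 0 as "no obstacle" means a genuinely touching obstacle would go unpenalised by this term; the separate collision penalty covers that case.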

r_jerk — Jerk Penalty

Range: (−∞, 0.0] Sources: /move_serve/imu_jerk_filtered

Penalises change in acceleration (jerk) to encourage smooth motion. The IMU provides a pre-filtered jerk scalar value.

r_jerk = -0.5 * |jerk_value|

The scaling coefficient (0.5) controls penalty strength. The reward is always ≤ 0.

r_acc — Acceleration Penalty

Range: (−∞, 0.0] Sources: Derived from /odom speed via finite differences

Penalises large accelerations to discourage harsh speed changes.

r_acc = -0.3 * |acceleration|

The scaling coefficient (0.3) controls penalty strength. The reward is always ≤ 0.

r_collision — Collision Penalty

Range: [−10.0, 0.0] Sources: /move_serve/proximity/state

A severe negative reward for being in collision. Uses the in_collision boolean field from the proximity state message. This penalty is particularly important for non-holonomic robots where collisions can cause wheel slip or require complex recovery maneuvers.

r_collision = -10.0 if in_collision else 0.0

The penalty is designed to strongly dominate the reward signal during collision events, ensuring the robot learns to avoid collisions at all costs.
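For reference, the three penalty terms above (r_jerk, r_acc, r_collision) and the final sum can be sketched together. The function names and the dictionary used to hold the components are illustrative assumptions; the constants are the documented defaults.

```python
from typing import Dict

JERK_SCALE = 0.5
ACC_SCALE = 0.3
COLLISION_PENALTY = -10.0


def jerk_penalty(jerk_value: float) -> float:
    return -JERK_SCALE * abs(jerk_value)


def acc_penalty(acceleration: float) -> float:
    return -ACC_SCALE * abs(acceleration)


def collision_penalty(in_collision: bool) -> float:
    return COLLISION_PENALTY if in_collision else 0.0


def total_reward(components: Dict[str, float]) -> float:
    """Unweighted sum of all reward components for one timestep."""
    return sum(components.values())


# One illustrative timestep (values are made up for the example).
step = {
    "r_centering": 0.8,
    "r_heading": 0.9,
    "r_speed": 0.5,
    "r_obstacle": 0.0,
    "r_jerk": jerk_penalty(0.2),
    "r_acc": acc_penalty(-0.5),
    "r_collision": collision_penalty(False),
}
r_total = total_reward(step)
```

Because the positive terms sum to at most 3.0 while a collision alone contributes −10.0, any timestep in collision has a negative total regardless of the other components.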


Geometry: Polyline Operations

The STA boundary data stores sidewalk edges as polylines — ordered lists of 2D points connected by line segments. Sidewalk edges curve, bend at corners, and have irregular widths, so a single line segment cannot represent them.

closest_point_on_polyline

Finds the closest point on a polyline to a given query point. For each segment in the polyline:

  1. Compute the segment direction vector ab = b - a.
  2. Project the query point onto the infinite line through a and b using the formula t = dot(pt - a, ab) / dot(ab, ab).
  3. Clamp t to [0, 1] so the projection stays on the finite segment (not the infinite extension).
  4. Compute the projected point a + t * ab and its distance to the query point.
  5. Track the minimum distance across all segments.

Returns the closest point, the distance, and the index of the closest segment.
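The steps above can be sketched as follows. This version uses plain (x, y) tuples rather than numpy arrays so it runs without dependencies, and the guard for zero-length segments is an assumption.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def closest_point_on_polyline(pt: Point, polyline: List[Point]) -> Tuple[Optional[Point], float, int]:
    """Return (closest_point, distance, segment_index) for pt against the polyline."""
    px, py = pt
    best_pt, best_dist, best_idx = None, math.inf, -1
    for i in range(len(polyline) - 1):
        ax, ay = polyline[i]
        bx, by = polyline[i + 1]
        abx, aby = bx - ax, by - ay                       # step 1: segment direction
        denom = abx * abx + aby * aby
        if denom < 1e-12:
            t = 0.0                                       # degenerate segment (assumption)
        else:
            t = ((px - ax) * abx + (py - ay) * aby) / denom  # step 2: project
        t = max(0.0, min(1.0, t))                         # step 3: clamp to the segment
        cx, cy = ax + t * abx, ay + t * aby               # step 4: projected point
        dist = math.hypot(px - cx, py - cy)
        if dist < best_dist:                              # step 5: keep the minimum
            best_pt, best_dist, best_idx = (cx, cy), dist, i
    return best_pt, best_dist, best_idx


corner = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0)]
point, dist, seg = closest_point_on_polyline((3.0, 1.0), corner)
```

With the strict `<` comparison, ties between segments resolve to the earlier segment.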

polyline_direction_at

Returns a unit direction vector representing the intended heading (used for the heading reward). The direction is computed as follows:

  1. Project the robot's position onto the polyline to find the closest point (proj_pt).
  2. Walk forward along the polyline from proj_pt for a configurable lookahead distance (default 3 m), stepping through segments and consuming distance until the target point is reached. If the polyline ends before the full distance, the last point is used.
  3. Compute the direction from the robot's actual position to the lookahead target point (3 m ahead by default), and normalize it to a unit vector.

This approach is more stable than using a single segment's direction because it looks ~2 segments ahead (~1.5 m each), smoothing out kinks at segment boundaries. It also naturally accounts for upcoming curves and nudges the robot back toward center if it has drifted off the centerline.
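A dependency-free sketch of the lookahead walk. It inlines its own projection step so the block is self-contained, and returning None for degenerate inputs (fewer than two points, or a target that coincides with the robot) is an assumption rather than documented behaviour.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def _project(pt: Point, a: Point, b: Point) -> Point:
    """Closest point to pt on the finite segment a-b."""
    abx, aby = b[0] - a[0], b[1] - a[1]
    denom = abx * abx + aby * aby
    t = 0.0 if denom < 1e-12 else ((pt[0] - a[0]) * abx + (pt[1] - a[1]) * aby) / denom
    t = max(0.0, min(1.0, t))
    return (a[0] + t * abx, a[1] + t * aby)


def polyline_direction_at(pt: Point, polyline: List[Point],
                          lookahead: float = 3.0) -> Optional[Point]:
    """Unit vector from pt toward the point `lookahead` metres ahead on the polyline."""
    # Step 1: closest point on the polyline and the segment it lies on.
    proj_pt, best_dist, seg_idx = None, math.inf, -1
    for i in range(len(polyline) - 1):
        c = _project(pt, polyline[i], polyline[i + 1])
        d = math.hypot(pt[0] - c[0], pt[1] - c[1])
        if d < best_dist:
            proj_pt, best_dist, seg_idx = c, d, i
    if proj_pt is None:
        return None  # fewer than two points (assumption)

    # Step 2: walk forward from proj_pt, consuming distance segment by segment.
    cur, target, remaining = proj_pt, proj_pt, lookahead
    for j in range(seg_idx + 1, len(polyline)):
        nxt = polyline[j]
        seg_len = math.hypot(nxt[0] - cur[0], nxt[1] - cur[1])
        if seg_len > 0.0 and seg_len >= remaining:
            f = remaining / seg_len
            target = (cur[0] + f * (nxt[0] - cur[0]), cur[1] + f * (nxt[1] - cur[1]))
            break
        remaining -= seg_len
        cur = target = nxt  # polyline may end early: the last point becomes the target

    # Step 3: direction from the robot's actual position to the target.
    dx, dy = target[0] - pt[0], target[1] - pt[1]
    norm = math.hypot(dx, dy)
    return None if norm < 1e-9 else (dx / norm, dy / norm)


centerline = [(0.0, 0.0), (1.5, 0.0), (3.0, 0.0), (4.5, 0.0)]
drifted = polyline_direction_at((0.0, 1.0), centerline)
```

In the usage line the robot sits 1 m to the left of a straight centerline, so the returned direction points forward and slightly back toward the centerline, which is the nudging effect described above.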


Output

Console Summary

After processing all odom timesteps, the script prints:

  • Message counts per topic (to verify data loaded correctly)
  • Reward summary table with mean, standard deviation, min, and max for each component and the total reward
  • Average speed (m/s) and average |acceleration| (m/s²)

CSV Output (optional)

When --csv is provided, writes one row per timestep with columns:

| Column | Description |
|---|---|
| time | Timestamp in seconds (Unix epoch float) |
| r_centering | Centering reward [0, 1] |
| r_heading | Heading alignment reward [−1, 1] |
| r_speed | Speed matching reward [0, 1] |
| r_obstacle | Obstacle penalty [−5, 0] |
| r_jerk | Jerk penalty (−∞, 0] |
| r_acc | Acceleration penalty (−∞, 0] |
| r_collision | Collision penalty [−10, 0] |
| r_total | Sum of all seven components |
| speed | Raw speed in m/s |
| acceleration | Raw acceleration in m/s² |
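As a hedged example of consuming this file, the sketch below reads rows with the standard csv module and flags any row whose r_total does not match the sum of its seven components. It assumes the CSV has a header row with exactly the column names listed above; check_totals and the two sample rows are invented for illustration, and the second row is deliberately inconsistent.

```python
import csv
import io
from typing import List, TextIO

COMPONENTS = ["r_centering", "r_heading", "r_speed", "r_obstacle",
              "r_jerk", "r_acc", "r_collision"]


def check_totals(csv_file: TextIO, tol: float = 1e-6) -> List[float]:
    """Return timestamps of rows where r_total differs from the component sum."""
    bad = []
    for row in csv.DictReader(csv_file):
        total = sum(float(row[name]) for name in COMPONENTS)
        if abs(total - float(row["r_total"])) > tol:
            bad.append(float(row["time"]))
    return bad


# Made-up sample data standing in for rewards_output.csv.
SAMPLE = (
    "time,r_centering,r_heading,r_speed,r_obstacle,r_jerk,r_acc,r_collision,r_total,speed,acceleration\n"
    "1700000000.00,0.8,0.9,0.5,0.0,-0.1,-0.05,0.0,2.05,0.75,0.17\n"
    "1700000000.05,0.2,-0.5,0.0,-5.0,-1.0,-0.3,-10.0,-16.0,0.0,-1.0\n"
)
mismatches = check_totals(io.StringIO(SAMPLE))
```

For a real run, pass an open file handle instead, for example `open("rewards_output.csv", newline="")`.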

Tunable Constants

All reward component constants are defined as global variables at the top of calculate_reward.py for easy modification:

| Constant | Default | Purpose |
|---|---|---|
| SAFE_DIST | 1.0 m | No obstacle penalty above this distance |
| CRITICAL_DIST | 0.2 m | Full obstacle penalty at or below this distance |
| MAX_PENALTY | −5.0 | Maximum obstacle penalty value |
| COLLISION_PENALTY | −10.0 | Severe penalty for collision events |
| JERK_SCALE | 0.5 | Jerk penalty scaling coefficient |
| ACC_SCALE | 0.3 | Acceleration penalty scaling coefficient |
| LOOKAHEAD_DIST | 3.0 m | How far ahead along the centerline to look for heading direction |

Input Validation

The script validates that the MCAP file contains all required topics before processing:

  • Required Topics: All six topics listed in the MCAP Topics Consumed table must be present
  • Error Handling: If any required topics are missing, the script will:
    • Display the missing topics
    • List all available topics found in the file
    • Exit with error code 1
  • File Validation: Also checks that the MCAP file can be read and is not corrupted
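The required-topic check can be sketched as a set difference over the topic names found in the file. How the available topics are collected from the MCAP reader is left out here; missing_topics, validate_topics, and the message wording are hypothetical, while the topic names and exit code come from this README.

```python
import sys
from typing import Iterable, List

REQUIRED_TOPICS = {
    "/cognition/sta_boundary",
    "/odom",
    "/speed_limit",
    "/move_serve/planner_state",
    "/move_serve/imu_jerk_filtered",
    "/move_serve/proximity/state",
}


def missing_topics(available: Iterable[str]) -> List[str]:
    """Required topics that are not present in the file, sorted for stable output."""
    return sorted(REQUIRED_TOPICS - set(available))


def validate_topics(available: Iterable[str]) -> None:
    """Exit with code 1, listing missing and available topics, if any are absent."""
    available = list(available)
    missing = missing_topics(available)
    if missing:
        print("Missing required topics: " + ", ".join(missing), file=sys.stderr)
        print("Available topics: " + ", ".join(sorted(available)), file=sys.stderr)
        sys.exit(1)


# A file that only contains odometry is missing the other five topics.
absent = missing_topics(["/odom", "/tf"])
```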
