[Paper Notes] EquiBim: Learning Symmetry-Equivariant Policy for Bimanual Manipulation
TL;DR
EquiBim is a simple but useful idea for bimanual imitation learning: if a task is left-right symmetric, then the policy should behave consistently when the observation is mirrored and the two arms are swapped. Instead of building a special equivariant neural architecture, the paper adds a prediction-level regularization term:
\[ L_{sym} = \|\pi(S(O)) - S(\pi(O))\|_2^2 \]
This makes EquiBim model-agnostic. It can be attached to image policies, point-cloud policies, joint-space actions, or end-effector actions, as long as the symmetry transform \(S\) is defined for both observations and actions. In simulation on RoboTwin and real-world experiments on a dual LeRobot SO101 setup, the method improves average success and robustness under mirrored or shifted object distributions.
From the codebase side, the repository implements the idea inside LeRobot’s ACT policy: mirror the observation, run the same policy again, mirror the original prediction, and penalize the mismatch.
Paper Info
- Title: EquiBim: Learning Symmetry-Equivariant Policy for Bimanual Manipulation
- Authors: Zhiyuan Zhang, Aditya Mohan, Seungho Han, Wan Shou, Dongyi Wang, Yu She
- Affiliations: Purdue University, University of Arkansas
- arXiv: 2603.08541
- Project page: zhangzhiyuanzhang.github.io/equibim-website
- Codebase: a LeRobot-based implementation for bimanual SO101 manipulation and ACT symmetry loss
1. Motivation
Bimanual robots have a structural prior that many learned policies only use implicitly: the left and right arms are often physically symmetric, and many tasks remain equivalent if we mirror the workspace and exchange the two arms.
Standard behavior cloning does not enforce this. If the dataset has more clean demonstrations on one side than the other, or if the test object appears in a mirrored pose, a policy can produce inconsistent left/right behavior even when the mirrored strategy should be valid. This is especially visible in dual-arm manipulation because coordination errors are not just visual mistakes; they become timing, grasping, and role-assignment failures.
EquiBim’s claim is that this symmetry should be an explicit training signal. The nice part is that the signal lives at the policy-output level, so it does not require redesigning the backbone.
2. Method
The policy is trained by behavior cloning. Given an observation history \(O\), the policy predicts a future action sequence:
\[ \pi: O \rightarrow A \]
EquiBim defines a symmetry transformation \(S\) over both observation and action spaces. For a bimanual setup, \(S\) corresponds to a left-right reflection plus an exchange of the two arms.
The desired equivariance is:
\[ \pi(S(O)) \approx S(\pi(O)) \]
So the symmetry loss is:
\[ L_{sym} = \|\pi(S(O)) - S(\pi(O))\|_2^2 \]
The full training objective keeps the ordinary imitation loss and adds this consistency term. Importantly, both branches use the same policy parameters. The model is not asked to learn a new task; it is asked to be self-consistent under a physically meaningful transformation.
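Written out, the combined objective has roughly the following shape. The symbols \(L_{BC}\) (the imitation loss), \(\lambda\) (the symmetry weight), and \(\theta\) (the shared policy parameters) are notation I am introducing for illustration; the paper's own notation may differ.

\[ L(\theta) = L_{BC}(\theta) + \lambda \, L_{sym}(\theta), \qquad L_{sym}(\theta) = \|\pi_\theta(S(O)) - S(\pi_\theta(O))\|_2^2 \]

Both terms depend on the same \(\theta\), which is what "self-consistent" means here. In the codebase discussed later in this post, the role of \(\lambda\) is played by eq_loss_weight.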
3. Symmetry Across Modalities
The paper handles several observation/action combinations:
| Component | Symmetry transform |
|---|---|
| RGB image | horizontal flip |
| Point cloud | transform to image/camera-aligned frame, reflect along lateral axis, transform back |
| End-effector pose | reflect position and orientation consistently in the control frame |
| Joint action | swap left/right arms and apply joint-specific sign flips from the robot kinematics |
This is the main reason the method is practical. The regularizer is the same, while the implementation of \(S\) changes with the sensor and action representation.
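To make the point-cloud row concrete, here is a minimal NumPy sketch of the "transform, reflect, transform back" recipe. It is my own illustration, not the paper's code: the function name `reflect_point_cloud`, the 4x4 matrix `T_cam_from_world`, and the choice of axis 0 as the lateral axis are all assumptions.

```python
import numpy as np

def reflect_point_cloud(points, T_cam_from_world, lateral_axis=0):
    """Mirror an (N, 3) point cloud across the plane orthogonal to the
    lateral axis of a camera-aligned frame (hypothetical helper)."""
    n = points.shape[0]
    homo = np.hstack([points, np.ones((n, 1))])   # (N, 4) homogeneous coordinates
    in_cam = homo @ T_cam_from_world.T            # express points in the camera-aligned frame
    in_cam[:, lateral_axis] *= -1.0               # reflect along the lateral axis
    T_world_from_cam = np.linalg.inv(T_cam_from_world)
    back = in_cam @ T_world_from_cam.T            # return to the original frame
    return back[:, :3]

# Example with an assumed frame: 30 degree rotation about z plus a translation.
angle = np.deg2rad(30.0)
T = np.eye(4)
T[:3, :3] = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                      [np.sin(angle),  np.cos(angle), 0.0],
                      [0.0,            0.0,           1.0]])
T[:3, 3] = [0.2, -0.1, 0.5]

rng = np.random.default_rng(0)
cloud = rng.normal(size=(100, 3))
mirrored = reflect_point_cloud(cloud, T)
```

A quick sanity check for any such \(S\) is that applying it twice returns the original input, since a reflection is its own inverse.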
4. Results
RoboTwin Simulation
The paper evaluates eight symmetric bimanual tasks in RoboTwin: Beat Block Hammer, Click Alarmclock, Handover Block, Move Can Pot, Pick Dual Bottles, Place Empty Cup, Stamp Seal, and Press Stapler.
Average success rates improve across all tested observation/action settings:
| Backbone / Setting | Baseline | + EquiBim | Gain |
|---|---|---|---|
| DP, Image + Joint | 34.1 | 43.6 | +9.5 |
| DP, Image + EE | 37.3 | 40.0 | +2.7 |
| DP3, Point Cloud + Joint | 73.5 | 77.9 | +4.4 |
| DP3, Point Cloud + EE | 74.5 | 77.8 | +3.3 |
The biggest gain is in the weakest geometric setting, image observations plus joint-space actions. That makes intuitive sense: images do not explicitly encode 3D structure, and joint actions are less directly spatial than end-effector poses. The symmetry loss supplies a missing structural prior.
The paper also reports that some tasks can get worse under EquiBim, especially when the best strategy relies on role asymmetry. Handover Block and Pick Dual Bottles are examples where timing, contact, or grasp ordering can make a strictly mirrored policy the wrong choice. This is an important caveat: symmetry regularization helps when task-level symmetry dominates, but it can fight the data when the task only looks symmetric geometrically.
Real-World SO101 Experiments
The real-world setup uses two LeRobot SO101 arms with a centered Logitech C920x camera. The top-down camera arrangement makes horizontal image flips line up naturally with the workspace’s left-right direction.
The paper evaluates:
- Banana Handover
- Drumstick Hook Hanging
- Toy Chicken Hook Hanging
With 50 demonstrations per task and 10 evaluation trials per object, ACT + EquiBim improves robustness, especially under distribution shifts:
| Task / Distribution | ACT | ACT + EquiBim |
|---|---|---|
| Banana, training distribution | 3/10 | 6/10 |
| Banana, shifted distribution | 0/10 | 5/10 |
| Drumstick, shifted distribution | 1/10 | 4/10 |
| Toy Chicken, shifted distribution | 4/10 | 6/10 |
The banana result is the clearest: when the object orientation and side placement are mirrored relative to training, vanilla ACT fails in all 10 trials, while the version trained with EquiBim still succeeds in 5 of 10.
5. Codebase Reading
The repository is built on LeRobot and adds a practical bimanual SO101 workflow:
- bimanual_teleop.py
- bimanual_teleop_camera.py
- bimanual_data_collection.py
- train_bimanual.sh
- bimanual_capture_home_pose.py
- bimanual_inference.py
The symmetry switch is exposed in the ACT config:
use_sym_loss: bool = False
eq_loss_weight: float = 0.1
The main implementation lives in src/lerobot/policies/act/modeling_act.py.
The code defines a 6-dimensional per-arm sign vector:
JOINT_SIGN = torch.tensor([-1, +1, +1, +1, -1, +1])
Then mirror_state and mirror_action split the 12-dimensional bimanual vector into left and right halves, apply the sign transform, and concatenate the swapped result:
[left_arm, right_arm] -> [signed_right_arm, signed_left_arm]
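The swap-and-sign step can be sketched as follows. This is my reconstruction from the description above, not a copy of the repository code: `mirror_joint_vector` is a hypothetical stand-in for mirror_state / mirror_action, and I am assuming the left arm occupies the first six entries.

```python
import torch

# Per-arm sign pattern quoted in the text above (stored as floats here).
JOINT_SIGN = torch.tensor([-1.0, +1.0, +1.0, +1.0, -1.0, +1.0])

def mirror_joint_vector(x: torch.Tensor) -> torch.Tensor:
    """Swap the two arm halves of a (..., 12) tensor and apply per-joint signs.

    Hypothetical helper; assumes layout [left_arm(6), right_arm(6)].
    """
    left, right = x[..., :6], x[..., 6:]
    sign = JOINT_SIGN.to(dtype=x.dtype, device=x.device)
    # The mirrored "left" slot is filled by the signed right arm, and vice versa.
    return torch.cat([right * sign, left * sign], dim=-1)

x = torch.arange(12, dtype=torch.float32).reshape(1, 12)
mirrored = mirror_joint_vector(x)
```

Because every sign is ±1 and the halves are swapped, applying the function twice gives back the input, which is a cheap unit test for any hand-written sign table.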
For images, the implementation simply flips along the width dimension:
torch.flip(img, dims=[-1])
During training, ACT first computes the ordinary L1 action loss. If use_sym_loss is enabled, it builds a mirrored batch, predicts actions on that mirrored input, and adds:
eq_loss = mse(mirror_action(actions_hat.detach()), actions_hat_sym)
loss = l1_loss + eq_loss_weight * eq_loss
The detach() is a small but meaningful implementation choice: the original prediction is treated as a fixed target for the mirrored branch. Gradients from the symmetry term therefore flow only through the mirrored forward pass, so the two predictions cannot reduce the loss simply by drifting toward each other.
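Putting the pieces together, here is a toy version of the training step. It is a sketch under loud assumptions: a single `nn.Linear` stands in for ACT, the input is a 12-dimensional state with no images, and `mirror_joints` is a hypothetical helper that follows the swap-and-sign description above. Only the structure of the loss mirrors what the post describes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

JOINT_SIGN = torch.tensor([-1.0, 1.0, 1.0, 1.0, -1.0, 1.0])

def mirror_joints(x):
    # Hypothetical helper: swap arm halves and apply per-joint signs.
    left, right = x[..., :6], x[..., 6:]
    return torch.cat([right * JOINT_SIGN, left * JOINT_SIGN], dim=-1)

torch.manual_seed(0)
policy = nn.Linear(12, 12)      # toy stand-in for ACT: state -> action
state = torch.randn(8, 12)      # fake batch of bimanual joint states
target = torch.randn(8, 12)     # fake demonstration actions
eq_loss_weight = 0.1            # default weight quoted from the config

# Ordinary imitation branch.
actions_hat = policy(state)
l1_loss = F.l1_loss(actions_hat, target)

# Mirrored branch: same parameters, mirrored input.
actions_hat_sym = policy(mirror_joints(state))

# The mirrored original prediction is a fixed target (no gradient through it).
sym_target = mirror_joints(actions_hat.detach())
eq_loss = F.mse_loss(actions_hat_sym, sym_target)

loss = l1_loss + eq_loss_weight * eq_loss
loss.backward()
```

In this toy setup the symmetry target carries no gradient, while the mirrored prediction does, which matches the one-directional regularization described above.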
6. Strengths and Limitations
Strengths. EquiBim is easy to add to existing imitation learning systems, and the inductive bias is physically meaningful. The method also gives a concrete recipe for using stronger demonstrations on one side to regularize weaker demonstrations on the other side. For low-cost bimanual platforms where data is limited, this is a very practical advantage.
Limitations. The method depends on the symmetry transform being correct. If the camera is not centered, the robot mounting is not symmetric, the object interaction is role-specific, or the task has real left/right asymmetry, the loss can penalize useful behavior. The paper acknowledges this by reporting the per-task drops in simulation. Also, the real-world evaluation is promising but still small: 10 trials per object and three task families.
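To get a feel for how much uncertainty 10 trials leave, one option is a Wilson score interval on each success count. This is my own back-of-the-envelope check, not an analysis from the paper; the function name and the 95% level (z = 1.96) are my choices.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (approx. 95% for z=1.96)."""
    p = successes / trials
    denom = 1.0 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)

# Shifted-distribution banana results from the table: ACT 0/10, ACT + EquiBim 5/10.
act_low, act_high = wilson_interval(0, 10)
equibim_low, equibim_high = wilson_interval(5, 10)
print(f"ACT:           [{act_low:.2f}, {act_high:.2f}]")
print(f"ACT + EquiBim: [{equibim_low:.2f}, {equibim_high:.2f}]")
```

The intervals are wide at this sample size, which is why I read the real-world numbers as encouraging evidence rather than a precise estimate of the gain.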
7. Takeaways
EquiBim is a good reminder that not every robotics improvement needs a larger model. Sometimes the right move is to encode a physical prior that the robot already gives us for free.
For bimanual learning, bilateral symmetry is one of those priors. The paper’s contribution is to turn it into a model-agnostic consistency loss that works across modalities and action spaces. The codebase makes the same point in a very direct way: mirror the batch, predict again, mirror the original prediction, and penalize disagreement.
For practice, I would treat EquiBim as a strong default when:
- the hardware is physically symmetric,
- the camera frame is aligned with the left-right workspace axis,
- the task allows arm-role exchange,
- the dataset is small or imbalanced across sides.
I would be more cautious when the task has hidden asymmetry in timing, force, grasp order, object affordance, or demonstration convention. In those cases, symmetry is still useful, but probably needs a schedule, a lower weight, or task-aware gating.
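As one concrete reading of "a schedule, a lower weight, or task-aware gating", here is a small sketch. None of this comes from the paper or the repository: the function `sym_weight`, the linear warm-up, and the boolean `task_is_symmetric` flag are all hypothetical, and the default maximum of 0.1 simply reuses the eq_loss_weight default quoted earlier.

```python
def sym_weight(step, warmup_steps=1000, max_weight=0.1, task_is_symmetric=True):
    """Hypothetical symmetry-loss weight with linear warm-up and a task gate.

    Assumes warmup_steps > 0.
    """
    if not task_is_symmetric:
        return 0.0                      # gate the regularizer off for role-asymmetric tasks
    frac = min(1.0, step / warmup_steps)
    return max_weight * frac            # ramp from 0 to max_weight, then hold

# Example: weight at a few training steps.
schedule = [sym_weight(s) for s in (0, 500, 1000, 5000)]
```

The idea is to let the imitation loss shape the policy first and only then tighten the symmetry constraint, with a hard off-switch for tasks where arm roles should not be exchanged.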
