[Paper Notes] Morphologically Symmetric Reinforcement Learning for Ambidextrous Bimanual Manipulation
TL;DR
SYMDEX asks a clean question: if a bimanual robot is physically symmetric, why should one arm have to relearn what the other arm already discovered? The paper turns bilateral morphology into an RL prior for ambidextrous manipulation. It decomposes a complex bimanual task into per-hand subtasks, trains each subtask policy with a symmetry-equivariant actor and symmetry-invariant critic, then distills those specialist policies into a global ambidextrous policy.
The key design is not merely data augmentation. SYMDEX encodes the robot’s reflection group directly into the policy class:
\[ g \triangleright_A \pi(o) = \pi(g \triangleright_O o) \]
In practice, this means a reflected scene should produce a correspondingly reflected action. On six Isaac Lab bimanual dexterous tasks, SYMDEX reaches more than 80% success and outperforms PPO baselines. The authors also report zero-shot real-world transfer on box-lift and table-clean, with curriculum learning playing a major role.
Paper Info
- Title: Morphologically Symmetric Reinforcement Learning for Ambidextrous Bimanual Manipulation
- Authors: Zechu Li, Yufeng Jin, Daniel Ordonez Apraez, Claudio Semini, Puze Liu, Georgia Chalvatzaki
- Venue: CoRL 2025
- arXiv: 2505.05287
- Project page: supersglzc.github.io/projects/symdex
- Codebase read here: PyTorch / Isaac Lab implementation of SYMDEX with ESCNN and MorphoSymm
1. Motivation
Humans can mirror many gross manipulation skills between left and right hands, but fine dexterity often develops a dominant side. Robots do not need this kind of handedness. If the hardware is bilaterally symmetric, a robot should be able to choose whichever arm is better placed for the current scene.
The problem is that bimanual RL is hard in exactly the places where this symmetry would help:
- the observation and action spaces are high-dimensional,
- both arms contribute to one task-level outcome,
- reward shaping becomes messy when one arm succeeds and the other fails,
- ambidexterity turns a fixed two-arm controller into a task-assignment problem.
SYMDEX addresses this by making the learning problem smaller and more structured: learn one policy per subtask, enforce morphology-aware equivariance, then distill the behavior into a single deployable policy.
2. Method
The paper formulates bimanual manipulation as a multi-task, multi-agent POMDP. Each robot arm is an agent, and each manipulation role is a subtask. In the running example, one arm holds a bowl while the other operates an egg beater. Under a left-right reflection, the roles should swap.
For a symmetry group \(G\), the paper assumes the POMDP dynamics, rewards, and initial-state distribution are invariant under group actions. This gives the usual equivariance condition on the optimal policy and invariance condition on the optimal value function, where \(\sigma(s)\) denotes the observation emitted in state \(s\) and both conditions hold for every \(g \in G\):
\[ g \triangleright_A \pi^*(\sigma(s)) = \pi^*(\sigma(g \triangleright_S s)) \]
\[ V^*(\sigma(s)) = V^*(\sigma(g \triangleright_S s)) \]
SYMDEX uses this structure in three steps.
Subtask Decomposition
Instead of training one monolithic 44-DoF bimanual policy, SYMDEX trains 22-DoF single-arm policies. Each policy sees the assigned arm state and task-specific object state. This reduces action dimensionality and gives each policy a cleaner reward signal.
Equivariant PPO
Each subtask actor is a \(G\)-equivariant neural network, while each critic is \(G\)-invariant. The actor should transform actions consistently when observations are reflected; the critic should assign the same value to symmetric states.
The intuition is simple: a left hand reaching in a mirrored workspace should behave like a transformed version of the right hand reaching in the original workspace.
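To make the two constraints concrete, here is a minimal sketch for the two-element reflection group, assuming a toy 2-D observation and action. It uses group averaging (symmetrization) of an arbitrary function, which is a different and simpler construction than the equivariant layers used in the SYMDEX codebase. The names `reflect`, `raw_policy`, and `raw_value` are illustrative only.

```python
def reflect(v):
    """Toy reflection g: negate the lateral (first) coordinate. g applied twice is the identity."""
    return [-v[0], v[1]]

def raw_policy(obs):
    """An arbitrary map from observation to action with no built-in symmetry."""
    x, y = obs
    return [0.7 * x + 0.2 * y + 0.1, -0.3 * x + 0.5 * y * y]

def raw_value(obs):
    """An arbitrary scalar value estimate with no built-in symmetry."""
    x, y = obs
    return 1.5 * x + y * y

def equivariant_policy(obs):
    """Group average over {e, g}: pi(o) = 1/2 * (f(o) + g f(g o)), using g^-1 = g."""
    direct = raw_policy(obs)
    mirrored = reflect(raw_policy(reflect(obs)))
    return [0.5 * (a + b) for a, b in zip(direct, mirrored)]

def invariant_value(obs):
    """Group average over {e, g}: V(o) = 1/2 * (v(o) + v(g o))."""
    return 0.5 * (raw_value(obs) + raw_value(reflect(obs)))

obs = [0.4, -1.2]
lhs = reflect(equivariant_policy(obs))   # g applied to the action
rhs = equivariant_policy(reflect(obs))   # policy applied to the reflected observation
```

Here `lhs` and `rhs` agree up to floating-point error, and `invariant_value` returns the same number for `obs` and `reflect(obs)`, while `raw_policy` and `raw_value` satisfy neither property. Building the constraint into the layers, as the paper does, avoids the doubled forward pass that averaging requires.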
Global Policy Distillation
After the subtask policies are trained, they generate a dataset of state-action pairs. A student policy is then trained to imitate the combined behavior. This student is also equivariant, but unlike the subtask policies, it observes the global non-privileged state and learns the task-arm assignment implicitly.
This is the deployable policy: one global ambidextrous controller, trained from specialist teachers.
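The distillation step is, at its core, behavior cloning from the specialists. A minimal sketch, assuming two made-up scalar teachers and a linear student fitted by SGD on squared imitation error, looks like the following. It omits the equivariant student architecture and the non-privileged observations described above, and `teacher_hold` and `teacher_stir` are hypothetical stand-ins for trained subtask policies.

```python
import random

random.seed(0)

def teacher_hold(obs):
    """Hypothetical specialist for subtask A (acts on its own local observation)."""
    return 0.8 * obs

def teacher_stir(obs):
    """Hypothetical specialist for subtask B."""
    return -0.5 * obs

# 1) Roll out the teachers to collect (global state, joint action) pairs.
dataset = []
for _ in range(200):
    state = [random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)]
    dataset.append((state, [teacher_hold(state[0]), teacher_stir(state[1])]))

# 2) Fit a linear student a = W s that sees the whole state at once.
W = [[0.0, 0.0], [0.0, 0.0]]

def student(state):
    return [W[i][0] * state[0] + W[i][1] * state[1] for i in range(2)]

def imitation_loss():
    """Mean squared error between student and teacher actions over the dataset."""
    total = 0.0
    for state, target in dataset:
        total += sum((p - t) ** 2 for p, t in zip(student(state), target))
    return total / len(dataset)

loss_before = imitation_loss()
learning_rate = 0.1
for _ in range(50):                      # epochs
    for state, target in dataset:
        pred = student(state)
        for i in range(2):
            err = pred[i] - target[i]
            for j in range(2):
                W[i][j] -= learning_rate * err * state[j]
loss_after = imitation_loss()
```

Because the toy teachers are linear, the student recovers them almost exactly and `loss_after` falls far below `loss_before`. The point of the sketch is only the data flow: specialists label states, and a single global model regresses onto their joint action.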
3. Curriculum for Sim-to-Real
The paper uses a curriculum with two practical pieces:
- Randomization curriculum: begin with scene-level symmetry randomization, then gradually introduce object pose and physical-parameter variation.
- Safety curriculum: introduce collision and energy penalties later, after the policy has learned useful task behavior.
The curriculum matters for transfer. In the real-world evaluation, the success rate of an equivariant Gaussian policy trained without it drops sharply, while the curriculum-trained version transfers much better.
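The two curriculum pieces can be pictured as a schedule over training progress. The sketch below is one hypothetical way to write such a schedule; the stage boundaries (20%, 60%, 90%) and all magnitudes are invented for illustration and are not taken from the paper or the repository.

```python
def clamp01(x):
    """Clip a number to the interval [0, 1]."""
    return min(max(x, 0.0), 1.0)

def curriculum(progress):
    """Map training progress in [0, 1] to randomization and penalty settings.

    All thresholds and magnitudes here are hypothetical.
    """
    progress = clamp01(progress)
    # Randomization curriculum: mirrored scenes from the start, then ramp
    # object-pose and physical-parameter variation between 20% and 60%.
    rand_scale = clamp01((progress - 0.2) / 0.4)
    # Safety curriculum: collision and energy penalties ramp in only after
    # 60% of training, once useful task behavior exists.
    penalty_scale = clamp01((progress - 0.6) / 0.3)
    return {
        "symmetric_scenes": True,
        "pose_noise": 0.05 * rand_scale,
        "physics_noise": 0.3 * rand_scale,
        "collision_penalty": 1.0 * penalty_scale,
        "energy_penalty": 0.01 * penalty_scale,
    }

early, middle, late = curriculum(0.0), curriculum(0.5), curriculum(1.0)
```

With these made-up numbers, `early` has no pose or physics noise and no penalties, `middle` has partial randomization but still no penalties, and `late` has everything at full strength. Delaying the penalties keeps them from discouraging exploration before the policy has found the task.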
4. Experiments
SYMDEX is evaluated on six simulated Isaac Lab tasks:
| Task | Main challenge |
|---|---|
| Box-lift | coordinated symmetric lifting |
| Table-clean | two-arm sweeping / object handling |
| Drawer-insert | asymmetric object and drawer roles |
| Threading | coordinated precise insertion |
| Bowl-stir | one arm stabilizes, the other manipulates |
| Handover | role-specific grasp and transfer |
The paper compares against five PPO-style baselines: monolithic equivariant PPO, independent PPO, equivariant independent PPO, a centralized-critic variant, and a symmetry-augmentation variant.
The headline simulation result is that SYMDEX learns all six tasks and exceeds 80% success, while the baselines struggle, particularly on tasks where the two arms must perform different roles. This supports two claims at once: task decomposition helps credit assignment, and architectural equivariance is stronger than only augmenting data.
For distillation, the paper compares three student policy classes (entries are per-task success rates):
| Student | Box | Table | Drawer | Threading | Bowl | Handover |
|---|---|---|---|---|---|---|
| Gaussian policy | 0.83 | 0.74 | 0.69 | 0.62 | 0.75 | 0.54 |
| Equivariant Gaussian policy | 0.89 | 0.83 | 0.87 | 0.63 | 0.87 | 0.86 |
| Equivariant Diffusion policy | 0.91 | 0.84 | 0.87 | 0.60 | 0.88 | 0.68 |
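Both equivariant students improve over the vanilla Gaussian student on five of the six tasks. Threading is the exception: all three students sit between 0.60 and 0.63, and the diffusion student is slightly below the vanilla one (0.60 vs 0.62). The diffusion student also trails the equivariant Gaussian student on handover (0.68 vs 0.86). Interestingly, the Gaussian equivariant student is more robust than the diffusion variant in the real world, which the authors attribute to the homogeneous teacher-generated dataset and imperfect state estimation at deployment time.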
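For a quick aggregate view of the table, the snippet below computes each student's unweighted mean over the six tasks and its per-task gain over the vanilla Gaussian student. The averaging is my own summary of the numbers above, not a statistic reported in the paper.

```python
tasks = ["Box", "Table", "Drawer", "Threading", "Bowl", "Handover"]
success = {
    "Gaussian": [0.83, 0.74, 0.69, 0.62, 0.75, 0.54],
    "Equivariant Gaussian": [0.89, 0.83, 0.87, 0.63, 0.87, 0.86],
    "Equivariant Diffusion": [0.91, 0.84, 0.87, 0.60, 0.88, 0.68],
}

# Unweighted mean success across the six tasks.
mean_success = {name: sum(vals) / len(vals) for name, vals in success.items()}

# Per-task difference relative to the vanilla Gaussian student.
baseline = success["Gaussian"]
gain = {
    name: {t: round(v - b, 2) for t, v, b in zip(tasks, vals, baseline)}
    for name, vals in success.items()
    if name != "Gaussian"
}

for name, value in mean_success.items():
    print(f"{name}: {value:.3f}")
```

The means come out to 0.695, 0.825, and roughly 0.797. The largest single gain is the equivariant Gaussian student on handover (+0.32), and the only negative entry is the diffusion student on threading (-0.02).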
5. Codebase Reading
The repository is a compact Isaac Lab project. The public entry points are straightforward:
train.py # train SYMDEX with Hydra configs and W&B logging
visualize.py # load saved actors and execute policies in simulation
random_actions.py
symdex/cfg/
symdex/env/tasks/
symdex/algo/
symdex/utils/
The README exposes six tasks:
insertDrawer, boxLift, pickObject, stirBowl, threading, handover
The default training command is:
python train.py task=insertDrawer save_model=True
The most important implementation pieces are:
Symmetry Configuration
symdex/cfg/task/base.yaml defines the reflection group:
group_label: C2
symmetric_envs: True
permutation_Q_js: ...
reflection_Q_js: ...
permutation_student_Q_js: ...
reflection_student_Q_js: ...
For the single-arm policy, the joint representation keeps the 22-DoF order and applies joint-specific sign flips. For the student/global policy, permutation_student_Q_js swaps the two 22-DoF halves, while reflection_student_Q_js applies the corresponding signs.
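A permutation list plus a sign list is a compact way to describe how the reflection acts on joint space. The sketch below illustrates the idea with 3 joints per arm instead of 22. The sign pattern and the convention `q_new[i] = signs[i] * q[perm[i]]` are assumptions made for illustration; the repository's actual values and indexing convention should be read from its config.

```python
def make_reflection(perm, signs):
    """Return a function computing q_new[i] = signs[i] * q[perm[i]] (assumed convention)."""
    def apply(q):
        return [s * q[p] for p, s in zip(perm, signs)]
    return apply

N = 3                      # toy stand-in for the 22 joints of one arm-hand
arm_signs = [-1, 1, -1]    # hypothetical: which joints change sign under mirroring

# Single-arm policy: joint order is kept, only the signs flip.
reflect_arm = make_reflection(list(range(N)), arm_signs)

# Student/global policy: swap the two halves, then apply the same signs per half.
swap_halves = list(range(N, 2 * N)) + list(range(N))
reflect_student = make_reflection(swap_halves, arm_signs * 2)

q_first, q_second = [0.1, 0.2, 0.3], [-0.4, 0.5, 0.6]
q_global = q_first + q_second
mirrored = reflect_student(q_global)
```

In this toy setup `mirrored` equals `reflect_arm(q_second) + reflect_arm(q_first)`, so the global reflection is just the single-arm reflection combined with an arm swap. Applying `reflect_student` twice returns `q_global`, which is the defining property of a two-element reflection group.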
Equivariant Networks
symdex/utils/symmetry.py builds the ESCNN group and registers representations for joint space, tangent joint space, Euclidean vectors, pseudo-vectors, and flattened rotations. symdex/algo/network/emlp.py then uses those field types to build equivariant MLPs.
The implementation is particularly nice because it handles both actor and critic cases:
- if the output representation is non-trivial, the EMLP is equivariant;
- if the output is trivial, the EMLP pools invariant features and behaves as an invariant function.
That matches the paper’s actor/critic split almost directly.
PPO Agent
symdex/algo/eqs.py defines AgentSYMDEX. It creates two actor-critic pairs:
actor, critic
actor_left, critic_left
When same_policy is enabled, these can share parameters. Otherwise they are optimized separately, which mirrors the paper’s dedicated subtask-policy setup.
During rollout, the agent:
- reads the environment’s symmetry_tracker,
- slices each subtask observation through SymmetryManager.get_multi_agent_obs,
- samples actions from the two actors,
- combines or swaps actions through get_execute_action,
- splits detailed rewards back into subtask rewards through get_multi_agent_rew,
- runs PPO updates for the right and left buffers.
This is the code-level version of the paper’s MTMA-POMDP decomposition.
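The rollout steps listed above can be pictured with a toy version of the data flow. This is a sketch of one plausible reading, not the repository's code: `rollout_step` and `split_rewards` are hypothetical helpers with invented signatures, the observation layout is assumed to be right arm followed by left arm, and the `_a` / `_b` reward-name suffixes are a made-up convention. The sign flips of the reflection are left to the equivariant actors and are not modeled here.

```python
ARM_DIM = 2  # toy stand-in for one arm's observation/action slice

def rollout_step(global_obs, mirrored, policy_a, policy_b):
    """One control step; obs and command are laid out as [right arm | left arm]."""
    obs_right, obs_left = global_obs[:ARM_DIM], global_obs[ARM_DIM:]
    # Role assignment: in the original scene the right arm performs subtask A,
    # in a mirrored scene the left arm does.
    if mirrored:
        obs_a, obs_b = obs_left, obs_right
    else:
        obs_a, obs_b = obs_right, obs_left
    act_a, act_b = policy_a(obs_a), policy_b(obs_b)
    # Reassemble the command in [right | left] order, swapping when mirrored.
    return act_b + act_a if mirrored else act_a + act_b

def split_rewards(reward_terms):
    """Sum named reward terms per subtask using the made-up '_a' / '_b' suffixes."""
    r_a = sum(v for k, v in reward_terms.items() if k.endswith("_a"))
    r_b = sum(v for k, v in reward_terms.items() if k.endswith("_b"))
    return r_a, r_b

def hold(obs):   # hypothetical subtask-A actor
    return [2.0 * x for x in obs]

def stir(obs):   # hypothetical subtask-B actor
    return [-x for x in obs]

obs = [1.0, 2.0, 3.0, 4.0]
cmd_original = rollout_step(obs, False, hold, stir)
cmd_mirrored = rollout_step(obs, True, hold, stir)
rewards = split_rewards({"reach_a": 0.5, "grasp_a": 0.25, "stir_b": 1.0})
```

In the original scene the right-arm slice of the command comes from `hold`, while in the mirrored scene it comes from `stir`, so each subtask buffer always receives the observations, actions, and rewards of the arm that actually played that role.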
Symmetric Environments
Task YAML files such as insertDrawer.yaml, stirBowl.yaml, threading.yaml, and handover.yaml define both original and _symmetry reward terms, plus single_agent_obs_idx_symmetry and single_agent_rew_symmetry. The environment can therefore train on original and reflected configurations while giving each subtask policy the right observation and reward slice.
6. Strengths
The best part of SYMDEX is that it treats symmetry as a control prior, not as a dataset trick. The robot’s morphology constrains the policy class, so symmetric configurations are tied together by construction. This is exactly the kind of inductive bias that can make RL less wasteful.
The task decomposition is also practical. A single global policy must solve exploration, credit assignment, role specialization, and symmetry at the same time. SYMDEX separates those concerns: train specialists first, then distill.
Finally, the paper's four-arm extension is conceptually important. The symmetry group changes from bilateral reflection \(C_2\) to a rotational group \(C_4\), but the learning recipe remains the same. That suggests the framework is more general than a hand-coded left-right swap.
7. Limitations
The method depends on real symmetry. If the hardware, sensors, task roles, or object affordances are not actually symmetric, the inductive bias can become a constraint in the wrong direction.
The paper also works mostly with state-based policies. In real-world failures, perception is a major bottleneck because the controller depends on accurate multi-object pose tracking. The authors mention RGB-D and point-cloud equivariant models as future directions, and that feels like the right next step.
There is also a pipeline cost: subtask decomposition, reward design, symmetry-field configuration, and distillation are all extra engineering. SYMDEX pays that cost to make difficult bimanual RL tractable, but it is not a plug-and-play method for arbitrary manipulation tasks.
8. Takeaways
SYMDEX is a strong example of morphology-aware learning: instead of asking a network to rediscover left-right structure from rollouts, encode that structure in the policy and value function.
For practice, I would reach for this recipe when:
- the robot has clear morphological symmetry,
- the task can be decomposed into meaningful arm-level subtasks,
- mirrored initial states should imply mirrored optimal actions,
- exploration and reward assignment are the main bottlenecks,
- sim-to-real robustness matters enough to justify a curriculum.
I would be more cautious when the task has hidden asymmetry in tooling, object affordances, perception, or safety constraints. In those cases, symmetry may still help, but it probably needs selective application rather than a blanket architectural prior.
