LessMimic: Long-Horizon Humanoid Interaction with Unified Distance Field Representations

Lin, Yutang; Cui, Jieming; Li, Yixuan; Jia, Baoxiong; Zhu, Yixin; Huang, Siyuan


Yutang Lin^{*, 1, 2, 3, 5, 6, 7} Jieming Cui^{*, 1, 2, 3, 5, 6, 7} Yixuan Li^{4, 2, 5} Baoxiong Jia^{†, 2, 5} Yixin Zhu^{†, 3, 1, 5, 6, 7} Siyuan Huang^{†, 2, 5}

^*Equal contribution ^†Corresponding author

¹Institute for AI, Peking University

²Beijing Institute for General Artificial Intelligence (BIGAI)

³School of Psychological and Cognitive Sciences, Peking University

⁴School of Computer Science and Technology, Beijing Institute of Technology

⁵State Key Lab of General AI

⁶Beijing Key Laboratory of Behavior and Mental Health, Peking University

⁷Embodied Intelligence Lab, PKU-Wuhan Institute for Artificial Intelligence


Key Contributions

Generalize across objects

The same policy lifts and manipulates objects with diverse geometry (e.g., 23 cm box to 60 cm cylinder) by operating on a distance-field representation.
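To illustrate why a distance-field input can be indifferent to object scale, here is a minimal sketch using an analytic signed distance function for an axis-aligned box. The function names, the cube shape, and the query point are illustrative assumptions, not the representation used in the paper.

```python
import math

def box_sdf(p, half_extents):
    """Signed distance from point p to an axis-aligned box centered at the origin."""
    q = [abs(pi) - hi for pi, hi in zip(p, half_extents)]
    outside = math.sqrt(sum(max(qi, 0.0) ** 2 for qi in q))  # distance when p is outside
    inside = min(max(q), 0.0)                                 # negative depth when p is inside
    return outside + inside

def scaled_box_sdf(p, half_extents, scale):
    """Distance to the same box after uniform scaling by `scale`."""
    return scale * box_sdf([pi / scale for pi in p], half_extents)

# Hypothetical setup: a 23 cm cube and a hand position 30 cm from its center.
half = (0.115, 0.115, 0.115)
hand = (0.30, 0.0, 0.0)

d_small = box_sdf(hand, half)              # about 0.185 m to the surface
d_large = scaled_box_sdf(hand, half, 2.0)  # about 0.07 m to the doubled box
```

In both cases the query returns the same kind of quantity, metres to the nearest surface, so a policy that consumes it does not need to know the object's dimensions explicitly.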

Multiple skills in one policy

A single policy composes pick up, push, sit/stand, and carry over long horizons without task-specific modules.

No hand-crafted reward

Training uses adversarial interaction priors in the geometric domain instead of motion-tracking or hand-designed shaping rewards.

Root command & action label only

At inference, control is specified solely by target root trajectory and action labels; no motion exemplars or retargeting.
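As a rough picture of what such an inference-time command might look like, the sketch below defines a container holding only an action label and a target root trajectory. All class and field names here are hypothetical and are not taken from the released code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple

class ActionLabel(Enum):
    # Skill names follow the list on this page; the enum itself is an assumption.
    PICK_UP = "pick_up"
    PUSH = "push"
    SIT_STAND = "sit_stand"
    CARRY = "carry"

@dataclass(frozen=True)
class RootWaypoint:
    x: float    # metres, world frame (assumed convention)
    y: float
    yaw: float  # radians

@dataclass(frozen=True)
class InteractionCommand:
    label: ActionLabel
    root_trajectory: Tuple[RootWaypoint, ...]

# A two-step plan: push, then pick up. No reference motion appears anywhere.
plan = [
    InteractionCommand(ActionLabel.PUSH,
                       (RootWaypoint(0.0, 0.0, 0.0), RootWaypoint(1.0, 0.0, 0.0))),
    InteractionCommand(ActionLabel.PICK_UP,
                       (RootWaypoint(1.0, 0.0, 0.0), RootWaypoint(1.5, 0.5, 1.57))),
]
```

The point of the sketch is what is absent: there is no field for joint-level motion exemplars or retargeted clips.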

How We Do It

We use a distance-field-centric policy representation. The top row visualizes DF-driven interaction for three different skills (pick, push, sit), and the second row shows scale generalization.

Method Overview

Unified distance-field representation and policy learning pipeline.

DF Visualization: Pick

DF visualization for the pick skill

DF Visualization: Push

DF visualization for the push skill

DF Visualization: Sit

DF visualization for the sit skill

Scale Generalization in Simulation

Pick objects at different scales with distance-field visualization

Live MuJoCo Session

Use this embedded session to inspect online rollout behavior directly in-browser. Add your own objects to the session and interact with them.
Feel free to test how a purely data-driven interaction policy completes tasks (or fails).

The live simulator is paused by default to avoid heavy startup on page load.

Hosted locally for reproducible demos and GitHub Pages compatibility. Open it in a new tab if you need more space: Humanoid Policy Viewer.
Note: The objects' DF is discretized and cached, so the policy's performance is slightly degraded. To report issues, please contact yutang.lin@stu.pku.edu.cn.
Please use Google Chrome for best performance.
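The note on the discretized, cached DF can be made concrete with a small sketch: a distance field is sampled once on a regular grid and later queried by nearest-cell lookup, which introduces a bounded error. The grid range, the step size, the sphere object, and the nearest-cell scheme are all assumptions for illustration; the demo's actual caching scheme may differ.

```python
import math

def sphere_sdf(p, radius=0.11):
    """Analytic distance to a sphere at the origin (stand-in for a real object)."""
    return math.sqrt(sum(c * c for c in p)) - radius

class CachedDF:
    """Samples a distance field on a cubic grid and answers nearest-cell queries."""

    def __init__(self, sdf, lo, hi, step):
        self.lo, self.step = lo, step
        self.n = int(round((hi - lo) / step)) + 1
        self.values = {}
        for i in range(self.n):
            for j in range(self.n):
                for k in range(self.n):
                    p = (lo + i * step, lo + j * step, lo + k * step)
                    self.values[(i, j, k)] = sdf(p)

    def _index(self, c):
        i = int(round((c - self.lo) / self.step))
        return min(max(i, 0), self.n - 1)  # clamp queries to the grid

    def query(self, p):
        return self.values[tuple(self._index(c) for c in p)]

cache = CachedDF(sphere_sdf, lo=-0.5, hi=0.5, step=0.05)
point = (0.27, 0.12, -0.08)
error = abs(cache.query(point) - sphere_sdf(point))
# A distance field changes by at most the distance moved, so inside the grid
# the error is bounded by half the cell diagonal.
bound = 0.5 * math.sqrt(3) * cache.step
```

A finer grid or interpolation between cells would shrink the error at the cost of memory or compute, which is the usual trade-off for an in-browser demo.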

1. Interact with Various Objects

One policy generalizes across object shapes and sizes—box, foam cylinder, and ball.
*The foam cylinder and soccer ball are not in the training data.

Pick up: box

2. Autonomously Recover from Disturbance/Failure

After external disturbance or failed pickup attempts, the policy re-initiates interaction and recovers autonomously using continuous geometric feedback.

Recovery: box after slipping

3. Execute Consecutive Long-Horizon Tasks

A single policy composes multiple skills in sequence over extended horizons, showing smooth transitions between interaction modes.

Combination: sit → push (2x speed)

4. Mocap and Depth Input as Perception

Our method runs with either motion-capture object state or egocentric depth only—enabling deployment with or without external tracking.

Mocap

Egocentric depth

Same task: left = mocap object state, right = depth-only policy.

Additional Vision-Only Demos

Push chest + chair (vision 2x speed)

Push chair (vision)

Our method can push objects with different geometries and scales. Thanks to Yixuan for his dedication 😁.

Push + pick (vision 2x speed)

Push + pick (ego depth 2x speed)

Long-horizon completion: push + pick (vision). Right: egocentric depth perception.

Angle 1

Angle 2

Angle 3

Foam pickup from three different angles.

Quantitative Results at a Glance

Visual summaries: simulation scale generalization, long-horizon completion, and real-world deployment.

Single-Task Scale Generalization (Ours)

One merged figure for PickUp success across object scales (from TABLE II).

Legend: Ours (Mocap), Ours (Vision), HDMI, ResMimic, PhysHSI, VisualMimic.

The y-axis shows PickUp success (%) from 0 to 100. Our variants remain strong at large scales, while reference-based methods degrade sharply away from the reference scale.

Long-Horizon: Ours (Mocap) Only

Completion rate of our main method as the number of sequential tasks increases.

Task count grows from 5 to 40 without resets, showing graceful performance decay under much longer horizons.
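One way to read a chained completion rate is to convert it into an implied per-task success rate. The sketch below does this for the 62.1% figure reported for 5-task trajectories, under the strong and purely illustrative assumption that tasks succeed independently with equal probability. That assumption is ours, not the paper's, and the helper names are hypothetical.

```python
def per_task_success(chain_rate, num_tasks):
    """Per-task success implied by a chain completion rate (independence assumed)."""
    return chain_rate ** (1.0 / num_tasks)

def chain_success(per_task, num_tasks):
    """Chain completion rate implied by a per-task success rate (independence assumed)."""
    return per_task ** num_tasks

# 62.1% completion on 5-task trajectories implies roughly 91% success per task.
p = per_task_success(0.621, 5)
```

Real rollouts need not follow this model, since recovery behavior and task ordering can make successes correlated, so the measured curve should be taken from the figure rather than from this formula.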

Abstract

Humanoid robots that autonomously interact with physical environments over extended horizons represent a central goal of embodied intelligence. Existing approaches rely on reference motions or task-specific rewards, tightly coupling policies to particular object geometries and precluding multi-skill generalization within a single framework. A unified interaction representation enabling reference-free inference, geometric generalization, and long-horizon skill composition within one policy remains an open challenge. Here we show that Distance Field (DF) provides such a representation: LessMimic conditions a single whole-body policy on DF-derived geometric cues—surface distances, gradients, and velocity decompositions—removing the need for motion references, with interaction latents encoded via a Variational Autoencoder (VAE) and post-trained using Adversarial Interaction Prior (AIP)-derived Reinforcement Learning (RL). Through DAgger-style distillation that aligns DF latents with egocentric depth features, LessMimic further transfers seamlessly to vision-only deployment without motion capture infrastructure. A single LessMimic policy achieves 80–100% success across object scales from 0.4× to 1.6× on PickUp and SitStand where baselines degrade sharply, attains 62.1% success on 5-task trajectories, and remains viable up to 40 sequentially composed tasks. By grounding interaction in local geometry rather than demonstrations, LessMimic offers a scalable path toward humanoid robots that generalize, compose skills, and recover from failures in unstructured environments.
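The DF-derived cues named in the abstract (surface distance, gradient, and velocity decomposition) can be sketched for a single query point. The version below uses a sphere as the object, a central finite difference for the gradient, and a split of velocity into components along and across the gradient direction. These choices and all function names are assumptions made for illustration; the paper's exact feature definitions may differ.

```python
import math

def sphere_df(p, center=(0.0, 0.0, 0.0), radius=0.11):
    """Distance from p to the surface of a sphere (stand-in object)."""
    return math.dist(p, center) - radius

def df_gradient(df, p, eps=1e-4):
    """Central finite-difference gradient of a distance field at p."""
    grad = []
    for axis in range(3):
        hi, lo = list(p), list(p)
        hi[axis] += eps
        lo[axis] -= eps
        grad.append((df(hi) - df(lo)) / (2.0 * eps))
    return grad

def decompose_velocity(v, grad):
    """Split v into components along and perpendicular to the DF gradient."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        raise ValueError("gradient is zero; direction to the surface is undefined")
    n = [g / norm for g in grad]
    v_n = sum(vi * ni for vi, ni in zip(v, n))  # negative means approaching the surface
    normal = [v_n * ni for ni in n]
    tangential = [vi - ci for vi, ci in zip(v, normal)]
    return v_n, normal, tangential

# Hypothetical hand state: 0.5 m from the object center, moving toward it and sideways.
hand_pos = (0.5, 0.0, 0.0)
hand_vel = (-0.2, 0.1, 0.0)

distance = sphere_df(hand_pos)
gradient = df_gradient(sphere_df, hand_pos)
v_n, v_normal, v_tangential = decompose_velocity(hand_vel, gradient)
```

Here the distance is about 0.39 m, the gradient points along +x away from the object, the normal speed is about -0.2 m/s (closing in), and the remaining 0.1 m/s along y is tangential sliding.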


BibTeX

@article{lin2026lessmimic,
  title={LessMimic: Long-Horizon Humanoid Interaction with Unified Distance Field Representations},
  author={Yutang Lin and Jieming Cui and Yixuan Li and Baoxiong Jia and Yixin Zhu and Siyuan Huang},
  journal={arXiv preprint arXiv:2602.21723},
  year={2026}
}
