UniSkill: Imitating Human Videos via
Cross-Embodiment Skill Representations

preprint

Yonsei University
*Equal contribution

TL;DR: Learning from Large-Scale Video Datasets for Cross-Embodiment Skill Representations.

Abstract

Teaser

Learning from human videos is a promising direction for addressing data scarcity in robot learning, but existing methods rely on human-robot alignment or intermediate representations (e.g., trajectories), limiting scalability. How can we leverage large-scale video datasets, whether from robots or humans, without relying on any labels or data collection constraints? We propose UniSkill, a framework that learns embodiment-agnostic skill representations from large-scale, unlabeled, cross-embodiment video data. These representations enable robot policies trained only on robot data to imitate skills from human video prompts and support flexible sub-goal generation, regardless of the demonstration's embodiment.

Overall Framework of UniSkill

Main figure

UniSkill is a scalable framework designed for learning cross-embodiment skill representations from large-scale video datasets. It leverages an intuitive image-editing pipeline built upon two core components:

  • Inverse Skill Dynamics (ISD): Extracts the motion between two video frames as a compact skill representation, capturing the core skill demonstrated in the sequence.
  • Forward Skill Dynamics (FSD): Takes the skill representation produced by ISD and predicts future frames, forecasting how the skill unfolds over time.

By training a skill-conditioned policy on robot data with these learned skill representations, UniSkill achieves embodiment-agnostic skill learning, enabling policy transfer across different robot and human embodiments. During inference, UniSkill imitates demonstration videos by extracting generalized skills directly from them, so a robot can reproduce tasks demonstrated by humans without embodiment-specific retraining.
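
As a rough illustration, the inference-time imitation loop can be sketched as follows. This is a minimal sketch under assumed names: isd, policy, env, the frame offset k, and steps_per_skill are illustrative placeholders, not the released API.

    # Minimal sketch of UniSkill-style inference from a prompt video.
    # All names (isd, policy, env, k) are illustrative assumptions.
    import torch

    @torch.no_grad()
    def imitate_prompt(prompt_video, env, isd, policy, k=16, steps_per_skill=20):
        """Extract skills from a prompt video (any embodiment) and execute them."""
        obs = env.reset()
        for t in range(0, len(prompt_video) - k, k):
            # Inverse Skill Dynamics: embed the motion between two prompt frames.
            z = isd(prompt_video[t], prompt_video[t + k])
            # Skill-conditioned policy trained only on robot data.
            for _ in range(steps_per_skill):
                action = policy(obs, z)
                obs, done = env.step(action)
                if done:
                    return obs
        return obs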

Experiments

Imitating Videos via Cross-Embodiment Skill Representation

Imitation Image

UniSkill enables imitation of demonstration videos across diverse embodiments. We evaluate UniSkill's embodiment-agnostic property on both the tabletop and kitchen benchmarks. In the tabletop benchmark, UniSkill successfully imitates tasks even when prompted by human demonstration videos. In the kitchen benchmark, it goes further by successfully imitating prompts from Anubis—a robot with a different embodiment that was entirely unseen during training. These results demonstrate that UniSkill's cross-embodiment skill representation effectively captures transferable skills across diverse embodiments.

Composition Image

UniSkill's skill representations generalize to unseen tasks. We evaluate UniSkill's skill representation through compositional tasks. Despite being trained only on individual tasks, UniSkill successfully performs task combinations (i.e., unseen tasks) using both robot and human prompts. This highlights the compositionality of its skill representation.

UniSkill demonstrates scene generalization in unseen environments. To validate robustness to unseen environments, we modify the background and objects in the prompt videos. UniSkill continues to succeed under these changes. This generalization also holds in simulation experiments, where human prompts are naturally collected in scenes that differ from the simulation environment.

Skill-Based Future Frame Prediction

UniSkill can generate future frames using embodiment-agnostic skill representations. UniSkill's ISD and FSD can be used independently. ISD extracts a skill representation from any prompt, and this representation can then be used to condition FSD, regardless of the source. As a result, sub-goal images can be generated from skill representations derived from different prompts—even those with different embodiments or from different scenes. Given the same current observation, FSD can produce distinct sub-goal images depending on the skills, enabling flexible and generalizable sub-goal generation.
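
A minimal sketch of this decoupled use, with isd and fsd as illustrative handles for the pretrained components (not the released API):

    # Sketch: skill-conditioned sub-goal generation. Names are illustrative.
    import torch

    @torch.no_grad()
    def generate_subgoal(isd, fsd, prompt_frame_t, prompt_frame_tk, current_obs):
        """Extract a skill from any prompt (human or robot) and roll it forward
        from the robot's current observation to produce a sub-goal image."""
        z = isd(prompt_frame_t, prompt_frame_tk)   # embodiment-agnostic skill
        subgoal_image = fsd(current_obs, z)        # predicted future frame
        return subgoal_image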

Improving GCBC with UniSkill. In simulation environments with human prompts, GCBC performs poorly due to large visual discrepancies between the prompt and current observation. To address this, we introduce GCBC-U, which conditions the policy on sub-goal images generated by FSD. These generated sub-goals help bridge the visual gap and improve performance.
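
A hypothetical sketch of the GCBC-U conditioning step, assuming a goal-conditioned policy gcbc_policy and the ISD/FSD handles from the sketch above:

    # Sketch of GCBC-U: condition GCBC on an FSD-generated sub-goal so the
    # goal image lies in the robot's own visual domain rather than the prompt's.
    def gcbc_u_step(gcbc_policy, isd, fsd, prompt_pair, current_obs):
        z = isd(*prompt_pair)                 # skill from a (possibly human) prompt
        goal_image = fsd(current_obs, z)      # sub-goal rendered in the robot's scene
        return gcbc_policy(current_obs, goal_image)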

UniSkill's well-structured skill representations enable precise future frame predictions. Similar to UniSkill, LAPA also leverages large-scale video datasets to extract cross-embodiment latent actions. To compare the two cross-embodiment skill (or action) representations, we evaluate the quality of future frames predicted by each method. On both seen and unseen demonstrations, UniSkill produces sharp and accurate predictions, while LAPA generates noticeably blurrier images.
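
One simple way to quantify such prediction quality is to score predicted frames against ground-truth future frames with standard image metrics; the metrics below (PSNR and SSIM) are illustrative choices, not necessarily those reported in the paper.

    # Illustrative frame-prediction evaluation with PSNR/SSIM (uint8 images).
    import numpy as np
    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    def frame_prediction_quality(pred_frames, gt_frames):
        """Average PSNR/SSIM between predicted and ground-truth future frames."""
        psnr = [peak_signal_noise_ratio(gt, pr, data_range=255)
                for gt, pr in zip(gt_frames, pred_frames)]
        ssim = [structural_similarity(gt, pr, channel_axis=-1, data_range=255)
                for gt, pr in zip(gt_frames, pred_frames)]
        return float(np.mean(psnr)), float(np.mean(ssim))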

Rollout Videos

Cross-Embodiment Imitation

Task: Pull out the tissue

Robot Prompt
  • GCBC: Success
  • XSkill: Fail (Stuck)
  • UniSkill: Success

Human Prompt
  • GCBC: Fail (Different Task)
  • XSkill: Fail (Stuck)
  • UniSkill: Success

Skill Composition

Robot-to-Robot
  • Task A-B
  • Task A-B-C
  • Task A-B-C-D

Human-to-Robot
  • Task A-B

BibTeX

@article{kim2025uniskillimitatinghumanvideos,
    title   = {UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations},
    author  = {Hanjung Kim and Jaehyun Kang and Hyolim Kang and Meedeum Cho and Seon Joo Kim and Youngwoon Lee},
    journal = {arXiv preprint arXiv:2505.08787},
    year    = {2025},
}