Learning from human videos is a promising direction for addressing data scarcity in robot learning, but existing methods rely on human-robot alignment or intermediate representations (e.g., trajectories), limiting scalability. How can we leverage large-scale video datasets, whether from robots or humans, without requiring any labels or constraints on data collection? We propose UniSkill, a framework that learns embodiment-agnostic skill representations from large-scale, unlabeled, cross-embodiment video data. These representations enable robot policies trained only on robot data to imitate skills from human video prompts and support flexible sub-goal generation, regardless of the demonstration's embodiment.
UniSkill is a scalable framework for learning cross-embodiment skill representations from large-scale video datasets. It leverages an intuitive image-editing pipeline built upon two core components: an Inverse Skill Dynamics model (ISD), which extracts a skill representation from a pair of video frames, and a Forward Skill Dynamics model (FSD), which generates a future (sub-goal) frame conditioned on that skill.
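To make the pipeline concrete, here is a minimal PyTorch sketch of the two components. The class names, MLP architectures, image resolution, skill dimensionality, and the plain reconstruction loss are illustrative assumptions for this sketch, not the paper's actual models, which realize FSD as an image-editing generator over real video frames.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed shapes for this sketch only; the real models operate on full-resolution frames.
IMG_SHAPE = (3, 64, 64)
IMG_DIM = 3 * 64 * 64
SKILL_DIM = 128

class InverseSkillDynamics(nn.Module):
    """ISD (sketch): maps a frame pair (o_t, o_{t+k}) to a latent skill z,
    intended to capture what changed rather than the embodiment that changed it."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * IMG_DIM, 1024), nn.ReLU(),
            nn.Linear(1024, SKILL_DIM),
        )

    def forward(self, obs_t, obs_tk):
        x = torch.cat([obs_t.flatten(1), obs_tk.flatten(1)], dim=-1)
        return self.net(x)  # (B, SKILL_DIM) skill latent z

class ForwardSkillDynamics(nn.Module):
    """FSD (sketch): predicts the future frame o_{t+k} from o_t and skill z.
    A toy MLP stand-in for the image-editing generator described in the paper."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMG_DIM + SKILL_DIM, 1024), nn.ReLU(),
            nn.Linear(1024, IMG_DIM),
        )

    def forward(self, obs_t, z):
        x = torch.cat([obs_t.flatten(1), z], dim=-1)
        return self.net(x).view(-1, *IMG_SHAPE)  # predicted o_{t+k}

# Training signal (sketch): reconstructing o_{t+k} forces z to encode the transition
# between frames without referencing any embodiment-specific action space.
isd, fsd = InverseSkillDynamics(), ForwardSkillDynamics()
obs_t, obs_tk = torch.rand(8, *IMG_SHAPE), torch.rand(8, *IMG_SHAPE)
z = isd(obs_t, obs_tk)
loss = F.mse_loss(fsd(obs_t, z), obs_tk)
loss.backward()
```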
UniSkill enables imitation of demonstration videos across diverse embodiments. We evaluate UniSkill's embodiment-agnostic property on both the tabletop and kitchen benchmarks. In the tabletop benchmark, UniSkill successfully imitates tasks even when prompted by human demonstration videos. In the kitchen benchmark, it goes further by successfully imitating prompts from Anubis—a robot with a different embodiment that was entirely unseen during training. These results demonstrate that UniSkill's cross-embodiment skill representation effectively captures transferable skills across diverse embodiments.
UniSkill's skill representations generalize to unseen tasks. We evaluate UniSkill's skill representation through compositional tasks. Despite being trained only on individual tasks, UniSkill successfully performs task combinations (i.e., unseen tasks) using both robot and human prompts. This highlights the compositionality of its skill representation.
UniSkill demonstrates scene generalization in unseen environments. To validate robustness to unseen environments, we modify the background and objects in the prompt videos. UniSkill continues to succeed under these changes, demonstrating scene generalization. This generalization also holds in simulation experiments, where human prompts are naturally collected in scenes that differ from the simulation environment.
UniSkill can generate future frames using embodiment-agnostic skill representations. UniSkill's ISD and FSD can be used independently. ISD extracts a skill representation from any prompt, and this representation can then be used to condition FSD, regardless of the source. As a result, sub-goal images can be generated from skill representations derived from different prompts—even those with different embodiments or from different scenes. Given the same current observation, FSD can produce distinct sub-goal images depending on the skills, enabling flexible and generalizable sub-goal generation.
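A minimal sketch of this decoupled usage, reusing the toy `isd`/`fsd` modules from the sketch above; the human prompt frames and robot observation are random placeholders standing in for real video frames.

```python
# Sub-goal generation from a cross-embodiment prompt (continuing the toy sketch above).
human_frame_t = torch.rand(1, *IMG_SHAPE)   # prompt video frame at time t (human demonstrator)
human_frame_tk = torch.rand(1, *IMG_SHAPE)  # prompt video frame at time t+k
robot_obs = torch.rand(1, *IMG_SHAPE)       # robot's current observation, possibly a different scene

with torch.no_grad():
    z_prompt = isd(human_frame_t, human_frame_tk)  # embodiment-agnostic skill from the human prompt
    sub_goal = fsd(robot_obs, z_prompt)            # sub-goal image rendered from the robot's viewpoint
```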
Improving GCBC with UniSkill. In simulation environments with human prompts, GCBC performs poorly due to large visual discrepancies between the prompt and current observation. To address this, we introduce GCBC-U, which conditions the policy on sub-goal images generated by FSD. These generated sub-goals help bridge the visual gap and improve performance.
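The GCBC-U idea can be sketched by swapping the goal image fed to a goal-conditioned policy, again continuing the toy modules above; the policy architecture and the 7-dimensional action space are assumptions for illustration.

```python
class GoalConditionedPolicy(nn.Module):
    """GCBC-style policy (sketch): predicts an action from the current observation and a goal image."""
    def __init__(self, action_dim=7):  # action dimensionality is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * IMG_DIM, 512), nn.ReLU(),
            nn.Linear(512, action_dim),
        )

    def forward(self, obs, goal_img):
        return self.net(torch.cat([obs.flatten(1), goal_img.flatten(1)], dim=-1))

policy = GoalConditionedPolicy()

# GCBC: the goal image comes straight from the human prompt video -> large visual gap.
action_gcbc = policy(robot_obs, human_frame_tk)

# GCBC-U: the goal is the FSD-generated sub-goal in the robot's own scene,
# conditioned on the skill extracted from the human prompt (computed above).
action_gcbc_u = policy(robot_obs, sub_goal)
```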
UniSkill's well-structured skill representations enable precise future frame predictions. Similar to UniSkill, LAPA also leverages large-scale video datasets to extract cross-embodiment latent actions. To compare the quality of these cross-embodiment skill (or action) representations, we evaluate the future frames predicted by each method. On both seen and unseen demonstrations, UniSkill produces sharp and accurate predictions, while LAPA generates noticeably blurrier images.
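One simple way to quantify such a comparison is a pixel-level metric such as PSNR between predicted and ground-truth future frames. The metric choice and the random placeholder frames below are illustrative assumptions, continuing the toy sketch above, and do not reproduce the paper's evaluation protocol.

```python
def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio; higher values indicate sharper, more accurate predictions."""
    mse = F.mse_loss(pred.clamp(0.0, max_val), target)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# Compare a method's predicted o_{t+k} against the ground-truth frame from a held-out demo.
gt_future = torch.rand(1, *IMG_SHAPE)
with torch.no_grad():
    pred_future = fsd(robot_obs, isd(robot_obs, gt_future))
print(f"PSNR: {psnr(pred_future, gt_future).item():.2f} dB")
```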
Compositional task demonstrations: Task A-B, Task A-B-C, Task A-B-C-D.
@article{kim2025uniskillimitatinghumanvideos,
title={UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations},
author={Hanjung Kim and Jaehyun Kang and Hyolim Kang and Meedeum Cho and Seon Joo Kim and Youngwoon Lee},
journal={arXiv preprint arXiv:2505.08787},
year={2025},
}