Suhwan Choi

I'm an undergraduate student majoring in Physics and Computer Science at Seoul National University. I'm currently a Principal Researcher at Maum.ai, where I lead the autonomous robotics research division.

My main research interests are in approximating and imitating human behavior and intelligence across multiple modalities, using end-to-end architectures and scalable training suites. I focus on embodied AI, robotic navigation, vision-language models, and multimodal learning.

Email  /  CV  /  LinkedIn  /  Github  /  Blog

profile photo

Research & Publications

I work on embodied AI, robotic navigation, and multimodal learning. My research focuses on scaling vision-action pretraining, commonsense-aware navigation systems, and vision-language model improvements. Some papers are highlighted.

D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
Suhwan Choi*, Jaeyoon Jung*, Haebin Seong*, Minchan Kim, Minyeong Kim, Yongjun Cho, Yoonshik Kim, Yubeen Park, Youngjae Yu†, Yunsung Lee†
Under Review
project page

Scaling vision-action pretraining on desktop data enables effective transfer to embodied AI tasks.

Revisiting Residual Connections: Orthogonal Updates for Stable and Efficient Deep Networks
Giyeong Oh, Woohyun Cho, Siyeol Kim, Suhwan Choi, Youngjae Yu†
NeurIPS 2025
arXiv

Revisiting residual connections with orthogonal updates for more stable and efficient deep networks.

CANVAS: Commonsense-Aware Navigation System for Intuitive Human-Robot Interaction
Suhwan Choi*, Yongjun Cho*, Minchan Kim*, Jaeyoon Jung*, Myunchul Joe, Yubeen Park, Minseo Kim, Sungwoong Kim, Sungjae Lee, Hwiseong Park, Jiwan Chung, Youngjae Yu†
ICRA 2025   (Outstanding Paper Award at NeurIPS 2024 Workshop, 3%)
project page

A commonsense-aware navigation system that enables intuitive human-robot interaction through natural language understanding.

ESREAL: Exploiting Semantic Reconstruction to Mitigate Hallucinations in Vision-Language Models
Minchan Kim*, Minyeong Kim*, Junik Bae*, Suhwan Choi, Sungkyung Kim, Buru Chang†
ECCV 2024
arXiv

Exploiting semantic reconstruction to mitigate hallucinations in vision-language models.

Experience

Principal Researcher at Maum.ai (Feb 2024 – Present)
  • Founded the autonomous robotics research division as its first researcher, leading strategic decisions and growing the team to 10 researchers.
  • Contributed as first author to the majority of research projects in robotic navigation and embodied AI.
  • Led CORE, a Slurm-based DGX cluster construction project (96 H100 GPUs, 12 nodes). [Blog]
  • Implemented a company-wide Notion workspace, enhancing productivity and streamlining workflows. [Template]
Machine Learning Engineer Intern at Hyperconnect (July 2023 – Jan 2024)
  • Worked on diffusion-based personalized profile image generation for real-world applications.

Awards & Honors

QHack Coding Challenge (2023 and 2024)
  • Ranked 4th of 793 teams in 2023 and 3rd of 618 teams in 2024.
  • A coding contest covering quantum algorithms, quantum machine learning, quantum chemistry, and brain-teasing puzzles.
2023 Quantum Hackathon (2023)
  • 1st place, Minister of Science and ICT Award
  • Topic: Utilizing symmetry to solve variational quantum algorithms (quantum machine learning) efficiently.
NAVER CLOVA AI RUSH 2022 (July – Sept 2022)
  • 3rd place in Landmark Detection (3,000,000 KRW)
  • 2nd place in Shopping User Embedding Extraction & Classification (7,000,000 KRW)
Google Code Jam (2022)
  • Reached Round 3, placing 546th (awarded a T-shirt).

Open Source Contributions

Open World Agents

Built a comprehensive multimodal desktop agent framework, including an optimized data collection tool, a standardized and efficient data format, multimedia data processing pipelines, dataset management, agent training infrastructure, Python packaging, and CI/CD.


Website source code available on GitHub.