Wei, Mingjie

Ph.D. Candidate

Embodied Intelligence (EAI) · Large Vision-Language Models (LVLM) · Reinforcement Learning (RL) · Vision-Language-Action Models (VLA)
Ph.D. Candidate in Computer Science and Technology (supervisor: Prof. Zhang Wei-Nan)
Harbin Institute of Technology | 2023-Present (2023-2025 as a Master's student)
B.Sc. in Software Engineering
Guangdong University of Foreign Studies (GDUFS) | 2018-2022

Wei Mingjie is currently a Ph.D. student jointly trained by the Harbin Institute of Technology (HIT) and the Zhongguancun Academy (BJZGCA), enrolled in the Fall 2025 cohort under the supervision of Prof. Yu Chao and Prof. Zhang Weinan. His primary research interests lie in Embodied Intelligence, Large Vision-Language Models, Reinforcement Learning, and Vision-Language-Action (VLA) models.

He began his doctoral studies at BJZGCA in September 2025. From Fall 2023 to Summer 2025, he led a collaborative team spanning the SCIR Lab, the State Key Laboratory of Robotics and Systems at HIT, and Shenzhen Leju Robot Co., developing an intelligent service robot for exhibition-hall environments; these systems are now deployed in several exhibition venues.

He previously completed a research internship at Li Auto and served as a Research Assistant at the Chinese University of Hong Kong, Shenzhen. He has authored two papers accepted at ACM MM 2025, along with a survey on Embodied Intelligence published in the robotics journal SmartBot. Recently, he joined the RLinf project, an open research effort by Infini Inc., BJZGCA, and Tsinghua University, where he contributes to the development of advanced reinforcement learning algorithms.

Key Technologies of Exhibition Hall Guide Robots based on Embodied Intelligence

2023.11-2025.06 | Team Leader, Main Contributor

We developed a multi-agent framework in which a large model recognizes user intent, a second extracts navigation waypoints, a third extracts robotic actions, and a conversational agent is augmented with retrieval and dialogue history. These agents collaboratively process user instructions to enable intelligent robot interaction and task execution (a minimal illustrative sketch follows).
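The sketch below is illustrative only: a minimal, hypothetical orchestration of the four agents in plain Python, with simple rule-based stand-ins where the deployed system uses large models. Every name in it (GuideRobotPipeline, RobotPlan, and the method names) is an assumption introduced for illustration, not the project's actual code.

from dataclasses import dataclass

# Hypothetical sketch of the multi-agent pipeline; rule-based placeholders stand in for the large-model agents.

@dataclass
class RobotPlan:
    intent: str
    waypoint: str | None = None
    action: str | None = None
    reply: str | None = None

class GuideRobotPipeline:
    """Routes one user instruction through intent, waypoint, action, and dialogue agents."""

    def __init__(self, knowledge_base: dict[str, str]):
        self.knowledge_base = knowledge_base  # exhibit descriptions used for retrieval
        self.history: list[str] = []          # dialogue history shared across turns

    def recognize_intent(self, instruction: str) -> str:
        # Placeholder for the intent-recognition model.
        text = instruction.lower()
        return "navigate" if ("take me" in text or "where is" in text) else "chat"

    def extract_waypoint(self, instruction: str) -> str:
        # Placeholder for the navigation waypoint-extraction model.
        return instruction.rsplit(" ", 1)[-1].strip("?.")

    def extract_action(self, intent: str) -> str:
        # Placeholder for the robotic action-extraction model.
        return "guide_to_waypoint" if intent == "navigate" else "stay"

    def converse(self, instruction: str) -> str:
        # Placeholder for the retrieval- and history-augmented conversational agent.
        retrieved = self.knowledge_base.get(self.extract_waypoint(instruction), "")
        self.history.append(instruction)
        return f"{retrieved or 'Happy to help.'} (turns so far: {len(self.history)})"

    def handle(self, instruction: str) -> RobotPlan:
        intent = self.recognize_intent(instruction)
        plan = RobotPlan(intent=intent, action=self.extract_action(intent))
        if intent == "navigate":
            plan.waypoint = self.extract_waypoint(instruction)
        plan.reply = self.converse(instruction)
        return plan

if __name__ == "__main__":
    pipeline = GuideRobotPipeline({"robots": "The robot exhibit shows humanoid platforms."})
    print(pipeline.handle("Where is the hall of robots"))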

Completed
PRISM: A Benchmark for Unveiling Cross-modal Knowledge Inconsistency in Large Vision-Language Models

Mingjie Wei, Wei-Nan Zhang, Chen Zhang, Yifeng Ding, Donglin Di, Lei Ren, Wei Chen, Ting Liu

ACM Multimedia 2025

LLaPa: A Vision-Language Model Framework for Counterfactual-Aware Procedural Planning

Shibo Sun, Xue Li, Donglin Di, Mingjie Wei, Lanshun Nie, Wei-Nan Zhang, Dechen Zhan, Yang Song, Lei Fan

ACM Multimedia 2025