Wei, Mingjie

Ph.D. Candidate

Embodied Intelligence (EAI), Large Vision-Language Models (LVLM), Reinforcement Learning (RL), Vision-Language-Action Models (VLA)
Ph.D. Candidate in Computer Science and Technology (supervisor: Prof. Zhang Wei-Nan)
Harbin Institute of Technology | 2023-Present (2023-2025 as a Master's student)
B.Sc. in Software Engineering
Guangdong University of Foreign Studies (GDUFS) | 2018-2022

I am a Ph.D. student jointly trained by Harbin Institute of Technology (HIT) and Zhongguancun Academy (bjzgca), enrolled in the Fall of 2025. I am supervised by Prof. Yu Chao and Prof. Zhang Wei-Nan. My primary research interests include Embodied Intelligence, Large Vision-Language Models, Reinforcement Learning, and Vision-Language-Action models.

I will start my studies at bjzgca in September 2025. From the Fall of 2023 to the Summer of 2025, I served as team leader (collaborating with the SCIR Lab at HIT, the State Key Laboratory of Robotics and Systems at HIT, and Shenzhen Leju Robot) on the development of an intelligent service robot for exhibition-hall scenarios; the robots are currently operational in several exhibition halls.

Previously, I completed a research internship at Li Auto and served as a research assistant (RA) at the Chinese University of Hong Kong, Shenzhen. On the publication side, I have authored two papers published at CCF-A conferences, and a survey on Embodied Intelligence is currently under journal review.

Key Technologies of Exhibition Hall Guide Robots based on Embodied Intelligence

2023.11-2025.06 | Team Leader, Main Contributor

We developed a multi-agent framework in which specialized agents cooperate to process user instructions: a large model for user-intent recognition, one for navigation waypoint extraction, one for robotic action extraction, and a conversational agent enhanced by retrieval and dialogue history. Together, these agents enable intelligent robot interaction and task execution, as sketched below.
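To make the dispatch structure concrete, here is a minimal, self-contained Python sketch of how such a pipeline could be wired together. Everything in it (call_llm, Retriever, GuideRobotAgents, the prompt strings) is a hypothetical placeholder for illustration, not the project's actual code.

```python
# Hypothetical sketch of the agent-dispatch pipeline described above.
# A real system would replace call_llm with an actual large-model API call.
from dataclasses import dataclass, field


def call_llm(prompt: str) -> str:
    """Placeholder for a large-model call; here it just echoes the prompt."""
    return f"[LLM reply to: {prompt[:40]}...]"


@dataclass
class Retriever:
    """Toy retrieval over exhibition-hall documents (keyword overlap)."""
    docs: list[str]

    def search(self, query: str, k: int = 2) -> list[str]:
        # Rank documents by how many query words they contain.
        ranked = sorted(self.docs, key=lambda d: -sum(w in d for w in query.split()))
        return ranked[:k]


@dataclass
class GuideRobotAgents:
    """One wrapper object routing a user instruction to the right agent."""
    retriever: Retriever
    history: list[str] = field(default_factory=list)

    def recognize_intent(self, instruction: str) -> str:
        return call_llm(f"Classify the intent (navigate/act/chat): {instruction}")

    def extract_waypoint(self, instruction: str) -> str:
        return call_llm(f"Extract the navigation waypoint: {instruction}")

    def extract_action(self, instruction: str) -> str:
        return call_llm(f"Extract the robot action to perform: {instruction}")

    def chat(self, instruction: str) -> str:
        # Conversational agent enhanced by retrieval and dialogue history.
        context = self.retriever.search(instruction)
        reply = call_llm(
            f"History: {self.history}\nRetrieved: {context}\n"
            f"Answer the visitor: {instruction}"
        )
        self.history.append(instruction)
        return reply

    def handle(self, instruction: str) -> str:
        intent = self.recognize_intent(instruction)
        if "navigate" in intent:
            return self.extract_waypoint(instruction)
        if "act" in intent:
            return self.extract_action(instruction)
        return self.chat(instruction)


if __name__ == "__main__":
    robot = GuideRobotAgents(
        Retriever(["Hall A exhibits humanoid robots.", "Hall B shows LVLM demos."])
    )
    print(robot.handle("Take me to the humanoid robot exhibit."))
```

The design point the sketch illustrates is the separation of concerns: each capability is an independently promptable agent, and only the dispatcher (handle) needs to know how the pieces collaborate.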

Status: Completed
PRISM: A Benchmark for Unveiling Cross-modal Knowledge Inconsistency in Large Vision-Language Models

Mingjie Wei, Wei-Nan Zhang, Chen Zhang, Yifeng Ding, Donglin Di, Lei Ren, Wei Chen, Ting Liu

ACM Multimedia, 2025

LLaPa: A Vision-Language Model Framework for Counterfactual-Aware Procedural Planning

Shibo Sun, Xue Li, Donglin Di, Mingjie Wei, Lanshun Nie, Wei-Nan Zhang, Dechen Zhan, Yang Song, Lei Fan

ACM Multimedia, 2025