Zhe Chen

Zhe Chen (陈喆)

PhD Candidate at Nanjing University

chenzhe98@smail.nju.edu.cn

[CV]

About Me

My name is Zhe Chen (陈喆). I am a fourth-year PhD candidate in the School of Computer Science at Nanjing University (NJU), supervised by Prof. Tong Lu.

I started my studies in 2020 through a combined Master's and PhD program, which includes 2 years for the master's degree and 4 years for the PhD, and I expect to graduate in 2026.

My research interests span LLM agents, multimodal large language models (MLLMs), vision foundation models (VFMs), and visual perception. Here is a brief overview of my research journey:

2021-2023: I focused on vision foundation models and visual perception tasks such as object detection and semantic segmentation. My representative works during this period include ViT-Adapter (ICLR 2023 Spotlight) and InternImage (CVPR 2023 Highlight, Most Influential CVPR 2023 Papers Rank 10).
2023-2025: I shifted my focus to multimodal large language models (MLLMs). I was responsible for the from-scratch pre-training of the visual backbone (InternViT-6B) and the post-training of MLLMs. I led the development of the series from InternVL 1.0 (CVPR 2024 Oral, Youth Outstanding Paper Award at WAIC 2025) through InternVL 2.5, and contributed to InternVL 3.0-3.5.
2025-2026: My primary focus has been on LLM agents. I have been deeply involved in the development of MiroThinker v0.1-v1.5, a state-of-the-art open-source search agent model.

News

[2026-02] 🎉🎉 2 papers are accepted by CVPR 2026.
[2026-01] 🎉 3 papers are accepted by ICLR 2026.
[2026-01] MiroThinker 1.5 is released, a SOTA open-source search agent model.
[2025-08] Our team released InternVL 3.5.
[2025-07] InternVL received the Youth Outstanding Paper Award at the 2025 World Artificial Intelligence Conference (2025世界人工智能大会青年优秀论文奖).
[2025-04] Our team released InternVL 3.0.
[2025-01] Vision-RWKV and OmniCorpus are accepted as ICLR 2025 spotlight papers.
[2024-12] InternVL 1.5 is accepted by Science China Information Sciences.
[2024-12] Our team released InternVL 2.5.
[2024-07] Our team released InternVL 2.0.
[2024-04] Our team released InternVL 1.5.
[2024-02] InternVL (oral) is accepted by CVPR 2024.
[2024-01] GeoDiffusion, All-Seeing, and BoS (spotlight) are accepted by ICLR 2024.
[2023-10] AVSegFormer is accepted by AAAI 2024.
[2023-10] InternImage is selected as one of CVPR 2023 Top-10 Influential Papers.
[2023-09] VisionLLM is accepted by NeurIPS 2023.
[2023-07] DDP is accepted by ICCV 2023.
[2023-02] InternImage (highlight) is accepted by CVPR 2023.
[2023-01] ViT-Adapter (spotlight) is accepted by ICLR 2023.

More News

[2023-05] We release InternGPT, which allows you to interact with ChatGPT by clicking, dragging and drawing using a pointing device.
[2023-04] GPTrans is accepted by IJCAI 2023.
[2023-01] Our team won first place in the WSDM Cup 2023 Toloka VQA Challenge.
[2022-11] Our InternImage-H set a new record of 65.4 box AP on COCO test-dev!
[2022-09] Our team won first place in 7 tracks of the Ego4D ECCV 2022 Challenge.
[2021-12] URST is accepted by AAAI 2022.
[2020-12] Our team won first place in the NAIC 2020 Remote Sensing Semantic Segmentation Task (1,000,000 RMB bonus).
[2020-05] SiameseCCR is accepted by IET Image Processing.

Education & Experiences

Nanjing University, Nanjing, China
Sept 2020 - Present
Zhejiang University of Science and Technology, Hangzhou, China
Sept 2016 - June 2020

Selected Publications

* denotes co-first authors. The full paper list can be found on Google Scholar.

LLM Agent

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

MiroMind Team, Song Bai, Lidong Bing, Carson Chen, Guanzheng Chen, Yuntao Chen, Zhe Chen, Ziyi Chen, Jifeng Dai, Xuan Dong, et al. (alphabetical order)

Technical Report, 2025

Introduction: An open-source search agent model that achieves SOTA performance through interactive scaling.

[Paper] [BibTex] [Code] [Model] [Demo]

Multimodal Large Language Model

InternVL 2.5: Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen*, Weiyun Wang*, Yue Cao*, Yangzhou Liu*, Zhangwei Gao*, Erfei Cui*, Jinguo Zhu*, Shenglong Ye*, Hao Tian*, Zhaoyang Liu*, Lixin Gu, Xuehui Wang, Qingyun Li, Yimin Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Jiaye Ge, Kai Chen, Kaipeng Zhang, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang

Technical Report, 2024

[Paper] [BibTex] [Code] [Model]

InternVL 1.5: How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen*, Weiyun Wang*, Hao Tian*, Shenglong Ye*, Zhangwei Gao, Erfei Cui, ..., Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang#

Science China Information Sciences (CCF-A), 2024

Introduction: A Pioneering Open-Source Alternative to GPT-4V.

[Paper] [BibTex] [Code] [Model] [中文解读]

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai#

CVPR Oral, 2024 | Most Influential CVPR 2024 Papers (Rank 12)

Youth Outstanding Paper Award at the 2025 World Artificial Intelligence Conference

Introduction: InternVL scales up the ViT to 6B parameters and aligns it with LLM.

[Paper] [BibTex] [Code] [Model] [Poster] [中文解读]

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Wenhai Wang*, Zhe Chen*, Xiaokang Chen*, Jiannan Wu*, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, Jifeng Dai#

NeurIPS, 2023

Introduction: We present an LLM-based framework for vision-centric tasks, termed VisionLLM.

[Paper] [BibTex] [Code] [Poster]

Vision Foundation Model

InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

Wenhai Wang*, Jifeng Dai*, Zhe Chen*, Zhenhang Huang*, Zhiqi Li*, Xizhou Zhu*, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, Yu Qiao#

CVPR Highlight, 2023 | Most Influential CVPR 2023 Papers (Rank 10)

Introduction: This work presents a new large-scale CNN-based foundation model, termed InternImage.

[Paper] [BibTex] [Code] [Poster] [中文解读]

Vision Transformer Adapter for Dense Predictions

Zhe Chen*, Yuchen Duan*, Wenhai Wang#, Junjun He, Tong Lu#, Jifeng Dai, Yu Qiao

ICLR Spotlight, 2023

Introduction: This work presents a simple yet powerful adapter for plain ViT, which remedies the defects of ViT and achieves performance comparable to vision-specific models in dense prediction tasks.

[Paper] [BibTex] [Code] [Poster] [Slides] [中文解读]

Visual Perception

Bounding Box Stability against Feature Dropout Reflects Detector Generalization across Environments

Yang Yang, Wenhai Wang, Zhe Chen, Jifeng Dai, Liang Zheng#

ICLR Spotlight, 2024

Introduction: A brand-new data-centric problem of estimating the detector performance in an unlabeled test domain.

[Paper] [BibTex] [Code] [Poster]

GeoDiffusion: Text-Prompted Geometric Control for Object Detection Data Generation

Kai Chen*, Enze Xie*, Zhe Chen, Lanqing Hong#, Zhenguo Li, Dit-Yan Yeung

ICLR, 2024

Introduction: GeoDiffusion translates geometric conditions into text prompts, enhancing T2I models for generating detection data, and improves object detector performance.

[Paper] [BibTex] [Code] [Poster]

AVSegFormer: Audio-Visual Segmentation with Transformer

Shengyi Gao, Zhe Chen, Guo Chen, Wenhai Wang, Tong Lu#

AAAI, 2024

Introduction: This work presents a new framework for AVS tasks that leverages the transformer architecture.

[Paper] [BibTex] [Code] [Poster]

DDP: Diffusion Model for Dense Visual Prediction

Yuanfeng Ji*, Zhe Chen*, Enze Xie#, Lanqing Hong, Xihui Liu, Zhaoqiang Liu, Tong Lu, Zhenguo Li, Ping Luo

ICCV, 2023

Introduction: We propose a simple, efficient, yet powerful framework for dense visual predictions based on the conditional diffusion pipeline.

[Paper] [BibTex] [Code] [Poster]

Other Papers

Graph Propagation Transformer for Graph Representation Learning

Zhe Chen, Hao Tan, Tao Wang, Tianrun Shen, Tong Lu#, Qiuying Peng, Cheng Cheng, Yue Qi

IJCAI, 2023

Introduction: This work presents a novel transformer architecture for graph representation learning.

[Paper] [BibTex] [Code]

Towards Ultra-Resolution Neural Style Transfer via Thumbnail Instance Normalization

Zhe Chen, Wenhai Wang#, Enze Xie, Tong Lu#, Ping Luo

AAAI, 2022

Introduction: URST is a versatile framework for ultra-high resolution style transfer under limited GPU memory resources.

[Paper] [BibTex] [Code] [Poster] [中文解读]

Awards & Honors

Album

WAIC 2025 · Youth Outstanding Paper Award

WSDM Cup 2023 · Toloka VQA · 1st Place

Ego4D 2022 · State Change Object Detection · 1st Place

NAIC 2020 · Remote Sensing Segmentation · 1st Place

Contests

Toloka Visual Question Answering Challenge, WSDM Cup 2023, 2023, 1st Place.
The 2nd Ego4D Challenge, ECCV Workshop, 2022, 7 Top-1 Rankings.
The 2nd National Artificial Intelligence Challenge (NAIC), Remote Sensing Semantic Segmentation Track, 2020, 1st Place (1,000,000 RMB Bonus).
The 2nd China Gaofen Cup Beautiful Countryside Competition, Remote Sensing Crop Classification Track, 2019, 3rd Prize (5,000 RMB Bonus).
The 9th National Undergraduate E-commerce "Innovation, Creativity and Entrepreneurship" Challenge, Zhejiang Division, 2019, 1st Prize.

Honors

Youth Outstanding Paper Award at the 2025 World Artificial Intelligence Conference
Youth PhD Student Research Project under the National Natural Science Foundation
Nanjing University Egret Scholarship
Outstanding Graduate of Zhejiang Province
Zhejiang Provincial Government Scholarship

Some of My Friends

Guo Chen, Zhiqi Li, Chunjiang Ge, Yuanfeng Ji, Yang Yang, Zhanhao Liang