Homepage: https://jarvisustc.github.io/
I am a fourth-year Ph.D. student in the joint program between the University of Science and Technology of China (USTC) and Microsoft Research Asia (MSRA), co-supervised by Prof. Qiang Huo at MSRA and Prof. Jun Du at USTC. My Ph.D. research focuses on Document Intelligence (including OCR, document layout analysis, and document understanding) and Large Language Models (including MLLMs, agents, and RAG). Prior to this, I received my B.S. degree from the School of the Gifted Young (a.k.a. 少年班) at the University of Science and Technology of China in 2021, majoring in Computer Science.
During my Ph.D. studies, I gained valuable industry experience through internships at MSRA, DeepSeek, and ByteDance. At DeepSeek, I contributed to the development of DeepSeek VL2, DeepSeek V3, and DeepSeek R1. At MSRA, I worked on the Microsoft OneOCR project and the Microsoft Document Intelligence project under the guidance of Researchers Qiang Huo and Lei Sun. Most recently, I began an internship with the ByteDance Seed team, where I am working on LLM/MLLM Agent projects. I have published over 10 papers in top-tier international AI journals and conferences.
I am currently seeking full-time job opportunities. If you are interested in my resume, please feel free to email me at [email protected]. I am currently based in Beijing, China. If you would like to have a coffee chat, please feel free to reach out!
- 2025.04: Thrilled to kick off my new internship with the ByteDance Seed Team!
- 2025.04: Our latest work on VLM robustness, "Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks," has been released on arXiv. We've open-sourced the Robust-VLGuard dataset and the DiffPure-VLM defense.
- 2025.03: Our paper "UniHDSA: A Unified Relation Prediction Approach for Hierarchical Document Structure Analysis" has been accepted by Pattern Recognition!
- 2024.12: We've launched a new GitHub project: Awesome-Multimodal-RAG! Check out the latest in multimodal RAG and contribute!
- 2024.12: We're excited to have contributed to DeepSeek-VL2, an advanced Vision-Language Model with strong performance and fewer parameters.
- 2024.08-09: Presented DLAFormer and DRFormer at ICDAR in Athens! Photos can be found here. It was a memorable experience meeting colleagues and exploring the city.
- 2024.08: The complete version of DLAFormer, titled "UniHDSA: A Unified Relation Prediction Approach for Hierarchical Document Structure Analysis", has been submitted to Pattern Recognition.
- 2024.07: Our Detect-Order-Construct paper has been accepted by Pattern Recognition!
- 2024.06: Our DLAFormer, UniVIE, and DRFormer papers have been selected for oral presentation at ICDAR 2024!
- 2024.03: Azure AI Document Intelligence now supports Hierarchical Document Structure Analysis (HDSA), based on our "Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis" paper. Details on arXiv and the official announcement.
- 2024.03: Source code released for our Language-Enhanced Image New Category Discovery solution from the CVPR 2023 HIT Workshop.
- 2024.02: Our new work on Document Layout Analysis, "DLAFormer: An End-to-End Transformer for Document Layout Analysis", has been submitted to ICDAR 2024.
- 2024.01: Introduced UniVIE: A Unified Label Space Approach to Visual Information Extraction from Form-like Documents! Reframing VIE as relation prediction with a unified label space.
- 2024.01: New technical paper released: Dynamic Relation Transformer for Contextual Text Block Detection!
- 2023.12: 2nd Prize, 2023 International Algorithm Case Competition (Visual Prompt Tuning Challenge @ CVPR 2023 HIT Workshop), 200,000 RMB bonus!
- 2023.11: Our new progress on Hierarchical Document Structure Analysis has been submitted to Pattern Recognition.
- 2023.07: "Robust Table Structure Recognition with Dynamic Queries Enhanced Detection Transformer" accepted by Pattern Recognition!
- 2023.04: Two papers accepted by ICDAR 2023!
- 2023.03: Proposed a new Dynamic Queries based Detection Transformer for more robust table structure recognition!
- 2022.12: 2nd Prize, 2022 International Algorithm Case Competition (Panoptic Scene Graph Challenge @ ECCV 2022 SenseHuman Workshop), 100,000 RMB bonus!
- 2022.09: One paper accepted by ACM MM 2022!
- 2025.04-Now: Research Intern, Seed Team, ByteDance, Beijing, China.
- 2024.09-2025.03: Research Intern, Multimodal Interaction Group, Microsoft Research Asia, Beijing, China.
- 2024.06-2024.08: AGI Research Intern, Multimodal LLM Team, DeepSeek, Beijing, China.
- 2020.09-2024.05: Research Intern, Multimodal Interaction Group, Microsoft Research Asia, Beijing, China.
- 2021.09-2026.06: Ph.D. in Information and Communication Engineering, University of Science and Technology of China, Hefei, Anhui, China.
- 2017.09-2021.06: B.S. in the School of the Gifted Young (major in Computer Science), University of Science and Technology of China, Hefei, Anhui, China.