Large Model Research Collaborative Innovation Center, Nanjing University (https://cs.nju.edu.cn/lm/en/)

Recent publication pages (11 Oct 2025):
- 3D interaction geometric pre-training for molecular relational learning: https://cs.nju.edu.cn/lm/en/publication/lee-3-d-2025/
- Eagle 2.5: boosting long-context post-training for frontier vision-language models: https://cs.nju.edu.cn/lm/en/publication/chen-eagle-2025/
- EgoExoBench: a benchmark for first- and third-person view video understanding in MLLMs: https://cs.nju.edu.cn/lm/en/publication/he-egoexobench-2025/
- Gated integration of low-rank adaptation for continual learning of language models: https://cs.nju.edu.cn/lm/en/publication/liang-gated-2025/
- LongVPO: from anchored cues to self-reasoning for long-form video preference optimization: https://cs.nju.edu.cn/lm/en/publication/huang-longvpo-2025/
- Loquetier: a virtualized multi-LoRA framework for unified LLM fine-tuning and serving: https://cs.nju.edu.cn/lm/en/publication/zhang-loquetier-2025/
- MotionRAG: motion retrieval-augmented image-to-video generation: https://cs.nju.edu.cn/lm/en/publication/zhu-motionrag-2025/

NeurIPS 2025 Accepted Papers Overview (11 Oct 2025): https://cs.nju.edu.cn/lm/en/post/2025-10-11-neurips-2025-accepted-papers/

<p>NeurIPS (the Annual Conference on Neural Information Processing Systems) is a top-tier machine learning conference alongside ICML and ICLR, and is widely regarded as one of the most competitive and influential venues in the field. It is rated Class A by CCF, placed in the top tier of the CORE conference ranking, and has an H5 index of 278. Founded in 1987 by researchers from the connectionist school of neural networks, NeurIPS has steadily grown in influence, with paper topics centered on machine learning, artificial intelligence, and statistics.</p> <p>The Large Model Center at Nanjing University's School of Computer Science has 9 papers accepted to NeurIPS 2025.</p> <hr> <h3 id="01">01</h3> <p><strong>Title</strong>: <a href="https://arxiv.org/pdf/2505.15424" target="_blank">Gated Integration of Low-Rank Adaptation for Continual Learning of Large Language Models</a></p> <p><strong>Authors</strong>: Yan-Shuo Liang, Jia-Rui Chen, Wu-Jun Li</p> <p><strong>Institution</strong>: Nanjing University</p> <p><strong>Abstract</strong>:</p> <p>Thanks to the rich knowledge obtained from large-scale pre-training and subsequent fine-tuning strategies, existing large language models (LLMs) have demonstrated excellent performance across a wide range of tasks.
However, when LLMs learn multiple downstream tasks sequentially, they often forget previously learned knowledge, leading to significant performance degradation on old tasks—a phenomenon known as catastrophic forgetting. Catastrophic forgetting hinders LLMs from continuously accumulating new knowledge, making it crucial to design continual learning methods that can overcome this challenge. Meanwhile, Low-Rank Adaptation (LoRA), as one of the most representative methods in parameter-efficient fine-tuning, has gained widespread attention in continual learning for LLMs. LoRA reparameterizes pre-trained weights into low-rank forms, requiring only a small number of parameters to be updated for task adaptation. Compared to full parameter updates, LoRA significantly improves fine-tuning efficiency. However, existing LoRA-based continual learning methods still have limitations. They typically expand new LoRA branches when learning new tasks while freezing old branches, thereby avoiding forgetting caused by directly modifying old parameters. During inference, these methods usually adopt simple addition to integrate new and old branches. This approach forces new and old branches to contribute equally to old tasks, which may instead cause new branches to significantly interfere with old tasks, exacerbating forgetting and reducing overall performance. To address this, we propose GainLoRA (gated integration of low-rank adaptation), a new continual learning method for LLMs. GainLoRA expands new LoRA branches for each new task and dynamically integrates new and old branches through a gating module. By imposing initialization and update constraints on the new gating module, GainLoRA significantly reduces interference from new LoRA branches on old tasks, effectively mitigating forgetting and improving the overall performance of LLMs in continual learning.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="Figure 1" srcset=" /lm/post/2025-10-11-neurips-2025-accepted-papers/paper1_hu_900138511c778587.jpg 400w, /lm/post/2025-10-11-neurips-2025-accepted-papers/paper1_hu_9bdee66c1a32537.jpg 760w, /lm/post/2025-10-11-neurips-2025-accepted-papers/paper1_hu_966cfcefda83c4a0.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-10-11-neurips-2025-accepted-papers/paper1_hu_900138511c778587.jpg" width="760" height="215" loading="lazy" data-zoomable /></div> </div></figure> </p> <p style="text-align: center; font-size: 0.9em; color: #666; margin-top: 5px;"> Figure 1 </p> <hr> <h3 id="02">02</h3> <p><strong>Title</strong>: StreamForest: Efficient Online Video Understanding with Persistent Event Memory</p> <p><strong>Authors</strong>: Xiangyu Zeng, Kefan Qiu, Qingyu Zhang, Xinhao Li, Jing Wang, Jiaxin Li, Ziang Yan, Kun Tian, Meng Tian, Xinhai Zhao, Yi Wang, Limin Wang</p> <p><strong>Institution</strong>: Nanjing University, Shanghai AI Laboratory, Zhejiang University, Huawei Noah&rsquo;s Ark Lab, Yinwang Intelligent Technology</p> <p><strong>Abstract</strong>:</p> <p>Multimodal large language models have made significant progress in video understanding in recent years. However, due to historical visual feature storage limitations and insufficient real-time spatiotemporal reasoning capabilities, their effectiveness in real-time streaming scenarios remains limited. To address these challenges, we propose StreamForest, a novel architecture specifically designed for streaming video understanding. 
The core of StreamForest is the Persistent Event Memory Forest, a memory mechanism that can adaptively organize video frames into multiple event-level tree structures. This process is guided by a penalty function based on temporal distance, content similarity, and merging frequency, enabling efficient long-term memory retention under limited computational resources. To enhance real-time perception, we introduce the Fine-grained Spatiotemporal Window, which captures detailed short-term visual cues to improve current scene perception. Additionally, we propose OnlineIT, an instruction tuning dataset customized for streaming video tasks. OnlineIT significantly improves MLLM performance in real-time perception and future prediction. To evaluate its generalization ability in practical applications, we introduce ODV-Bench, a new benchmark focused on real-time streaming video understanding in autonomous driving scenarios. Experimental results show that StreamForest achieves state-of-the-art performance, reaching 77.3% accuracy on StreamingBench, 60.5% on OVBench, and 55.6% on OVO-Bench. Notably, even under extreme visual token compression (limited to 1024 tokens), the model maintains 96.8% average accuracy across eight benchmarks (relative to the default 8k setting). These results highlight StreamForest&rsquo;s robustness, efficiency, and versatility in streaming video understanding.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="Figure 2" srcset=" /lm/post/2025-10-11-neurips-2025-accepted-papers/paper2_hu_c0e89ab8aeb1c95b.jpg 400w, /lm/post/2025-10-11-neurips-2025-accepted-papers/paper2_hu_e59c708dfc7055d6.jpg 760w, /lm/post/2025-10-11-neurips-2025-accepted-papers/paper2_hu_5cd37fb594001b42.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-10-11-neurips-2025-accepted-papers/paper2_hu_c0e89ab8aeb1c95b.jpg" width="760" height="238" loading="lazy" data-zoomable /></div> </div></figure> </p> <p style="text-align: center; font-size: 0.9em; color: #666; margin-top: 5px;"> Figure 2 </p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="Figure 3" srcset=" /lm/post/2025-10-11-neurips-2025-accepted-papers/paper3_hu_3198090095582911.jpg 400w, /lm/post/2025-10-11-neurips-2025-accepted-papers/paper3_hu_232670da05d46e19.jpg 760w, /lm/post/2025-10-11-neurips-2025-accepted-papers/paper3_hu_97aa7f1389c33db1.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-10-11-neurips-2025-accepted-papers/paper3_hu_3198090095582911.jpg" width="760" height="322" loading="lazy" data-zoomable /></div> </div></figure> </p> <p style="text-align: center; font-size: 0.9em; color: #666; margin-top: 5px;"> Figure 3 </p> <hr> <h3 id="03">03</h3> <p><strong>Title</strong>: LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization</p> <p><strong>Authors</strong>: Zhenpeng Huang, Jiaqi Li, Zihan Jia, Xinhao Li, Desen Meng, Lingxue Song, Xi Chen, Liang Li, Limin Wang</p> <p><strong>Institution</strong>: Nanjing University, China Mobile Research Institute</p> <p><strong>Abstract</strong>:</p> <p>Current vision-language models (VLMs) have limited performance in long video understanding: they rely on expensive and scarce long video annotations, and short-context models easily overlook intermediate content when extended to long sequences, causing performance imbalance between long and short tasks. To address this, we propose LongVPO—a two-stage direct preference optimization framework that requires no long video annotations. 
LongVPO first uses &ldquo;anchored cues&rdquo; to automatically synthesize preference data from short video clips, then achieves cross-clip alignment through &ldquo;self-reasoning&rdquo; on real long videos, learning complex long-range reasoning capabilities. Using only 16K synthetic data, LongVPO achieves superior performance on LVBench, LongVideoBench, MLVU, VideoMME, and other benchmarks while maintaining strong performance on short video tasks, providing a new paradigm for efficient and scalable long video understanding.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="Figure 4" srcset=" /lm/post/2025-10-11-neurips-2025-accepted-papers/paper4_hu_17f385abb0b3ffff.jpg 400w, /lm/post/2025-10-11-neurips-2025-accepted-papers/paper4_hu_f0626d540aeef0e4.jpg 760w, /lm/post/2025-10-11-neurips-2025-accepted-papers/paper4_hu_4ccef06f47458e71.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-10-11-neurips-2025-accepted-papers/paper4_hu_17f385abb0b3ffff.jpg" width="760" height="613" loading="lazy" data-zoomable /></div> </div></figure> </p> <p style="text-align: center; font-size: 0.9em; color: #666; margin-top: 5px;"> Figure 4 </p> <hr> <h3 id="04">04</h3> <p><strong>Title</strong>: Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models</p> <p><strong>Authors</strong>: Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Tuomas Rintamaki, Tyler Poon, Max Ehrlich, Tong Lu, Limin Wang, Bryan Catanzaro, Jan Kautz, Andrew Tao, Zhiding Yu, Guilin Liu</p> <p><strong>Institution</strong>: Nanjing University, NVIDIA, Hong Kong Polytechnic University, Rutgers University</p> <p><strong>Abstract</strong>:</p> <p>Eagle 2.5 is a series of frontier vision-language models (VLMs) designed for long-context multimodal understanding. Existing VLMs mainly focus on short-context tasks, with insufficient support for long video understanding and high-resolution image processing. Eagle 2.5 proposes a general training framework with two core technologies: Automatic Degradation Sampling (ADS) and Image Area Preservation (IAP), which dynamically allocate visual and text input budgets and maintain image integrity when segmenting. Additionally, the authors introduce a progressive mixed post-training strategy that gradually extends context length to improve model stability in handling diverse inputs. To support training, they construct the new Eagle-Video-110K dataset, providing story-level and clip-level dual annotations to enhance long video understanding capabilities. Experiments show that Eagle 2.5 achieves significant improvements on multiple long video and image understanding benchmarks. For example, the 8B parameter Eagle 2.5 achieves 72.4% on Video-MME with 512 frame input, approaching the performance of larger models like GPT-4o and Qwen2.5-VL-72B. The model also performs excellently on high-resolution image understanding tasks. 
In summary, Eagle 2.5 achieves efficient and powerful long-context multimodal understanding capabilities through innovative sampling strategies, progressive training methods, and large-scale multi-level datasets, providing a strong direction for future high-performance VLM development.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="Figure 5" srcset=" /lm/post/2025-10-11-neurips-2025-accepted-papers/paper5_hu_83b2ccb30ba5476a.jpg 400w, /lm/post/2025-10-11-neurips-2025-accepted-papers/paper5_hu_fed6b282eb04dbef.jpg 760w, /lm/post/2025-10-11-neurips-2025-accepted-papers/paper5_hu_6389b93d49961a5c.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-10-11-neurips-2025-accepted-papers/paper5_hu_83b2ccb30ba5476a.jpg" width="665" height="492" loading="lazy" data-zoomable /></div> </div></figure> </p> <p style="text-align: center; font-size: 0.9em; color: #666; margin-top: 5px;"> Figure 5 </p> <hr> <h3 id="05">05</h3> <p><strong>Title</strong>: VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception</p> <p><strong>Authors</strong>: Ziang Yan, Xinhao Li, Yinan He, Zhengrong Yue, Xiangyu Zeng, Yali Wang, Yu Qiao, Limin Wang, Yi Wang</p> <p><strong>Institution</strong>: Zhejiang University, Shanghai AI Laboratory, Nanjing University, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences</p> <p><strong>Abstract</strong>:</p> <p>Infusing reasoning capabilities into multimodal large language models is key to achieving human-level perception and understanding. Existing methods mostly rely on the reasoning capabilities of LLMs to analyze parsed visual information, but are often limited by static perception stages. This paper proposes &ldquo;Visual Test-Time Scaling,&rdquo; which enhances multimodal LLM reasoning capabilities through iterative perception during inference. Under the guidance of updated text predictions, it gradually refines attention to high-confidence spatiotemporal regions, mimicking human hierarchical attention mechanisms. The training process combines reinforcement learning with spatiotemporal supervision signals for end-to-end optimization of reasoning paths. These designs allow multimodal LLMs to improve performance by increasing perception computational capacity. Extensive experiments validate the effectiveness and generalization of the iterative perception method across various tasks and benchmarks. 
The newly introduced Videochat-R1.5 model achieves significant improvements across more than 15 benchmarks covering video dialogue, video reasoning, and spatiotemporal perception, with an average improvement of more than 5% compared to robust baselines like Qwen2.5VL-3B and -7B.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="Figure 6" srcset=" /lm/post/2025-10-11-neurips-2025-accepted-papers/paper6_hu_61c1219b22d57ac.jpg 400w, /lm/post/2025-10-11-neurips-2025-accepted-papers/paper6_hu_89ca2205e711afdf.jpg 760w, /lm/post/2025-10-11-neurips-2025-accepted-papers/paper6_hu_22b7da8cdb804e65.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-10-11-neurips-2025-accepted-papers/paper6_hu_61c1219b22d57ac.jpg" width="760" height="421" loading="lazy" data-zoomable /></div> </div></figure> </p> <p style="text-align: center; font-size: 0.9em; color: #666; margin-top: 5px;"> Figure 6 </p> <hr> <h3 id="06">06</h3> <p><strong>Title</strong>: MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation</p> <p><strong>Authors</strong>: Chenhui Zhu, Yilu Wu, Shuai Wang, Gangshan Wu, Limin Wang</p> <p><strong>Institution</strong>: Nanjing University</p> <p><strong>Abstract</strong>:</p> <p>Thanks to the development of diffusion models, image-to-video generation technology has made significant progress. However, generating motion-realistic videos remains a formidable challenge. The core of this challenge lies in accurately modeling the complexity of motion, which requires capturing physical laws, object interactions, and domain-specific motion patterns—prior knowledge that is difficult to generalize effectively across diverse scenarios. To address this, we propose MotionRAG, a retrieval-augmented generation framework. This framework extracts and transfers motion priors from relevant reference videos through a Context-Aware Motion Adaptation (CAMA) mechanism to improve the motion realism of generated videos. The core technical innovations include: (1) Retrieval-based motion representation extraction: using video encoders and resamplers to extract semantic-level motion features from retrieved reference videos; (2) Context learning-based motion adaptation method: efficiently learning and transferring motion patterns from multiple retrieved reference videos to target scenarios through a causal Transformer architecture; (3) Attention motion injection adapter: injecting motion features into pre-trained video diffusion models to enhance motion realism. Extensive experiments demonstrate that our method achieves significant improvements across multiple scenarios and various base models, introducing only negligible computational overhead during inference. Furthermore, its modular design supports zero-shot generalization to new domains—simply updating the retrieval database without retraining any model components. 
This research enhances the core capabilities of video generation systems by enabling efficient retrieval and transfer of motion priors, providing a new paradigm for synthesizing videos with realistic dynamic effects.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="Figure 7" srcset=" /lm/post/2025-10-11-neurips-2025-accepted-papers/paper7_hu_4a364ebe2e54e02f.jpg 400w, /lm/post/2025-10-11-neurips-2025-accepted-papers/paper7_hu_9f204335c7c4c910.jpg 760w, /lm/post/2025-10-11-neurips-2025-accepted-papers/paper7_hu_ded8ead9486f0b5.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-10-11-neurips-2025-accepted-papers/paper7_hu_4a364ebe2e54e02f.jpg" width="760" height="250" loading="lazy" data-zoomable /></div> </div></figure> </p> <p style="text-align: center; font-size: 0.9em; color: #666; margin-top: 5px;"> Figure 7 </p> <hr> <h3 id="07">07</h3> <p><strong>Title</strong>: Loquetier: A Virtualized Multi-LoRA Framework for Unified LLM Fine-tuning and Serving</p> <p><strong>Authors</strong>: Yuchen Zhang, Hanyue Du, Chun Cao, Jingwei Xu</p> <p><strong>Institution</strong>: Nanjing University</p> <p><strong>Abstract</strong>:</p> <p>Low-Rank Adaptation (LoRA) has become a widely adopted parameter-efficient fine-tuning (PEFT) technique for adapting large language models (LLMs) to downstream tasks. Although numerous studies have explored strategies for unifying LLM training and serving, the domain of unified fine-tuning and inference for LoRA-based models remains underexplored. This paper proposes Loquetier—a virtualized multi-LoRA framework that seamlessly integrates LoRA fine-tuning and inference serving in a single runtime environment. Loquetier consists of two main components: (1) a virtualization module that isolates PEFT-based model modifications and supports deploying multiple adapters on a shared single base model; (2) an optimized computational flow with kernel designs that fuse fine-tuning and inference paths in forward propagation, enabling efficient batch processing and minimizing kernel call overhead. In extensive experiments across three task scenarios, Loquetier significantly outperforms existing baselines in both performance and flexibility: achieving 3.0× throughput of top co-serving systems in inference-only tasks, and 46.4× higher service level objective attainment rate than PEFT in unified fine-tuning and inference tasks.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="Figure 8" srcset=" /lm/post/2025-10-11-neurips-2025-accepted-papers/paper8_hu_e635fbc5a4f82fdb.jpg 400w, /lm/post/2025-10-11-neurips-2025-accepted-papers/paper8_hu_58857d8840a945d7.jpg 760w, /lm/post/2025-10-11-neurips-2025-accepted-papers/paper8_hu_adbe11bc41341741.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-10-11-neurips-2025-accepted-papers/paper8_hu_e635fbc5a4f82fdb.jpg" width="760" height="703" loading="lazy" data-zoomable /></div> </div></figure> </p> <p style="text-align: center; font-size: 0.9em; color: #666; margin-top: 5px;"> Figure 8 </p> <hr> <h3 id="08">08</h3> <p><strong>Title</strong>: 3D Interaction Geometric Pre-training for Molecular Relational Learning</p> <p><strong>Authors</strong>: Namkyeong Lee, Yunhak Oh, Heewoong Noh, Gyoung S. 
Na, Minkai Xu, Hanchen Wang, Tianfan Fu, Chanyoung Park</p> <p><strong>Institution</strong>: KAIST, KRICT, Stanford University, Genentech, Nanjing University</p> <p><strong>Abstract</strong>:</p> <p>Accurate prediction of molecular interactions is crucial in drug discovery and materials science. However, existing molecular relational learning methods are mostly limited to using 2D topological structures of molecules, ignoring 3D spatial geometric information that determines the nature of interactions—primarily because obtaining precise 3D interaction conformations is extremely expensive. To break through this bottleneck, we propose 3DMRL, an innovative 3D geometric pre-training framework. The core of this framework is that instead of relying on expensive computations to obtain true interaction conformations, it simulates how molecules contact each other in 3D space by constructing a &ldquo;virtual interaction environment&rdquo;—arranging multiple small molecules around a large molecule through random sampling, translation, and rotation. Based on this, we design dual pre-training tasks to guide 2D models to learn 3D geometric information in this virtual environment: one uses contrastive learning to help models understand the global geometric structure of interactions; the other uses an equivariant network to predict fine local relative geometric relationships between molecules, capturing atomic-level interaction details. Extensive experiments show that 3DMRL can significantly improve the performance of various mainstream models on molecular interaction prediction and drug-drug interaction prediction tasks, achieving up to 24.93% performance improvement across 40 tasks and demonstrating excellent generalization capabilities in out-of-distribution scenarios. This work systematically introduces 3D geometric pre-training to the field of molecular relational learning for the first time, laying a solid foundation for developing more accurate and versatile AI-assisted scientific discovery tools.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="Figure 9" srcset=" /lm/post/2025-10-11-neurips-2025-accepted-papers/paper9_hu_83ecfc2b4544dc66.jpg 400w, /lm/post/2025-10-11-neurips-2025-accepted-papers/paper9_hu_36930691ca15c2c8.jpg 760w, /lm/post/2025-10-11-neurips-2025-accepted-papers/paper9_hu_f9bd9b27e73fdd4e.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-10-11-neurips-2025-accepted-papers/paper9_hu_83ecfc2b4544dc66.jpg" width="760" height="386" loading="lazy" data-zoomable /></div> </div></figure> </p> <p style="text-align: center; font-size: 0.9em; color: #666; margin-top: 5px;"> Figure 9 </p> <hr> <h3 id="09">09</h3> <p><strong>Title</strong>: EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs</p> <p><strong>Authors</strong>: Yuping He, Yifei Huang, Guo Chen, Baoqi Pei, Jilan Xu, Tong Lu, Jiangmiao Pang</p> <p><strong>Institution</strong>: Nanjing University, Shanghai AI Laboratory, University of Tokyo, Zhejiang University, Fudan University</p> <p><strong>Abstract</strong>:</p> <p>Human intelligence can naturally transfer and integrate knowledge between first-person (egocentric) and third-person (exocentric) perspectives, which is crucial for learning and communication. However, although current multimodal large language models (MLLMs) have achieved significant progress in single-perspective video understanding, they still lack systematic evaluation of cross-perspective reasoning. 
To address this, we propose EgoExoBench—the first benchmark for evaluating MLLMs&rsquo; first-person and third-person video understanding and reasoning capabilities.</p> <p>EgoExoBench is built on public datasets and contains 7300+ multiple-choice questions (MCQs) covering 11 sub-tasks, divided into three major challenges: semantic alignment, viewpoint association, and temporal reasoning. Task designs cover matching from task, action, object, to person levels, as well as cross-perspective spatial correspondence and event sequence reasoning.</p> <p>The research team conducted systematic evaluation of 13 mainstream open-source and closed-source MLLMs (such as GPT-4o, Claude 3.7 Sonnet, Qwen2.5-VL, InternVL3, etc.). Results show that these models perform well on single-perspective tasks but exhibit significant performance degradation on cross-perspective tasks. For example, the best open-source model Qwen2.5-VL-72B achieves only 47% overall accuracy, while humans achieve over 90% accuracy on the same tasks. Further experiments show that chain-of-thought (CoT) prompting does not improve performance and even reduces accuracy on some tasks, indicating that cross-perspective reasoning remains a major challenge for existing models.</p> <p>In summary, EgoExoBench provides a systematic and scalable evaluation framework that helps advance embodied agents and human-robot collaboration systems with human-like cross-perspective intelligence.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="Figure 10" srcset=" /lm/post/2025-10-11-neurips-2025-accepted-papers/paper10_hu_6c9d652f47e56173.jpg 400w, /lm/post/2025-10-11-neurips-2025-accepted-papers/paper10_hu_700d27bcd973b183.jpg 760w, /lm/post/2025-10-11-neurips-2025-accepted-papers/paper10_hu_130e4e1931c99ce7.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-10-11-neurips-2025-accepted-papers/paper10_hu_6c9d652f47e56173.jpg" width="486" height="760" loading="lazy" data-zoomable /></div> </div></figure> </p> <p style="text-align: center; font-size: 0.9em; color: #666; margin-top: 5px;"> Figure 10 </p> StreamForest: efficient online video understanding with persistent event memory https://cs.nju.edu.cn/lm/en/publication/zeng-streamforest-2025/ Sat, 11 Oct 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/zeng-streamforest-2025/ VideoChat-r1.5: visual test-time scaling to reinforce multimodal reasoning by iterative perception https://cs.nju.edu.cn/lm/en/publication/yan-videochat-2025/ Sat, 11 Oct 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/yan-videochat-2025/ Professor Wang Limin Receives 2025 Ant Intech Technology Award https://cs.nju.edu.cn/lm/en/post/2025-09-19-wanglimin-ant-intech-award/ Fri, 19 Sep 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/post/2025-09-19-wanglimin-ant-intech-award/ <p>Recently, at the 2025 Inclusion Bund Conference, the &ldquo;2025 Ant Intech Award&rdquo; was officially announced. 10 young scientists received the &ldquo;Ant Intech Technology Award&rdquo;. At the same time, 10 Chinese doctoral students from top universities worldwide received the &ldquo;Ant Intech Scholarship&rdquo;. 
Among them, Professor Wang Limin received the 2025 Ant Intech Technology Award.</p> <p>The 2025 Ant Intech Award is established by Ant Group Co., Ltd., providing public welfare research funding support for outstanding young scholars and doctoral students in the field of computer science, with two core awards: the &ldquo;Ant Intech Technology Award&rdquo; and the &ldquo;Ant Intech Scholarship&rdquo;.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="2025 Ant Intech Technology Award Ceremony" srcset=" /lm/post/2025-09-19-wanglimin-ant-intech-award/award_ceremony_hu_eb6fe368fc88b0e4.jpg 400w, /lm/post/2025-09-19-wanglimin-ant-intech-award/award_ceremony_hu_71d40b414ff63c00.jpg 760w, /lm/post/2025-09-19-wanglimin-ant-intech-award/award_ceremony_hu_3c0a880a5b52ec84.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-09-19-wanglimin-ant-intech-award/award_ceremony_hu_eb6fe368fc88b0e4.jpg" width="760" height="507" loading="lazy" data-zoomable /></div> </div></figure> </p> <p style="text-align: center; font-size: 0.9em; color: #666; margin-top: 5px;"> Figure: 2025 Ant Intech Technology Award Ceremony </p> <p>Academicians and industry authorities attended the award ceremony, including Chen Chun (Academician of Chinese Academy of Engineering, Professor at Zhejiang University), Zhang Hongjiang (Foreign Academician of US National Academy of Engineering), and Zheng Weimin (Academician of Chinese Academy of Engineering, Professor at Tsinghua University). Michael I. Jordan (Member of US National Academy of Sciences, Engineering, and Arts &amp; Sciences) and Jack Dongarra (Turing Award winner, Academician of US National Academy of Engineering, Professor at University of Tennessee) sent video messages to young scholars: &ldquo;The path of research may not be smooth, but the problems you explore today will define future technologies and opportunities. Be bold in seeking truth, and your research will ultimately impact the world.&rdquo;</p> <p>It is understood that this year&rsquo;s award recipients have demonstrated exceptional innovation capabilities in frontier areas such as Artificial General Intelligence (AGI), embodied intelligence, digital medicine, and data security, with their achievements being widely adopted by the industry. Professor Wang Limin won the award for his significant contributions to Artificial General Intelligence. 
According to the award citation, he developed InternVideo, the first leading general video understanding large model (with over 5 million downloads), proposed the &ldquo;progressive training&rdquo; method that enables AI to understand the dynamic world in layers, as humans do, and brought these capabilities to application scenarios such as autonomous driving.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="Professor Wang Limin Participating in Round Table Forum" srcset=" /lm/post/2025-09-19-wanglimin-ant-intech-award/forum_discussion_hu_8316d3b949f34aea.jpg 400w, /lm/post/2025-09-19-wanglimin-ant-intech-award/forum_discussion_hu_45013ee217b17f01.jpg 760w, /lm/post/2025-09-19-wanglimin-ant-intech-award/forum_discussion_hu_5c32ee2693b06f1b.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-09-19-wanglimin-ant-intech-award/forum_discussion_hu_8316d3b949f34aea.jpg" width="760" height="507" loading="lazy" data-zoomable /></div> </div></figure> </p> <p style="text-align: center; font-size: 0.9em; color: #666; margin-top: 5px;"> Figure: Professor Wang Limin participating in the 2025 Ant Intech Technology Award ceremony round table forum </p>

Correspondence as video: test-time adaption on SAM2 for reference segmentation in the wild (12 Aug 2025): https://cs.nju.edu.cn/lm/en/publication/wang-correspondence-2025/
Divide-and-conquer for enhancing unlabeled learning, stability, and plasticity in semi-supervised continual learning (12 Aug 2025): https://cs.nju.edu.cn/lm/en/publication/duan-divide-2025/

ICCV 2025 Accepted Papers (12 Aug 2025): https://cs.nju.edu.cn/lm/en/post/2025-09-19-iccv2025-accepted-papers/

<blockquote> <p>ICCV is one of the most influential top-tier conferences in computer vision. It is organized by the IEEE Computer Society, held biennially, and together with CVPR and ECCV forms the three flagship vision venues. ICCV covers cutting-edge topics such as image processing, object detection, 3D reconstruction, video understanding, and vision–language research, serving as a premier platform for presenting the latest advances and exchanging ideas. With its very high acceptance standards, ICCV represents the frontier trends and research hotspots of the field.</p> <p>Seven papers from the Large Model Center of the Department of Computer Science and Technology, Nanjing University (NJU MCG), have been accepted to ICCV 2025.</p></blockquote> <h1 id="01">01</h1> <p><strong>Title:</strong> MobileViCLIP: An Efficient Video-Text Model for Mobile Devices</p> <p><strong>Authors:</strong> Min Yang, Zihan Jia, Zhilin Dai, Sheng Guo, Limin Wang</p> <p><strong>Affiliations:</strong> Nanjing University; Ant Group</p> <p><strong>Abstract:</strong></p> <p>Although large models have achieved strong performance on many vision tasks, efficient lightweight neural networks are receiving growing attention due to their faster inference and easier deployment on mobile devices. However, existing video models still focus on larger ViT architectures, with few attempts to build efficient video architectures.
Given that many efficient CLIP models already demonstrate strong zero-shot classification and retrieval capabilities, we aim to fill the gap for video–text understanding and propose MobileViCLIP, a fast and efficient video–text model with strong zero-shot capability that can be deployed on mobile devices. Concretely, MobileViCLIP achieves performance comparable to mainstream ViT-based models on several text–video retrieval and zero-shot video classification datasets, while improving inference speed on mobile devices by tens of times. We believe focusing on efficiency for video–text models is important and valuable to the field.</p> <h1 id="02">02</h1> <p><strong>Title:</strong> p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay</p> <p><strong>Authors:</strong> Jun Zhang (张峻), Desen Meng (孟德森), Zhengming Zhang (张拯明), Zhenpeng Huang (黄振鹏), Tao Wu (吴涛), Limin Wang (王利民)</p> <p><strong>Affiliations:</strong> Nanjing University; China Mobile Research Institute</p> <p><strong>Abstract:</strong></p> <p>Despite the strong performance of multimodal large language models (MLLMs) on various downstream tasks, their massive training and inference costs hinder further development. A major cause is that the LLM must process an enormous number of visual tokens. We propose p-MoD, an efficient MLLM architecture that significantly reduces computational cost during both training and inference while maintaining performance. To reduce the number of visual tokens processed at each LLM Transformer layer, p-MoD introduces a Mixture-of-Depths (MoD) mechanism that processes only the most informative tokens at each layer and skips redundant ones. Integrating MoD into MLLMs is nontrivial; to address training/inference stability and limited training data, p-MoD designs Tanh-gated Weight Normalization (TanhNorm) and Symmetric Token Reweighting (STRing). Furthermore, we observe that visual token redundancy increases in deeper layers and thus propose Progressive Ratio Decay (PRD) to gradually reduce the kept-token ratio layer by layer. This key design fully unlocks MoD’s potential, markedly boosting efficiency and performance. On 15 benchmarks with LLaVA-1.5 and LLaVA-NeXT baselines, p-MoD matches or surpasses performance while using 55.6% inference TFLOPs, 53.7% KV cache, and 77.7% GPU training time. <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-02" srcset=" /lm/post/2025-09-19-iccv2025-accepted-papers/2_hu_1b31a54505924d75.jpg 400w, /lm/post/2025-09-19-iccv2025-accepted-papers/2_hu_95917eef9098f48e.jpg 760w, /lm/post/2025-09-19-iccv2025-accepted-papers/2_hu_e904ff494b454846.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-09-19-iccv2025-accepted-papers/2_hu_1b31a54505924d75.jpg" width="760" height="317" loading="lazy" data-zoomable /></div> </div></figure> </p> <h1 id="03">03</h1> <p><strong>Title:</strong> Scalable Image Tokenization with Index Backpropagation Quantization</p> <p><strong>Authors:</strong> Fengyuan Shi (石丰源), Zhuoyan Luo (罗卓彦), Yixiao Ge (葛艺潇), Yujiu Yang (杨余久), Ying Shan (单瀛), Limin Wang (王利民)</p> <p><strong>Affiliations:</strong> Nanjing University; Tsinghua University; Tencent</p> <p><strong>Abstract:</strong></p> <p>Existing vector quantization (VQ) methods face scalability issues, largely because codebooks updated only partially during training become unstable: as the distribution gap between inactive codes and visual features widens, codebook utilization drops and training eventually collapses. 
We propose Index Backpropagation Quantization (IBQ), a new VQ method that jointly optimizes all codebook embeddings and the visual encoder. By applying a straight-through estimator to the one-hot categorical distribution between encoded features and the codebook, IBQ makes all codes differentiable and maintains a latent space consistent with the visual encoder. IBQ enables scalable training of visual tokenizers and, for the first time, achieves large codebooks with high utilization at high dimension (256) and scale (2¹⁸). On ImageNet, IBQ shows strong scalability and competitive performance for both image reconstruction and autoregressive visual generation. <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-03" srcset=" /lm/post/2025-09-19-iccv2025-accepted-papers/3_hu_5f13c42772308603.jpg 400w, /lm/post/2025-09-19-iccv2025-accepted-papers/3_hu_cb110f10d460d9c3.jpg 760w, /lm/post/2025-09-19-iccv2025-accepted-papers/3_hu_99bbd998f5d2bf77.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-09-19-iccv2025-accepted-papers/3_hu_5f13c42772308603.jpg" width="760" height="562" loading="lazy" data-zoomable /></div> </div></figure> </p> <h1 id="04">04</h1> <p><strong>Title:</strong> Make Your Training Flexible: Towards Deployment-Efficient Video Models</p> <p><strong>Authors:</strong> Chenting Wang, Kunchang Li, Tianxiang Jiang, Xiangyu Zeng, Yi Wang, Limin Wang</p> <p><strong>Affiliations:</strong> Shanghai AI Laboratory; Shanghai Jiao Tong University; University of Science and Technology of China; Nanjing University</p> <p><strong>Abstract:</strong></p> <p>Mainstream video training typically relies on fixed spatiotemporal sampling grids that extract a fixed number of visual tokens as input, making both training and inference heavily constrained by preset sampling strategies. This rigid design hampers adaptation to varying computational budgets in downstream scenarios—especially when models trained under high compute cannot be efficiently deployed on resource-limited edge devices. We propose a new training paradigm to achieve “lossless adaptation across scenarios”: retain top performance under high compute while enabling lossless migration to low-resource environments. We first introduce Token Optimization (TO), an adaptive inference framework that dynamically samples and selects tokens according to downstream compute limits to maximize information utilization. We then develop Flux, a training-side data augmentation tool that enables flexible sampling grids with token selection, integrating seamlessly into mainstream video training frameworks to markedly enhance robustness and flexibility at near-zero extra cost. Integrated into large-scale video pretraining, FluxViT sets new SOTA under standard compute; notably, with only 1/4 tokens, FluxViT with TO still rivals the best InternVideo2 models, saving nearly 90% compute without loss. 
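To make the token-budget idea concrete, here is a minimal PyTorch sketch of budget-constrained token selection. It is not the authors' Token Optimization implementation: the importance score (a plain L2 norm of each token embedding), the `select_tokens` helper, and the tensor shapes are illustrative assumptions.

```python
import torch


def select_tokens(tokens: torch.Tensor, budget: int) -> torch.Tensor:
    """Keep at most `budget` tokens per sample, ranked by an importance score.

    tokens: (batch, num_tokens, dim) patch/frame embeddings.
    The score here is simply the L2 norm of each embedding; the criterion
    used by Token Optimization in the paper may differ.
    """
    scores = tokens.norm(dim=-1)                 # (batch, num_tokens)
    k = min(budget, tokens.size(1))
    keep = scores.topk(k, dim=1).indices         # indices of the k highest scores
    keep = keep.sort(dim=1).values               # preserve the original token order
    idx = keep.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    return tokens.gather(1, idx)                 # (batch, k, dim)


# Example: keep a quarter of the tokens, as in the 1/4-token comparison above.
x = torch.randn(2, 2048, 768)
print(select_tokens(x, budget=2048 // 4).shape)  # torch.Size([2, 512, 768])
```

In the setting described above, the budget would be chosen to match the compute limit of the deployment target, and the backbone is trained with Flux-style flexible sampling so that it remains robust to whichever token subset survives.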
<figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-04" srcset=" /lm/post/2025-09-19-iccv2025-accepted-papers/4_hu_659de66e92cd6d00.jpg 400w, /lm/post/2025-09-19-iccv2025-accepted-papers/4_hu_17c155a99fb15eab.jpg 760w, /lm/post/2025-09-19-iccv2025-accepted-papers/4_hu_f76fc51dc43bac47.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-09-19-iccv2025-accepted-papers/4_hu_659de66e92cd6d00.jpg" width="656" height="760" loading="lazy" data-zoomable /></div> </div></figure> </p> <h1 id="05">05</h1> <p><strong>Title:</strong> VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos</p> <p><strong>Authors:</strong> Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang, Pei Chu, Ruijie Zhang, Yinan He, Qirui Li, Songze Li, Zhenxiang Li, Zhongying Tu, Conghui He, Yu Qiao, Yali Wang, Yi Wang, Limin Wang</p> <p><strong>Affiliations:</strong> Shanghai AI Laboratory; Nanjing University; SIAT, Chinese Academy of Sciences (Shenzhen)</p> <p><strong>Abstract:</strong></p> <p>We introduce VRBench—the first long-form narrative video benchmark specifically designed to evaluate multi-step reasoning in large models—addressing limitations of existing evaluations that overlook temporal reasoning and process validity. VRBench contains 1,010 long videos (avg. length 1.6 hours), 9,468 human-annotated multi-step QA pairs, and 30,292 timestamped reasoning steps. Videos are curated through a multi-stage pipeline with expert cross-check, ensuring coherent plots and complexity. We build a human-in-the-loop framework to generate coherent chains-of-reasoning with timestamped steps across seven types (e.g., causal attribution, implicit reasoning). A multi-stage evaluation assesses models by both results and processes: beyond MCQ results, we propose an LLM-guided process score to comprehensively assess reasoning-chain quality. Experiments with 12 LLMs and 16 VLMs reveal current limitations in long-video multi-step reasoning and offer recommendations. <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-05" srcset=" /lm/post/2025-09-19-iccv2025-accepted-papers/5_hu_b621854db51777f5.jpg 400w, /lm/post/2025-09-19-iccv2025-accepted-papers/5_hu_f6865e8cf0c0a0b8.jpg 760w, /lm/post/2025-09-19-iccv2025-accepted-papers/5_hu_1f224cd1f030d16d.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-09-19-iccv2025-accepted-papers/5_hu_b621854db51777f5.jpg" width="760" height="309" loading="lazy" data-zoomable /></div> </div></figure> </p> <h1 id="06">06</h1> <p><strong>Title:</strong> Divide-and-Conquer for Enhancing Unlabeled Learning, Stability, and Plasticity in Semi-supervised Continual Learning</p> <p><strong>Authors:</strong> Yue Duan (段岳), Taicai Chen (陈泰财), Lei Qi (祁磊), Yinghuan Shi (史颖欢)</p> <p><strong>Affiliations:</strong> Nanjing University; Southeast University</p> <p><strong>Links:</strong> <a href="https://arxiv.org/abs/2508.05316" target="_blank" rel="noopener">https://arxiv.org/abs/2508.05316</a>, <a href="https://github.com/NJUyued/USP4SSCL" target="_blank" rel="noopener">https://github.com/NJUyued/USP4SSCL</a></p> <p><strong>Abstract:</strong></p> <p>Semi-supervised continual learning (SSCL) aims to learn from a sequence of tasks in which only part of the data is labeled—highly practical yet challenging. The core challenge is to effectively leverage unlabeled data while balancing memory stability (avoiding forgetting) and learning plasticity (learning new knowledge). 
We propose USP, a divide-and-conquer collaborative framework that systematically enhances Unlabeled learning, Stability, and Plasticity via three coupled modules. For plasticity, we propose Feature Space Reservation (FSR), which uses an Equiangular Tight Frame (ETF) to reserve positions in the feature space for future classes, reducing conflicts when learning new tasks. For unlabeled learning, we design Divide-and-Conquer Pseudo-labeling (DCP), which splits unlabeled data into high- and low-confidence subsets and assigns pseudo-labels using a classifier and a more robust Nearest Class Mean (NCM), respectively, fully utilizing all data. For stability, we introduce Class-mean anchored Unlabeled Distillation (CUD), which reuses DCP’s intermediate results and anchors unlabeled data to stable class centers computed from labeled data, effectively mitigating catastrophic forgetting. Extensive experiments show that USP significantly outperforms SOTA, improving final-task accuracy by up to 5.94%. <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-06" srcset=" /lm/post/2025-09-19-iccv2025-accepted-papers/6_hu_74024f7307e36a05.jpg 400w, /lm/post/2025-09-19-iccv2025-accepted-papers/6_hu_a32e2ac3c775635b.jpg 760w, /lm/post/2025-09-19-iccv2025-accepted-papers/6_hu_25159b1128c5bb83.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-09-19-iccv2025-accepted-papers/6_hu_74024f7307e36a05.jpg" width="446" height="512" loading="lazy" data-zoomable /></div> </div></figure> </p> <h1 id="07">07</h1> <p><strong>Title:</strong> Correspondence as Video: Test-Time Adaption on SAM2 for Reference Segmentation in the Wild</p> <p><strong>Authors:</strong> Haoran Wang (王皓冉), Zekun Li (李泽昆), Jian Zhang (张剑), Lei Qi (祁磊), Yinghuan Shi (史颖欢)</p> <p><strong>Affiliations:</strong> Nanjing University; Southeast University</p> <p><strong>Links:</strong> <a href="https://arxiv.org/abs/2508.07759" target="_blank" rel="noopener">https://arxiv.org/abs/2508.07759</a>, <a href="https://github.com/wanghr64/cav-sam" target="_blank" rel="noopener">https://github.com/wanghr64/cav-sam</a></p> <p><strong>Abstract:</strong></p> <p>Large vision models (e.g., SAM) often degrade on downstream segmentation tasks involving new domains or categories. Reference Segmentation, which uses an annotated reference image to guide the segmentation of a target image, is a promising direction. However, existing methods largely rely on meta-learning, requiring heavy training data and compute. We propose CAV-SAM, a new paradigm that turns the “correspondence” between the reference and target images into a “pseudo video,” enabling the latest video model SAM2 to adapt effectively through lightweight test-time tuning, completely avoiding costly meta-learning. The framework includes: (1) Diffusion-based Semantic Transition (DBST), which generates a smooth semantic transition sequence (pseudo video) from the reference to the target to handle semantic differences (same class, different instances); and (2) Test-Time Geometric Alignment (TTGA), which performs lightweight tuning of SAM2 using only the reference image and a novel enhanced cycle-consistency loss to better align geometric changes (pose, scale). Without meta-learning, CAV-SAM surpasses prior SOTA by about 5% on average across multiple datasets. 
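As a rough illustration of the round-trip idea behind this test-time tuning (not CAV-SAM's actual TTGA objective), the sketch below propagates the reference mask forward along the pseudo-video and back again, and penalizes disagreement with the original annotation. The `propagate` callable is a hypothetical stand-in for a promptable video segmenter such as SAM2; no real SAM2 API is assumed.

```python
import torch
import torch.nn.functional as F


def cycle_consistency_loss(propagate, frames: torch.Tensor, ref_mask: torch.Tensor) -> torch.Tensor:
    """Toy round-trip consistency objective for test-time tuning.

    propagate(frames, mask): hypothetical promptable video segmenter that
    carries `mask` (given on frames[0]) through `frames` (T, C, H, W) and
    returns mask logits (1, H, W) for the last frame. The enhanced
    cycle-consistency loss used in the paper is richer than this sketch.
    ref_mask: (1, H, W) binary float mask annotated on the reference image.
    """
    fwd_logits = propagate(frames, ref_mask)             # reference -> target
    fwd_mask = (torch.sigmoid(fwd_logits) > 0.5).float()
    back_logits = propagate(frames.flip(0), fwd_mask)    # target -> reference
    # The round trip should land back on the original reference annotation.
    return F.binary_cross_entropy_with_logits(back_logits, ref_mask)
```

In practice only a small set of parameters (for example, adapter or prompt weights) would be updated with such a loss, keeping the adaptation lightweight.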
<figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-07" srcset=" /lm/post/2025-09-19-iccv2025-accepted-papers/7_hu_7871e4b513c3546.jpg 400w, /lm/post/2025-09-19-iccv2025-accepted-papers/7_hu_660fc3c0b5b31cde.jpg 760w, /lm/post/2025-09-19-iccv2025-accepted-papers/7_hu_aa72482118adab90.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-09-19-iccv2025-accepted-papers/7_hu_7871e4b513c3546.jpg" width="760" height="320" loading="lazy" data-zoomable /></div> </div></figure> </p> Make your training flexible: towards deployment-efficient video models https://cs.nju.edu.cn/lm/en/publication/wang-make-2025/ Tue, 12 Aug 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/wang-make-2025/ MobileViCLIP: an efficient video-text model for mobile devices https://cs.nju.edu.cn/lm/en/publication/yang-mobileviclip-2025/ Tue, 12 Aug 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/yang-mobileviclip-2025/ p-MoD: building mixture-of-depths MLLMs via progressive ratio decay https://cs.nju.edu.cn/lm/en/publication/zhang-p-mod-2025/ Tue, 12 Aug 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/zhang-p-mod-2025/ Scalable image tokenization with index backpropagation quantization https://cs.nju.edu.cn/lm/en/publication/shi-scalable-2025/ Tue, 12 Aug 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/shi-scalable-2025/ VRBench: a benchmark for multi-step reasoning in long narrative videos https://cs.nju.edu.cn/lm/en/publication/yu-vrbench-2025/ Tue, 12 Aug 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/yu-vrbench-2025/ Differentiable solver search for fast diffusion sampling https://cs.nju.edu.cn/lm/en/publication/wang-differentiable-2025/ Fri, 18 Jul 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/wang-differentiable-2025/ Elucidating the design space of multimodal protein language models https://cs.nju.edu.cn/lm/en/publication/wang-elucidating-2025/ Fri, 18 Jul 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/wang-elucidating-2025/ ICML 2025 Accepted Papers https://cs.nju.edu.cn/lm/en/post/2025-09-19-icml2025-accepted-papers/ Fri, 18 Jul 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/post/2025-09-19-icml2025-accepted-papers/ <blockquote> <p>ICML is one of the most prestigious and influential conferences in machine learning. It is among the longest-running and largest venues in the field and a CCF Class-A conference.</p> <p>Four papers from the Large Model Center of the Department of Computer Science and Technology, Nanjing University (NJU MCG), have been accepted to ICML 2025.</p></blockquote> <h1 id="01">01</h1> <p><strong>Title:</strong> On the Tension between Byzantine Robustness and No-Attack Accuracy in Distributed Learning</p> <p><strong>Authors:</strong> Yi-Rui Yang, Chang-Wei Shi, Wu-Jun Li</p> <p><strong>Affiliations:</strong> Nanjing University</p> <p><strong>Link:</strong> <a href="https://cs.nju.edu.cn/lwj/paper/ICML2025_NFLinBRDL.pdf" target="_blank" rel="noopener">https://cs.nju.edu.cn/lwj/paper/ICML2025_NFLinBRDL.pdf</a></p> <p><strong>Abstract:</strong></p> <p>Distributed machine learning leverages multiple interconnected devices (nodes) and their data to train models. As datasets and models scale up, large clusters face higher rates of software/hardware failures; in open-network scenarios such as federated learning, adversarial attacks are also more likely. Faulty or malicious nodes are called Byzantine nodes. 
Byzantine-robust distributed learning often uses robust aggregators to withstand such behavior. However, when no Byzantine nodes are present, the effect of robust aggregation is underexplored. This work theoretically analyzes aggregation error in the no-attack setting and proves that the worst-case aggregation error of a robust aggregator increases with the number of Byzantine nodes it is designed to tolerate—revealing an inherent tension between Byzantine robustness and no-attack accuracy. For both non-convex objectives and those satisfying the Polyak–Łojasiewicz (PL) condition, the paper establishes tight lower bounds on the convergence rate of gradient descent with robust aggregation, reflecting the same trade-off. Experiments substantiate the theory and suggest a practical recipe: use robust aggregation during most epochs to prevent crashes/restarts; near convergence, if the cluster is healthy, switch to standard averaging to further improve accuracy—accelerating training and reducing cost while preserving accuracy. Accepted as Spotlight (top 2.6% of submissions; 9.6% of accepts). <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-01" srcset=" /lm/post/2025-09-19-icml2025-accepted-papers/1_hu_4e61ac29aa13d379.jpg 400w, /lm/post/2025-09-19-icml2025-accepted-papers/1_hu_9e454f7eaa825e0b.jpg 760w, /lm/post/2025-09-19-icml2025-accepted-papers/1_hu_48ccbcb605f24aba.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-09-19-icml2025-accepted-papers/1_hu_4e61ac29aa13d379.jpg" width="760" height="204" loading="lazy" data-zoomable /></div> </div></figure> </p> <h1 id="02">02</h1> <p><strong>Title:</strong> Stochastic Layer-Wise Shuffle for Improving Vision Mamba Training</p> <p><strong>Authors:</strong> Zizheng Huang, Haoxing Chen, Jiaqi Li, Jun Lan, Huijia Zhu, Weiqiang Wang, Limin Wang</p> <p><strong>Affiliations:</strong> Nanjing University; Shanghai Institute of Advanced Innovation; China Mobile Research Institute; Shanghai AI Laboratory</p> <p><strong>Link:</strong> <a href="https://arxiv.org/abs/2408.17081" target="_blank" rel="noopener">https://arxiv.org/abs/2408.17081</a></p> <p><strong>Abstract:</strong></p> <p>Vision Mamba (Vim) offers near-linear computational complexity and strong potential for high-resolution images and long videos, but training—especially at large scales—often suffers from overfitting and complicated pipelines, leaving a gap to leading ViT models on standard benchmarks. This paper proposes Stochastic Layer-Wise Shuffle (SLWS), a plug-and-play regularization method that randomly shuffles each layer’s input token sequence during training with a probability increasing linearly with depth, and restores the original order at output. SLWS encourages deep layers to learn position-invariant high-level semantics, while shallow layers remain sensitive to low-level positional cues. The induced shuffling increases task difficulty as a regularizer, mitigating overfitting. SLWS requires no architectural changes and incurs zero inference overhead. It stabilizes training of large Vim models and yields consistent gains under supervised training. With CLIP-feature–guided masked feature distillation pretraining, Vim-Huge achieves 87.6% fine-tuning accuracy on ImageNet-1K, establishing a new SOTA for Vision Mamba training. 
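The shuffle-and-restore operation described above is straightforward to express in code. Below is a minimal PyTorch sketch rather than the authors' implementation: the linear depth schedule, the `max_prob` cap, and the `ShuffledLayer` wrapper are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ShuffledLayer(nn.Module):
    """Stochastic layer-wise shuffle around a token-sequence layer.

    With a probability that grows linearly with depth, the input token order
    is permuted before the wrapped layer and restored afterwards, so only the
    layer's internal computation sees the shuffled sequence. Disabled in eval
    mode, so inference is unchanged.
    """

    def __init__(self, layer: nn.Module, depth_idx: int, num_layers: int, max_prob: float = 0.5):
        super().__init__()
        self.layer = layer
        self.prob = max_prob * (depth_idx + 1) / num_layers  # deeper -> more likely

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, tokens, dim)
        if self.training and torch.rand(()).item() < self.prob:
            perm = torch.randperm(x.size(1), device=x.device)
            inv = torch.argsort(perm)            # inverse permutation
            out = self.layer(x[:, perm, :])      # layer sees shuffled tokens
            return out[:, inv, :]                # restore the original order
        return self.layer(x)
```

Because the permutation is undone at the output, the wrapper only changes what each layer sees during training; shallow layers are rarely shuffled and keep their positional sensitivity, while deep layers are pushed toward position-invariant semantics.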
<figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-02" srcset=" /lm/post/2025-09-19-icml2025-accepted-papers/2_hu_20510d147d3aca5b.jpg 400w, /lm/post/2025-09-19-icml2025-accepted-papers/2_hu_b73a4bb8f1372a03.jpg 760w, /lm/post/2025-09-19-icml2025-accepted-papers/2_hu_31f0ddc24953cb07.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-09-19-icml2025-accepted-papers/2_hu_20510d147d3aca5b.jpg" width="760" height="317" loading="lazy" data-zoomable /></div> </div></figure> </p> <h1 id="03">03</h1> <p><strong>Title:</strong> Elucidating the Design Space of Multimodal Protein Language Models (ICML Spotlight)</p> <p><strong>Authors:</strong> Xinyou Wang*, Cheng-Yen Hsieh*, Daiheng Zhang, Dongyu Xue, Fei Ye, Shujian Huang, Zaixiang Zheng, Quanquan Gu</p> <p><strong>Affiliations:</strong> Nanjing University; Rutgers University; ByteDance</p> <p><strong>Link:</strong> <a href="https://arxiv.org/abs/2504.11454" target="_blank" rel="noopener">https://arxiv.org/abs/2504.11454</a></p> <p><strong>Abstract:</strong></p> <p>Proteins are biological macromolecules whose amino-acid sequences fold into specific 3D structures. AI-driven protein modeling and design is a key direction in AI for Science. Following the 2024 Nobel Prize in Chemistry recognizing DeepMind’s AlphaFold for solving the long-standing protein folding problem, AI methods are increasingly used in antibody design, enzyme engineering, and therapeutics. Protein sequences share structural similarity with natural language. Building on this insight, NJU’s NLP group and ByteDance Research have explored generative protein modeling, including DPLM (a general diffusion protein language model, ICML 2024) and DPLM-2 (a multimodal protein base model, ICLR 2025). This work advances that line of research. Code: <a href="https://github.com/bytedance/dplm;" target="_blank" rel="noopener">https://github.com/bytedance/dplm;</a> Project: <a href="https://bytedance.github.io/dplm/" target="_blank" rel="noopener">https://bytedance.github.io/dplm/</a>.</p> <p>Multimodal Protein Language Models (PLMs) jointly model and generate protein sequences and structures. Sequences are modeled with discrete diffusion over amino-acid tokens (as in DPLM). Structures are continuous 3D coordinates that must be discretized into structure tokens for joint modeling. We identify three challenges: (1) discretizing coordinates causes information loss and harms fine-grained structural fidelity; (2) discrete structure tokens under-capture intrinsic correlations of local structure; and (3) insufficient geometric modeling hinders accurate capture of complex 3D residue interactions.</p> <p>We address these by introducing a more precise generative modeling scheme tailored for protein structures to improve prediction accuracy, and by adding explicit geometric supervision via a geometric module with representation alignment to enhance geometric relational modeling. Experiments show strong gains: RMSD on folding drops from 5.52 to 2.36, comparable to ESMFold; in unconditional protein generation, sampling diversity improves by ~30% while maintaining sample quality. 
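The folding results above are reported as RMSD, the root-mean-square deviation between predicted and reference coordinates after optimal rigid superposition. As a small self-contained reference for that metric (not part of the paper's method), here is the standard Kabsch-aligned RMSD in NumPy.

```python
import numpy as np


def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """RMSD between two (N, 3) coordinate sets after optimal rigid alignment.

    P is superposed onto Q with the Kabsch algorithm (centering + SVD),
    then the root-mean-square deviation over paired atoms is returned.
    """
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                                   # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T       # optimal rotation
    diff = P @ R.T - Q
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))


# Sanity check: a rigidly rotated copy of the same structure has RMSD ~ 0.
coords = np.random.rand(128, 3)
c, s = np.cos(0.3), np.sin(0.3)
Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
print(round(kabsch_rmsd(coords @ Rz.T, coords), 6))  # ~ 0.0
```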
<figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-03" srcset=" /lm/post/2025-09-19-icml2025-accepted-papers/3_hu_69814230aa645eee.jpg 400w, /lm/post/2025-09-19-icml2025-accepted-papers/3_hu_7b82362eb098dca8.jpg 760w, /lm/post/2025-09-19-icml2025-accepted-papers/3_hu_f8849354112e0008.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-09-19-icml2025-accepted-papers/3_hu_69814230aa645eee.jpg" width="760" height="364" loading="lazy" data-zoomable /></div> </div></figure> </p> <h1 id="04">04</h1> <p><strong>Title:</strong> Differentiable Solver Search for Fast Diffusion Sampling</p> <p><strong>Authors:</strong> Shuai Wang, Zexian Li, Qipeng Zhang, Tianhui Song, Xubin Li, Tiezheng Ge, Bo Zheng, Limin Wang</p> <p><strong>Affiliations:</strong> Nanjing University; Alibaba</p> <p><strong>Link:</strong> <a href="https://arxiv.org/abs/2505.21114" target="_blank" rel="noopener">https://arxiv.org/abs/2505.21114</a></p> <p><strong>Abstract:</strong></p> <p>Diffusion models deliver excellent generation quality but with substantial inference cost. Recent ODE-based advanced solvers target lower compute under few sampling steps, yet many are inspired by Adams-type linear multistep methods and rely solely on time-dependent Lagrange interpolation—which may be suboptimal for diffusion dynamics. This paper reveals a compact solver-design search space over time steps and solver coefficients, and proposes a differentiable solver search algorithm to discover superior solvers.</p> <p>With the searched solvers, FlowMatching models SiT-XL/2 and FlowDCN-XL/2 achieve FID 2.40 and 2.35 on ImageNet 256×256 with only 10 steps; the DDPM model DiT-XL/2 reaches FID 2.33 in 10 steps. The discovered solvers substantially outperform traditional solvers (and even some distillation methods) and generalize across architectures, resolutions, and model scales. <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-04" srcset=" /lm/post/2025-09-19-icml2025-accepted-papers/4_hu_b22a437e411fb2ce.jpg 400w, /lm/post/2025-09-19-icml2025-accepted-papers/4_hu_968df2ffd3238664.jpg 760w, /lm/post/2025-09-19-icml2025-accepted-papers/4_hu_16a85f847cb370c.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-09-19-icml2025-accepted-papers/4_hu_b22a437e411fb2ce.jpg" width="760" height="203" loading="lazy" data-zoomable /></div> </div></figure> </p> On the tension between Byzantine robustness and no-attack accuracy in distributed learning https://cs.nju.edu.cn/lm/en/publication/yang-tension-2025/ Fri, 18 Jul 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/yang-tension-2025/ Stochastic layer-wise shuffle for improving Vision Mamba training https://cs.nju.edu.cn/lm/en/publication/huang-stochastic-2025/ Fri, 18 Jul 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/huang-stochastic-2025/ 12 Papers from Nanjing University’s Large Model Center Accepted by CVPR 2025 https://cs.nju.edu.cn/lm/en/post/2025-04-30-cvpr25-accepted/ Wed, 30 Apr 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/post/2025-04-30-cvpr25-accepted/ <blockquote> <p><strong>CVPR</strong> (the IEEE/CVF Conference on Computer Vision and Pattern Recognition) is one of the world’s most influential annual academic conferences, covering cutting-edge research in computer vision, pattern recognition, and related fields. Each year it gathers top researchers, scholars, and industry professionals to discuss the latest technological advances and innovative applications. 
Topics range from image processing and machine learning to 3-D reconstruction and video analysis. All submissions undergo a rigorous peer-review process to ensure originality and academic value. In the 2024 Google Scholar Metrics, CVPR ranked second among all journals and conferences worldwide, just behind <em>Nature</em>.</p> <p>The Large Model Center of the School of Computer Science at Nanjing University has had <strong>12 papers</strong> accepted by CVPR 2025.</p></blockquote> <h1 id="01">01</h1> <p><strong>Title:</strong> UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming</p> <p><strong>Authors:</strong> Hao Lin, Ke Wu, Jie Li, Jun Li, Wu-Jun Li</p> <p><strong>Affiliation:</strong> Nanjing University</p> <p><strong>Link:</strong> <a href="https://arxiv.org/abs/2307.16375" target="_blank" rel="noopener">https://arxiv.org/abs/2307.16375</a></p> <p><strong>Abstract:</strong> Training large models usually demands multi-node, multi-GPU distributed setups. Even with ample hardware, 64 %–87 % of users (in our experiments) fail to obtain results because of sub-optimal hyper-parameters such as how the model and data are partitioned. Moreover, slow training is often tackled by adding GPUs while ignoring the decisive role of distributed algorithms in hardware utilization. Efficient algorithms deliver several-fold speed-ups—and cost cuts—over less efficient ones. Many existing strategies are inefficient and can even slow training as GPU count rises. We present <strong>UniAP</strong>, the first method to jointly optimize intra-layer (e.g., tensor parallelism) and inter-layer (e.g., pipeline parallelism) strategies via automatic search, together with a supporting platform. Given a model and hardware profile, UniAP automatically finds a high-performance scheme, achieving up to 3.8 × speed-up over the best prior work and up to 9 × over unoptimized baselines, while preventing the hyper-parameter mistakes that often cripple runs. UniAP has also been adapted to domestic AI accelerators. The paper was accepted as an <strong>Oral</strong> (0.7 % of submissions, 3.3 % of accepted papers) at CVPR 2025.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-01" srcset=" /lm/post/2025-04-30-cvpr25-accepted/01_hu_e100169f849da6b4.jpg 400w, /lm/post/2025-04-30-cvpr25-accepted/01_hu_640f60e0a6475723.jpg 760w, /lm/post/2025-04-30-cvpr25-accepted/01_hu_c519ad08e5f68f55.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-04-30-cvpr25-accepted/01_hu_e100169f849da6b4.jpg" width="715" height="244" loading="lazy" data-zoomable /></div> </div></figure> </p> <h1 id="02">02</h1> <p><strong>Title:</strong> Balanced Direction from Multifarious Choices: Arithmetic Meta-Learning for Domain Generalization</p> <p><strong>Authors:</strong> Xiran Wang, Jian Zhang, Lei Qi, Yinghuan Shi</p> <p><strong>Affiliation:</strong> Nanjing University; Southeast University</p> <p><strong>Link:</strong> <a href="https://arxiv.org/abs/2503.18987" target="_blank" rel="noopener">https://arxiv.org/abs/2503.18987</a></p> <p><strong>Abstract:</strong> Domain generalization tackles distribution shifts between source (training) and unseen target (test) domains. First-order meta-learning based on gradient alignment finds balanced parameters across multiple sources, mitigating over-fitting. We reveal that gradient-aligned paths are not unique and that existing methods explore only one. 
Furthermore, they focus on directional alignment but ignore where in parameter space the model converges; ideally, the solution should lie near the centroid of each source optimum. We propose <strong>Arithmetic Meta-Learning (Arith)</strong>, which introduces parameter averaging into meta-learning and designs an arithmetic-gradient optimizer that approximates the centroid while preserving gradient direction. Arith needs no extra expert networks or explicit regularizers and achieves strong generalization across benchmarks.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-02" srcset=" /lm/post/2025-04-30-cvpr25-accepted/02_hu_25cd3c75baa28be7.jpg 400w, /lm/post/2025-04-30-cvpr25-accepted/02_hu_970cb5f3f06cca6b.jpg 760w, /lm/post/2025-04-30-cvpr25-accepted/02_hu_b4048c24b1b1dfb7.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-04-30-cvpr25-accepted/02_hu_25cd3c75baa28be7.jpg" width="760" height="142" loading="lazy" data-zoomable /></div> </div></figure> </p> <h1 id="03">03</h1> <p><strong>Title:</strong> Steady Progress Beats Stagnation: Mutual Aid of Foundation and Conventional Models in Mixed-Domain Semi-Supervised Medical Image Segmentation</p> <p><strong>Authors:</strong> Qinghe Ma, Jian Zhang, Zekun Li, Qian Yu, Lei Qi, Yinghuan Shi</p> <p><strong>Affiliation:</strong> Nanjing University; Southeast University</p> <p><strong>Link:</strong> <a href="https://arxiv.org/abs/2503.16997" target="_blank" rel="noopener">https://arxiv.org/abs/2503.16997</a></p> <p><strong>Abstract:</strong> Large-scale pretrained vision foundation models show impressive generality, yet their rich priors can be a double-edged sword when adapted to specialized tasks. In medical-image segmentation with domain mismatch, foundation models such as MedSAM often yield over-confident but erroneous predictions, hampering leverage of unlabeled data. We introduce <strong>SynFoC</strong>, a framework that co-trains a foundation model with a from-scratch conventional model. The latter corrects high-confidence errors of the former, while the former supplies high-quality pseudo-labels early on. A Self-Mutual Confidence (SMC) module assesses pseudo-label quality and adaptively fuses them; a consensus–disagreement consistency constraint further boosts collaboration. Experiments confirm superior performance over existing approaches.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-03" srcset=" /lm/post/2025-04-30-cvpr25-accepted/03_hu_f53d83166b7ec020.jpg 400w, /lm/post/2025-04-30-cvpr25-accepted/03_hu_58afb2d2924d8bbf.jpg 760w, /lm/post/2025-04-30-cvpr25-accepted/03_hu_aaf053b28554ac74.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-04-30-cvpr25-accepted/03_hu_f53d83166b7ec020.jpg" width="760" height="431" loading="lazy" data-zoomable /></div> </div></figure> </p> <h1 id="04">04</h1> <p><strong>Title:</strong> Taste More, Taste Better: Diverse Data and Strong Model Boost Semi-Supervised Crowd Counting</p> <p><strong>Authors:</strong> Maochen Yang, Zekun Li, Jian Zhang, Lei Qi, Yinghuan Shi</p> <p><strong>Affiliation:</strong> Nanjing University; Southeast University</p> <p><strong>Link:</strong> <a href="https://arxiv.org/abs/2503.17984" target="_blank" rel="noopener">https://arxiv.org/abs/2503.17984</a></p> <p><strong>Abstract:</strong> Crowd counting is vital in smart-city and public-safety applications, yet dense annotation is costly. Semi-supervised counting aims to exploit unlabeled data, but effective use remains challenging. 
We propose <strong>TMTB (Taste More Taste Better)</strong>, advancing both <em>data</em> and <em>model</em> aspects. (1) <strong>Inpainting Augmentation</strong> uses diffusion models to regenerate image backgrounds without disturbing crowd structures, greatly enriching data diversity; unreliable regions are filtered. (2) <strong>Visual State Space Model (VSSM)</strong> serves as the backbone, capturing global context with linear complexity—ideal for extreme density, low light, or bad weather. (3) A noise-robust classification head supplies coarse-but-stable interval-count supervision, mitigating regression sensitivity to label noise. On multiple datasets, TMTB outperforms state-of-the-art methods under 5 %, 10 %, and 40 % label fractions; on JHU-Crowd++ with only 5 % labels it lowers MAE below 70 for the first time (67.0) and shows strong cross-domain generalization.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-04" srcset=" /lm/post/2025-04-30-cvpr25-accepted/04_hu_278f838391606089.jpg 400w, /lm/post/2025-04-30-cvpr25-accepted/04_hu_cdf396fd430f27ef.jpg 760w, /lm/post/2025-04-30-cvpr25-accepted/04_hu_d3e571302f9cee35.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-04-30-cvpr25-accepted/04_hu_278f838391606089.jpg" width="760" height="307" loading="lazy" data-zoomable /></div> </div></figure> </p> <h1 id="05">05</h1> <p><strong>Title:</strong> AutoLUT: LUT-Based Image Super-Resolution with Automatic Sampling and Adaptive Residual Learning</p> <p><strong>Authors:</strong> Yuheng Xu, Shijie Yang, Xin Liu, Jie Liu, Jie Tang, Gangshan Wu</p> <p><strong>Affiliation:</strong> Nanjing University</p> <p><strong>Link:</strong> <a href="https://arxiv.org/abs/2503.01565" target="_blank" rel="noopener">https://arxiv.org/abs/2503.01565</a></p> <p><strong>Abstract:</strong> The spread of high-DPI displays heightens demand for high-def images, yet edge devices struggle to host heavy SR networks, calling for efficiency. Prior LUT-based SR has scarcely mined pixel-level cues and uses fixed sampling, limiting accuracy and fine-detail capture. We introduce two plug-and-play modules: <strong>AutoSample</strong>, which learns flexible LUT sampling weights during training—adapting to pixel variations, enlarging receptive field, and incurring no inference overhead—and <strong>AdaRL</strong>, which strengthens inter-layer connections to boost fine-detail reconstruction. 
With similar storage, AutoLUT lifts MuLUT by ≈ 0.20 dB PSNR across five datasets; on SPF-LUT it halves storage, cuts inference time by two-thirds, and maintains fidelity.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-05-1" srcset=" /lm/post/2025-04-30-cvpr25-accepted/05_1_hu_df8e5326d4da1621.jpg 400w, /lm/post/2025-04-30-cvpr25-accepted/05_1_hu_cfb5a5f08b60986.jpg 760w, /lm/post/2025-04-30-cvpr25-accepted/05_1_hu_f4296c559f2e85a7.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-04-30-cvpr25-accepted/05_1_hu_df8e5326d4da1621.jpg" width="760" height="492" loading="lazy" data-zoomable /></div> </div></figure> <br> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-05-2" srcset=" /lm/post/2025-04-30-cvpr25-accepted/05_2_hu_3f8d391d3fe2d009.jpg 400w, /lm/post/2025-04-30-cvpr25-accepted/05_2_hu_24b186745db08aac.jpg 760w, /lm/post/2025-04-30-cvpr25-accepted/05_2_hu_da094309db3bb924.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-04-30-cvpr25-accepted/05_2_hu_3f8d391d3fe2d009.jpg" width="760" height="539" loading="lazy" data-zoomable /></div> </div></figure> </p> <h1 id="06">06</h1> <p><strong>Title:</strong> CATANet: Efficient Content-Aware Token Aggregation for Lightweight Image Super-Resolution</p> <p><strong>Authors:</strong> Xin Liu, Jie Liu, Jie Tang, Gangshan Wu</p> <p><strong>Affiliation:</strong> Nanjing University</p> <p><strong>Link:</strong> <a href="https://arxiv.org/abs/2503.06896" target="_blank" rel="noopener">https://arxiv.org/abs/2503.06896</a></p> <p><strong>Abstract:</strong> Transformer-based SR excels on low-level vision but its quadratic complexity explodes with resolution. Existing speed-ups partition images into content-agnostic windows, curtailing long-range redundancy exploitation vital for SR. We propose <strong>CATANet</strong>, a lightweight Content-Aware Token Aggregation Network. A novel aggregation module clusters content-similar tokens across the entire image, sharing aggregation centers and updating them only during training to cut computation. We then apply intra-group self-attention for long-range interaction and inter-group cross-attention to enhance global fusion. Compared with the clustering-based SPIN, CATANet is faster at inference while gaining up to 0.33 dB PSNR.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-06" srcset=" /lm/post/2025-04-30-cvpr25-accepted/06_hu_f0d9dff8885c37a9.jpg 400w, /lm/post/2025-04-30-cvpr25-accepted/06_hu_31eb83525db6757a.jpg 760w, /lm/post/2025-04-30-cvpr25-accepted/06_hu_f908f4f2e8636b12.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-04-30-cvpr25-accepted/06_hu_f0d9dff8885c37a9.jpg" width="760" height="257" loading="lazy" data-zoomable /></div> </div></figure> </p> <h1 id="07">07</h1> <p><strong>Title:</strong> Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy Conditioning</p> <p><strong>Authors:</strong> Jiange Yang, Haoyi Zhu, Yating Wang, Gangshan Wu, Tong He, Limin Wang</p> <p><strong>Affiliation:</strong> Nanjing University; Shanghai AI Lab; USTC; Tongji University</p> <p><strong>Link:</strong> <a href="https://arxiv.org/pdf/2411.14519" target="_blank" rel="noopener">https://arxiv.org/pdf/2411.14519</a></p> <p><strong>Abstract:</strong> Data scarcity and heterogeneity challenge robot learning. 
<strong>Tra-MoE</strong> adopts a sparsely gated Mixture-of-Experts to learn trajectory prediction from large-scale cross-domain video without action labels, balancing parameter sharing and specialization. It fuses simulation videos rendered by different physics engines with real videos of humans, single-arm, and dual-arm robots—promising for cross-agent learning. An adaptive policy-conditioning mechanism leverages predicted trajectories to boost downstream robot control, greatly reducing needs for expensive real-robot data.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-07" srcset=" /lm/post/2025-04-30-cvpr25-accepted/07_hu_690021f7f420773b.jpg 400w, /lm/post/2025-04-30-cvpr25-accepted/07_hu_9ef725b0289d4609.jpg 760w, /lm/post/2025-04-30-cvpr25-accepted/07_hu_34bb4380e8a85264.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-04-30-cvpr25-accepted/07_hu_690021f7f420773b.jpg" width="760" height="632" loading="lazy" data-zoomable /></div> </div></figure> </p> <h1 id="08">08</h1> <p><strong>Title:</strong> LeviTor: 3-D Trajectory Oriented Image-to-Video Synthesis</p> <p><strong>Authors:</strong> Hanlin Wang, Hao Ouyang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Qifeng Chen, Yujun Shen, Limin Wang</p> <p><strong>Affiliation:</strong> Nanjing University; Ant Group; Zhejiang University; Hong Kong University of Science and Technology; Shanghai AI Lab</p> <p><strong>Link:</strong> <a href="https://github.com/ant-research/LeviTor" target="_blank" rel="noopener">https://github.com/ant-research/LeviTor</a></p> <p><strong>Abstract:</strong> Sketching a trajectory is an intuitive way to control motion in image-to-video synthesis, yet 2-D paths are ambiguous for out-of-plane motion. <strong>LeviTor</strong> enriches interaction by adding a <strong>depth</strong> dimension: users assign relative depth to trajectory key-points, retaining 2-D convenience while enabling 3-D control. Objects are represented by a few cluster points reflecting depth and occlusion. These, along with depth and instance maps, guide a video-diffusion generator to produce videos faithfully following 3-D trajectories. Extensive experiments demonstrate precise motion control and high realism.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-08" srcset=" /lm/post/2025-04-30-cvpr25-accepted/08_hu_1751dbef0790baa6.jpg 400w, /lm/post/2025-04-30-cvpr25-accepted/08_hu_fdaded7387e23b3c.jpg 760w, /lm/post/2025-04-30-cvpr25-accepted/08_hu_67565fc74801c8df.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-04-30-cvpr25-accepted/08_hu_1751dbef0790baa6.jpg" width="760" height="346" loading="lazy" data-zoomable /></div> </div></figure> </p> <h1 id="09">09</h1> <p><strong>Title:</strong> Contextual AD Narration with Interleaved Multimodal Sequence</p> <p><strong>Authors:</strong> Hanlin Wang, Zhan Tong, Kecheng Zheng, Yujun Shen, Limin Wang</p> <p><strong>Affiliation:</strong> Nanjing University; KU Leuven; Ant Group; Shanghai AI Lab</p> <p><strong>Link:</strong> <a href="https://arxiv.org/abs/2403.12922" target="_blank" rel="noopener">https://arxiv.org/abs/2403.12922</a></p> <p><strong>Abstract:</strong> Audio description (AD) narrates visual content for the visually impaired. We present <strong>Uni-AD</strong>, a simple unified framework that feeds interleaved multimodal sequences—video features, text, character lists, and context—into a pretrained language model. 
A lightweight mapper aligns video to text space for fine-grained fusion; a character-optimization module highlights major roles in context. Coupled with context cues and a contrastive loss, Uni-AD generates fluent, context-aware narration. Experiments on multiple AD datasets confirm its superiority.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-09" srcset=" /lm/post/2025-04-30-cvpr25-accepted/09_hu_fb61f5341a939d52.jpg 400w, /lm/post/2025-04-30-cvpr25-accepted/09_hu_d2db3b4868bbc9f0.jpg 760w, /lm/post/2025-04-30-cvpr25-accepted/09_hu_c1ccb64f2076ede5.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-04-30-cvpr25-accepted/09_hu_fb61f5341a939d52.jpg" width="760" height="504" loading="lazy" data-zoomable /></div> </div></figure> </p> <h1 id="10">10</h1> <p><strong>Title:</strong> Multiple Object Tracking as ID Prediction</p> <p><strong>Authors:</strong> Ruopeng Gao, Ji Qi, Limin Wang</p> <p><strong>Affiliation:</strong> Nanjing University; China Mobile (Jiangsu) Software Technology Co.; Shanghai AI Lab</p> <p><strong>Link:</strong> <a href="https://github.com/MCG-NJU/MOTIP" target="_blank" rel="noopener">https://github.com/MCG-NJU/MOTIP</a></p> <p><strong>Abstract:</strong> Multi-object tracking (MOT) is traditionally decomposed into detection and association, with handcrafted algorithms maintaining trajectories and computing cost matrices—effective yet requiring extensive tuning for complex scenes. We reconceptualize MOT as <strong>context-conditioned ID prediction</strong> and propose <strong>MOTIP</strong>, an end-to-end framework that directly decodes ID labels for current detections given past trajectories. Using only appearance features, MOTIP achieves state-of-the-art results on multiple benchmarks without elaborate tricks, offering a powerful baseline for future research.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-10" srcset=" /lm/post/2025-04-30-cvpr25-accepted/10_hu_9a57f170c0398507.jpg 400w, /lm/post/2025-04-30-cvpr25-accepted/10_hu_e7dbbf8506bf5ad1.jpg 760w, /lm/post/2025-04-30-cvpr25-accepted/10_hu_e955a05bbb16460f.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-04-30-cvpr25-accepted/10_hu_9a57f170c0398507.jpg" width="760" height="297" loading="lazy" data-zoomable /></div> </div></figure> </p> <h1 id="11">11</h1> <p><strong>Title:</strong> Online Video Understanding: OVBench and VideoChat-Online</p> <p><strong>Authors:</strong> Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng, Cheng Liang, Tao Wu, Xi Chen, Liang Li, Limin Wang</p> <p><strong>Project Site:</strong> <a href="https://videochat-online.github.io/" target="_blank" rel="noopener">https://videochat-online.github.io/</a></p> <p><strong>Affiliation:</strong> Nanjing University; China Mobile Research Institute; Shanghai AI Lab</p> <p><strong>Abstract:</strong> Multimodal large language models have excelled at <em>offline</em> video understanding, but real-time scenarios (e.g., autonomous driving, HCI) pose fresh challenges. We contribute on three fronts: <strong>(1) OVBench</strong>, a comprehensive QA benchmark evaluating perception, memory, and reasoning over <em>streaming</em> video, spanning six task types across past, current, and future contexts (16 subtasks from diverse datasets). <strong>(2) Pyramid Memory Bank</strong>, which efficiently retains critical spatio-temporal cues. 
<strong>(3) An offline-to-online learning paradigm</strong>, with an alternating dialog format and the <strong>VideoChatOnline-IT</strong> instruction-tuning set for streaming data. Our resulting framework, <strong>VideoChat-Online</strong>, outperforms state-of-the-art offline and online models on common offline benchmarks and OVBench, despite lower compute cost.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-11-1" srcset=" /lm/post/2025-04-30-cvpr25-accepted/11_1_hu_94babecca7def9bb.jpg 400w, /lm/post/2025-04-30-cvpr25-accepted/11_1_hu_55fabf84f53a00bd.jpg 760w, /lm/post/2025-04-30-cvpr25-accepted/11_1_hu_8e3fe13bf05610a0.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-04-30-cvpr25-accepted/11_1_hu_94babecca7def9bb.jpg" width="609" height="371" loading="lazy" data-zoomable /></div> </div></figure> <br> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-11-2" srcset=" /lm/post/2025-04-30-cvpr25-accepted/11_2_hu_fde002e05d04aee9.jpg 400w, /lm/post/2025-04-30-cvpr25-accepted/11_2_hu_730596cd8c4630d4.jpg 760w, /lm/post/2025-04-30-cvpr25-accepted/11_2_hu_f28beafc197c48a9.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-04-30-cvpr25-accepted/11_2_hu_fde002e05d04aee9.jpg" width="760" height="346" loading="lazy" data-zoomable /></div> </div></figure> </p> <h1 id="12">12</h1> <p><strong>Title:</strong> Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment</p> <p><strong>Authors:</strong> Zi&rsquo;ang Yan, Zhilin Li, Yinan He, Chenting Wang, Kunchang Li, Xinhao Li, Xiangyu Zeng, Zilei Wang, Yali Wang, Yu Qiao, Limin Wang, Yi Wang</p> <p><strong>Affiliation:</strong> Shanghai AI Lab; Zhejiang University; University of Science and Technology of China; Shanghai Jiao Tong University; Shenzhen Institutes of Advanced Technology, CAS; Nanjing University</p> <p><strong>Abstract:</strong> Although multimodal LLMs excel at broad visual reasoning, they lag on fine-grained or high-precision tasks. Prior efforts either add tool-usage skills or fold specific vision tasks into the autoregressive framework, often harming overall multimodal performance. We propose <strong>Task Preference Optimization (TPO)</strong>, which introduces differentiable task <em>preferences</em> distilled from fine-grained vision tasks to guide optimization. Learnable <strong>task tokens</strong> form dynamic links between multiple task-specific heads and the core MLLM, enabling effective use of rich labeled data. TPO supports joint multi-task training, boosting overall performance by <strong>14.6 %</strong> versus baselines and delivering strong zero-shot generalization comparable to fully-supervised state-of-the-art models. 
We instantiate TPO on <strong>VideoChat</strong> and <strong>LLaVA</strong>, confirming significant gains and opening a scalable pathway to enhance MLLMs on diverse visual tasks.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-12" srcset=" /lm/post/2025-04-30-cvpr25-accepted/12_hu_4c6e69a2ee907fe2.jpg 400w, /lm/post/2025-04-30-cvpr25-accepted/12_hu_81cb1bf782f4230.jpg 760w, /lm/post/2025-04-30-cvpr25-accepted/12_hu_1c13f9b0e7f5944c.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-04-30-cvpr25-accepted/12_hu_4c6e69a2ee907fe2.jpg" width="760" height="298" loading="lazy" data-zoomable /></div> </div></figure> </p> <p><a href="https://mp.weixin.qq.com/s/RKXp_7lzeO9Ad7axKbcShw" target="_blank" rel="noopener">Read Original</a></p> AutoLUT: LUT-based image super-resolution with automatic sampling and adaptive residual learning https://cs.nju.edu.cn/lm/en/publication/xu-autolut-2025/ Sun, 20 Apr 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/xu-autolut-2025/ Balanced direction from multifarious choices: arithmetic meta-learning for domain generalization https://cs.nju.edu.cn/lm/en/publication/wang-balanced-2025/ Sun, 20 Apr 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/wang-balanced-2025/ CATANet: efficient content-aware token aggregation for lightweight image super-resolution https://cs.nju.edu.cn/lm/en/publication/liu-catanet-2025/ Sun, 20 Apr 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/liu-catanet-2025/ Contextual AD narration with interleaved multimodal sequence https://cs.nju.edu.cn/lm/en/publication/wang-contextual-2025/ Sun, 20 Apr 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/wang-contextual-2025/ LeviTor: 3D trajectory oriented image-to-video synthesis https://cs.nju.edu.cn/lm/en/publication/wang-levitor-2025/ Sun, 20 Apr 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/wang-levitor-2025/ Multiple object tracking as id prediction https://cs.nju.edu.cn/lm/en/publication/gao-multiple-2025/ Sun, 20 Apr 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/gao-multiple-2025/ Online video understanding: a comprehensive benchmark and memory-augmented method https://cs.nju.edu.cn/lm/en/publication/huang-online-2025/ Sun, 20 Apr 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/huang-online-2025/ Steady progress beats stagnation: mutual aid of foundation and conventional models in mixed domain semi-supervised medical image segmentation https://cs.nju.edu.cn/lm/en/publication/ma-steady-2025/ Sun, 20 Apr 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/ma-steady-2025/ Task preference optimization: improving multimodal large language models with vision task alignment https://cs.nju.edu.cn/lm/en/publication/yan-task-2025/ Sun, 20 Apr 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/yan-task-2025/ Taste more, taste better: diverse data and strong model boost semi-supervised crowd counting https://cs.nju.edu.cn/lm/en/publication/yang-taste-2025/ Sun, 20 Apr 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/yang-taste-2025/ Tra-MoE: learning trajectory prediction model from multiple domains for adaptive policy conditioning https://cs.nju.edu.cn/lm/en/publication/yang-tra-moe-2025/ Sun, 20 Apr 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/yang-tra-moe-2025/ Uniap: unifying inter-and intra-layer automatic parallelism by mixed integer quadratic programming https://cs.nju.edu.cn/lm/en/publication/lin-uniap-2025/ Sun, 20 Apr 2025 00:00:00 
+0000 https://cs.nju.edu.cn/lm/en/publication/lin-uniap-2025/ CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding https://cs.nju.edu.cn/lm/en/publication/chen-cg-bench-2025/ Tue, 15 Apr 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/chen-cg-bench-2025/ Five Papers from Nanjing University’s School of Computer Science Large Model Innovation Center Accepted at ICLR 2025 https://cs.nju.edu.cn/lm/en/post/2025-04-15-iclr25-accepted/ Tue, 15 Apr 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/post/2025-04-15-iclr25-accepted/ <blockquote> <p>ICLR (International Conference on Learning Representations) is one of the leading AI conferences focusing on deep learning and representation learning. Since its inception in 2013, ICLR has become a premier platform for machine learning research, particularly in deep learning, neural architectures, reinforcement learning, generative models, and NLP.</p> <p>Five papers from the Large Model Innovation Center of Nanjing University’s School of Computer Science were accepted at ICLR 2025.</p></blockquote> <h1 id="01">01</h1> <p><strong>Title:</strong> TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning<br> <strong>Authors:</strong> Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, Yali Wang, Yu Qiao, Limin Wang<br> <strong>Affiliations:</strong> Nanjing University, Shanghai AI Laboratory, Chinese Academy of Sciences, etc.<br> <strong>Link:</strong> <a href="https://openreview.net/forum?id=nAVejJURqZ" target="_blank" rel="noopener">https://openreview.net/forum?id=nAVejJURqZ</a><br> <strong>Abstract:</strong> Most existing video multimodal large models tend to focus on irrelevant segments when understanding long videos, often leading to hallucinations. Can we enhance MLLMs’ long-video QA performance by using temporal localization as an auxiliary task to pinpoint relevant subsegments? We propose TimeSuite, which incrementally fine-tunes short-video MLLMs with time-location data to boost long-video understanding. TimeSuite includes: a simple, efficient long-video framework (VideoChat‑T); a high‑quality localization‑based instruction tuning dataset (TimePro); and a tailored instruction task (Temporal Grounded Caption). Joint tuning guides MLLMs to focus on correct segments, improving QA accuracy. First, VideoChat‑T achieves expert‑level temporal localization without external decoders while retaining strong QA generalization and zero‑shot ability. Second, integrating the expert task enhances comprehensive long‑video understanding, validating this hybrid approach. 
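</p> <p>As a concrete illustration of grounded tuning, a TimePro-style training sample can interleave timestamps with the caption text, so the ordinary next-token objective also supervises temporal localization. The instruction format below is an assumption for illustration only, not the dataset&rsquo;s actual template.</p> <pre><code class="language-python">def grounded_caption_sample(question, segments):
    # segments: list of (start_sec, end_sec, description) tuples (hypothetical format)
    parts = [f"From {s:.0f}s to {e:.0f}s, {desc}" for s, e, desc in segments]
    return {"prompt": question, "target": " ".join(parts)}

sample = grounded_caption_sample(
    "Describe the key events in the video with their time ranges.",
    [(12, 34, "a person unpacks a delivery box."),
     (58, 75, "the person assembles a bookshelf.")],
)
# sample["target"] reads: "From 12s to 34s, a person unpacks ... From 58s to 75s, ..."
</code></pre> <p>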
Experiments show VideoChat‑T yields 5.6% and 6.8% accuracy gains on Egoschema and VideoMME, respectively, and demonstrates superior zero‑shot localization, matching supervised expert models after fine‑tuning.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-01" srcset=" /lm/post/2025-04-15-iclr25-accepted/01_hu_b3d12c09cbdd7fad.jpg 400w, /lm/post/2025-04-15-iclr25-accepted/01_hu_d362119f7329d1c0.jpg 760w, /lm/post/2025-04-15-iclr25-accepted/01_hu_febcbeb47fa1401e.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-04-15-iclr25-accepted/01_hu_b3d12c09cbdd7fad.jpg" width="760" height="347" loading="lazy" data-zoomable /></div> </div></figure> </p> <h1 id="02">02</h1> <p><strong>Title:</strong> CG-Bench: Clue‑grounded Question Answering Benchmark for Long Video Understanding<br> <strong>Authors:</strong> Guo Chen, Yicheng Liu, Yifei Huang, Yuping He, Baoqi Pei, Jilan Xu, Yali Wang, Tong Lu, Limin Wang<br> <strong>Affiliations:</strong> Nanjing University, Shanghai AI Laboratory, Fudan University, Zhejiang University<br> <strong>Link:</strong> <a href="https://openreview.net/forum?id=le4IoZZHy1" target="_blank" rel="noopener">https://openreview.net/forum?id=le4IoZZHy1</a><br> <strong>Abstract:</strong> We introduce CG‑Bench, a benchmark for long‑video multimodal reasoning using a “Clue‑Question‑Answer” triplet. Unlike multiple‑choice tests, models must answer correctly and accurately locate supporting video segments. CG‑Bench offers three tasks: perception (basic visual skills), reasoning (temporal &amp; multimodal integration), and hallucination detection (reliability under ambiguity). It uses dual evaluation: white‑box IoU for localization precision and black‑box Clue Recovery Rate for context dilution. Combining multiple‑choice and open‑ended forms with human annotations and heuristic rules, CG‑Bench ensures evaluation quality. The dataset contains 1,219 long videos across 638 subcategories, totaling 12,129 QA pairs. Results show models like GPT‑4o perform well on multiple choice but drop sharply when localization is required (white‑box acc@IoU only 4.38%, open‑ended accuracy &lt;40%). Performance varies with video length, frame sampling, and multimodal cues, highlighting challenges in precise information retrieval for long‑video reasoning.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-02" srcset=" /lm/post/2025-04-15-iclr25-accepted/02_hu_4b6007ea5ded8cb1.jpg 400w, /lm/post/2025-04-15-iclr25-accepted/02_hu_ae5eb409b9a89a60.jpg 760w, /lm/post/2025-04-15-iclr25-accepted/02_hu_de5d2148b66f69c8.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-04-15-iclr25-accepted/02_hu_4b6007ea5ded8cb1.jpg" width="760" height="319" loading="lazy" data-zoomable /></div> </div></figure> </p> <h1 id="03">03</h1> <p><strong>Title:</strong> SPA: 3D Spatial‑Awareness Enables Effective Embodied Representation<br> <strong>Authors:</strong> Haoyi Zhu, Honghui Yang, Yating Wang, Jiange Yang, Limin Wang, Tong He<br> <strong>Affiliations:</strong> University of Science and Technology of China, Shanghai AI Laboratory, Zhejiang University, Tongji University, Nanjing University<br> <strong>Link:</strong> <a href="https://openreview.net/forum?id=6TLdqAZgzn" target="_blank" rel="noopener">https://openreview.net/forum?id=6TLdqAZgzn</a><br> <strong>Abstract:</strong> Spatial awareness is critical for robots in complex environments, but existing methods struggle to capture 3D geometry. 
We propose SPA, a visual representation framework that enhances 3D spatial awareness for embodied tasks. SPA trains on a large multi‑view dataset with camera poses, depth, and semantic maps from synthetic and real robot scenes. It builds volumetric features from multi‑view input, uses mask‑based differentiable neural rendering to generate RGB, depth, and semantic maps, and applies Eikonal regularization with SDF supervision for geometric consistency. After 6,000 GPU hours, SPA outperforms baselines on 200+ tasks across real and eight simulated environments, ranking first in 30.3% of tasks.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-03-1" srcset=" /lm/post/2025-04-15-iclr25-accepted/03_1_hu_cfc9b6bcd7e1d394.jpg 400w, /lm/post/2025-04-15-iclr25-accepted/03_1_hu_c4da3db7a5ab7bf1.jpg 760w, /lm/post/2025-04-15-iclr25-accepted/03_1_hu_eb043d5946813c7d.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-04-15-iclr25-accepted/03_1_hu_cfc9b6bcd7e1d394.jpg" width="760" height="392" loading="lazy" data-zoomable /></div> </div></figure> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-03-2" srcset=" /lm/post/2025-04-15-iclr25-accepted/03_2_hu_743d4620733f098e.jpg 400w, /lm/post/2025-04-15-iclr25-accepted/03_2_hu_6fd6df0797ddf931.jpg 760w, /lm/post/2025-04-15-iclr25-accepted/03_2_hu_c5163664956ee2cc.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-04-15-iclr25-accepted/03_2_hu_743d4620733f098e.jpg" width="760" height="383" loading="lazy" data-zoomable /></div> </div></figure> </p> <h1 id="04">04</h1> <p><strong>Title:</strong> Proving Olympiad Inequalities by Synergizing LLMs and Symbolic Reasoning<br> <strong>Authors:</strong> Zenan Li, Zhaoyu Li, Wen Tang, Xian Zhang, Yuan Yao, Xujie Si, Fan Yang, Kaiyu Yang, Xiaoxing Ma<br> <strong>Affiliations:</strong> Nanjing University, University of Toronto, Microsoft Research Asia, Peking University, Meta<br> <strong>Link:</strong> <a href="https://openreview.net/forum?id=FiyS0ecSm0" target="_blank" rel="noopener">https://openreview.net/forum?id=FiyS0ecSm0</a><br> <strong>Abstract:</strong> AI has advanced in competition‑level proofs, especially inequalities, which pose huge search spaces at each step. We present a neural‑symbolic system that integrates neural networks with symbolic reasoning, excelling on Olympiad‑level inequality tasks. On a standard set of 20 problems, our system solves 16 on average (versus 15 by human gold medalists), outperforming GPT and DeepSeek. 
This breakthrough showcases neural‑symbolic methods’ potential for complex mathematical reasoning, opening new avenues in automated theorem proving, education, and research.</p> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="img-04" srcset=" /lm/post/2025-04-15-iclr25-accepted/04_hu_38ab42c5a7a94fbf.jpg 400w, /lm/post/2025-04-15-iclr25-accepted/04_hu_445968fae589ebc6.jpg 760w, /lm/post/2025-04-15-iclr25-accepted/04_hu_f6eef80a78281444.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-04-15-iclr25-accepted/04_hu_38ab42c5a7a94fbf.jpg" width="760" height="410" loading="lazy" data-zoomable /></div> </div></figure> </p> <h1 id="05">05</h1> <p><strong>Title:</strong> MeteoRA: Multiple‑tasks Embedded LoRA for Large Language Models<br> <strong>Authors:</strong> Jingwei Xu, Junyu Lai, Yunpeng Huang<br> <strong>Affiliations:</strong> Nanjing University<br> <strong>Link:</strong> <a href="https://openreview.net/pdf?id=yOOJwR15xg" target="_blank" rel="noopener">https://openreview.net/pdf?id=yOOJwR15xg</a><br> <strong>Abstract:</strong> The “pretrain + finetune” paradigm underpins LLM deployment, with LoRA as a popular efficient fine‑tuning method. Yet task awareness and adapter switching remain challenging with multiple LoRA adapters. We propose MeteoRA, a scalable multi‑task LoRA architecture embedding task‑specific adapters and a routing component via a Mixture‑of‑Experts (MoE) design for adaptive adapter selection. A hybrid expert model acceleration strategy leverages PyTorch and Triton–based custom operators to avoid MoE routing loops, achieving 4× speedup. Experiments demonstrate MeteoRA’s effectiveness on composite tasks, handling up to ten serial questions per inference and showing clear routing biases, confirming adaptive switching.</p> <p><a href="https://mp.weixin.qq.com/s/iG5D5n4EXy1MrG4riTt1ag" target="_blank">View original</a></p> MeteoRA: Multiple-tasks Embedded LoRA for Large Language Models https://cs.nju.edu.cn/lm/en/publication/xu-meteora-2025/ Tue, 15 Apr 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/xu-meteora-2025/ Proving Olympiad Inequalities by Synergizing LLMs and Symbolic Reasoning https://cs.nju.edu.cn/lm/en/publication/li-proving-2025/ Tue, 15 Apr 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/li-proving-2025/ SPA: 3D Spatial-Awareness Enables Effective Embodied Representation https://cs.nju.edu.cn/lm/en/publication/zhu-spa-2025/ Tue, 15 Apr 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/zhu-spa-2025/ TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning https://cs.nju.edu.cn/lm/en/publication/zeng-timesuite-2025/ Tue, 15 Apr 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/publication/zeng-timesuite-2025/ Shusheng InternVideo2.5 Open-Sourced, Precisely Finding the 'Needle in a Haystack' in Tens of Thousands of Frames, with Fine-Grained Spatiotemporal Perception https://cs.nju.edu.cn/lm/en/post/2025-02-11-internvideo-25-release/ Tue, 11 Feb 2025 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/post/2025-02-11-internvideo-25-release/ <blockquote> <p>Recently, the Shanghai AI Lab, in collaboration with Nanjing University and the Shenzhen Institutes of Advanced Technology, jointly open-sourced the multi-modal video model Shusheng InternVideo2.5. In the field of video understanding, the upgraded InternVideo2.5 has achieved improvements in both temporal span and fine granularity, expanding its capacity sixfold compared to the previous model. 
It enables a precise &ldquo;needle in a haystack&rdquo; search within long videos containing tens of thousands of frames, allowing AI to more accurately interpret the complex real world and infuse new quality into various applications. Previously, the Shusheng InternVideo series was applied during the live broadcast of the Paris Olympics by China Central Television, precisely pinpointing athletes&rsquo; scoring moments and corresponding slow-motion replays, significantly enhancing TV production efficiency. With enhanced long video processing capabilities, InternVideo2.5 will offer more efficient AI support for applications such as autonomous driving, security surveillance, and virtual reality.</p></blockquote> <p>Open source link: <a href="https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5"><a href="https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5" target="_blank" rel="noopener">https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5</a></a> <br>Paper link: <a href="https://arxiv.org/abs/2501.12386"><a href="https://arxiv.org/abs/2501.12386" target="_blank" rel="noopener">https://arxiv.org/abs/2501.12386</a></a> <br>Huggingface link: <a href="https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B"><a href="https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B" target="_blank" rel="noopener">https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B</a></a></p> <video controls poster="/lm/post/2025-02-11-internvideo-25-release/cover.jpg" > <source src="https://cs.nju.edu.cn/lm/post/2025-02-11-internvideo-25-release/InternVideo2.5_demo.mp4" type="video/mp4"> </video> <h3 id="focus-on-fine-grained-spatiotemporal-understanding-and-efficient-long-video-processing">Focus on Fine-Grained Spatiotemporal Understanding and Efficient Long Video Processing</h3> <p>Shanghai AI Lab has continuously invested in video multi-modal large model (Video MLLM) technology since 2022, successively launching and open-sourcing the general video foundation model Shusheng InternVideo, the video understanding large model Shusheng InternVideo2, and the dialogue-centric video understanding paradigm VideoChat. 
By leveraging its experience in video visual representation learning and multi-modal dialogue, the upgraded InternVideo2.5 focuses on fine spatiotemporal understanding through deep integration of visual perception and language comprehension, achieving breakthroughs in long video understanding.</p> <p><strong>InternVideo2.5 Capability Characteristics:</strong></p> <ul> <li>Ultra-long video processing: Accurately locate targets within tens of thousands of frames, with processing length extended from 3,000 to 10,000 frames.</li> <li>Fine-grained perception: Accurately identify and locate objects, scenes, and actions while comprehending subtle spatiotemporal relationships.</li> <li>Integration of multiple visual capabilities: Not only supports general video Q&amp;A but also proficiently handles specialized tasks such as object tracking and segmentation.</li> </ul> <div class="img-full-width"> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="image" srcset=" /lm/post/2025-02-11-internvideo-25-release/figure_hu_a50f22fad0dc27e5.jpg 400w, /lm/post/2025-02-11-internvideo-25-release/figure_hu_edbee0977f0f6f4f.jpg 760w, /lm/post/2025-02-11-internvideo-25-release/figure_hu_db8b68f7be8654cc.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2025-02-11-internvideo-25-release/figure_hu_a50f22fad0dc27e5.jpg" width="760" height="217" loading="lazy" data-zoomable /></div> </div></figure> </p> </div> <p><span style="font-size: 0.8em; line-height: 0.2; color: rgb(136, 136, 136);">Left image: Performance comparison between InternVideo2.5 and other 8-billion-parameter open models on MVBench and VideoMME; Right image: InternVideo2.5 accurately tracks and analyzes videos.</span></p> <h3 id="lrc-combined-with-progressive-training-to-overcome-bottlenecks-in-long-video-modeling">LRC Combined with Progressive Training to Overcome Bottlenecks in Long Video Modeling</h3> <p>For long videos and fine-grained visual tasks, traditional video multi-modal large models face significant challenges in accurately tracking target objects in ultra-long videos or recognizing subtle spatiotemporal relationships in complex scenes. For example, in &ldquo;needle in a haystack&rdquo; tasks, conventional methods require extensive computational resources and deliver unsatisfactory localization accuracy, thereby limiting industrial applications. To address this, Shanghai AI Lab, together with its research team, leveraged its self-developed Shusheng InternVL2.5 base model to propose Long-range Context Modeling (LRC) technology as a solution.</p> <p><strong>The Two Core Modules of Long-range Context Modeling (LRC) Technology:</strong></p> <ul> <li> <p>Hierarchical Context Compression (HiCo): Exploits redundancy in long video visual data through layered compression. Experimental results demonstrate that with HiCo, InternVideo2.5 can accurately locate target frames within tens of thousands of frames, leading in performance among open models.</p> </li> <li> <p>Task Preference Optimization (TPO): Transforms annotations from various fine-grained visual tasks (such as object tracking, segmentation, and temporal localization) into differentiable task preferences, thereby guiding the model&rsquo;s self-learning to extend its capabilities to specialized visual applications.</p> </li> </ul> <p>Additionally, the team pre-trained InternVideo2.5 using a progressive multi-stage training strategy on over 300,000 hours of video data, ensuring robust video processing capabilities. 
The training corpus includes vision-language alignment data, long video sequences, and specialized visual task data, providing abundant information for comprehensive model learning. Following the progressive training scheme of Shusheng InternVL, the approach enhances fine-grained perception and temporal understanding in stages: initial basic learning for task recognition and video-language alignment; subsequent integration and training of specific task components alongside visual concept pre-training; and finally, multi-task training combined with instruction fine-tuning on mixed corpora to optimize all model components. This method achieves effective scaling from &ldquo;small to large&rdquo; and refinement of data from &ldquo;coarse to fine&rdquo;, reducing costs while enhancing performance.</p> <p><a href="https://mp.weixin.qq.com/s/kId4bxMbbR4kT2Q_HXCpsg" target="_blank">View Original</a></p> The Chinese Academy of Sciences Academicians Forum on the Healthy Development and Empowerment of Large Models/AIGC Held in Nanjing https://cs.nju.edu.cn/lm/en/post/2024-01-16-healthy-development-and-empowerment-of-large-models-aigc/ Tue, 16 Jan 2024 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/post/2024-01-16-healthy-development-and-empowerment-of-large-models-aigc/ <div class="img-full-width"> <p> <figure > <div class="flex justify-center "> <div class="w-100" ><img alt="image" srcset=" /lm/post/2024-01-16-healthy-development-and-empowerment-of-large-models-aigc/image_hu_2328bc696d26508c.jpg 400w, /lm/post/2024-01-16-healthy-development-and-empowerment-of-large-models-aigc/image_hu_4acc13ce51d786c0.jpg 760w, /lm/post/2024-01-16-healthy-development-and-empowerment-of-large-models-aigc/image_hu_ed6c64a9ba8630c8.jpg 1200w" src="https://cs.nju.edu.cn/lm/post/2024-01-16-healthy-development-and-empowerment-of-large-models-aigc/image_hu_2328bc696d26508c.jpg" width="760" height="250" loading="lazy" data-zoomable /></div> </div></figure> </p> </div> <p>The 155th Frontier Forum of the Chinese Academy of Sciences Academicians — &ldquo;The Healthy Development and Empowerment of Large Models/AIGC&rdquo; was held in Nanjing from January 6 to 7, 2024. 
The forum was organized by the Chinese Academy of Sciences Academicians, hosted by the Academic and Publishing Work Committee and the Standing Committee of the Information Technology Science Department of the Chinese Academy of Sciences, co-organized by Nanjing University, Southeast University, and the publisher &ldquo;Science in China&rdquo;, with Academicians Lu Jian and Huang Ru, along with Academician Wang Jian from the Chinese Academy of Engineering, jointly serving as forum chairs.</p> <p>Academician Bao Xinhai, Director of the Academic and Publishing Work Committee, attended the forum along with Zhou Dejin from the Work Bureau of the Chinese Academy of Sciences Academicians, Ren Youqun from the Ministry of Education’s Teacher Work Department, Academician Huang Ru from Southeast University, and Xu Guanghui from the Jiangsu Science and Technology Department, who delivered opening remarks.</p> <p>Six academicians from the Chinese Academy of Sciences—including Bao Xinhai, Lu Jian, Huang Ru, Tan Tieniu, E Weinan, and Xu Zongben—two academicians from the Chinese Academy of Engineering, Gao Wen and Yang Shanlin, and nearly 300 experts from 87 universities, research institutes, and companies (including the Chinese Academy of Sciences, Nanjing University, Southeast University, Hong Kong University of Science and Technology, iFlytek, Huawei, Alibaba, Xiaomi, Midea, and Geely Automobile Research Institute) attended the forum, with more than half being young scientists under 45.</p> <p>The forum comprised two sessions: keynote presentations and special topic reports. In the keynote session, Academician Tan Tieniu discussed trends in generative AI; Academician Gao Wen introduced the Pengcheng Brain pre-trained large model platform and open-source collaborations; Academician Yang Shanlin presented AIGC and its scientific foundations; Academician E Weinan explained the basics of deep learning; Academician Xu Zongben discussed mathematical research on large models; Professor Guo Yike, an Academician of the Royal Academy of Engineering (UK) and Vice-Chancellor of Hong Kong University of Science and Technology, addressed the intrinsic scientific issues of large models; and AI experts from iFlytek, Huawei, and Alibaba showcased applications and innovative practices of large models.</p> <p>In the special topic session, experts presented reports on eight topics: &ldquo;Frontier and Collaborative Innovation in the Development of Large Models/AIGC&rdquo;, &ldquo;Empowering Technological Development with Large Models/AIGC&rdquo;, &ldquo;Boosting the Real Economy with Large Models/AIGC&rdquo;, &ldquo;Facilitating Educational Transformation with Large Models/AIGC&rdquo;, &ldquo;Large Models/AIGC and Intelligent Basic Software&rdquo;, &ldquo;Large Models/AIGC, Computing Infrastructure, and Chip Technology&rdquo;, &ldquo;Safety, Controllability, Privacy Protection, and Low-cost Deployment of Large Models/AIGC&rdquo;, and &ldquo;Governance and Management of Large Models/AIGC&rdquo;. Following the reports, experts engaged in roundtable discussions on these topics.</p> <p>After two days of discussions, the experts explored key technologies and challenges in the development of large models and AI, application scenarios, industrial empowerment, and legal and ethical risks, reaching some preliminary consensus. 
The forum outcomes will be released in the form of briefings and special reports.</p> <p><a href="http://ad.cas.cn/xbdt2022/202401/t20240116_5000694.html" target="_blank">Read Original Article</a></p> Rong Gu https://cs.nju.edu.cn/lm/en/authors/rong-gu/ Tue, 01 Jan 1050 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/authors/rong-gu/ <p>Rong Gu is currently a distinguished researcher in the School of Computer Science at Nanjing University, PhD supervisor, Young Yangtze River Scholar of the Ministry of Education, and DAMO Academy Young Fellow Award winner (2023). He serves as the community chair of the Fluid open source project (CNCF Sandbox project under Linux Foundation), executive committee member of ACM ChinaSys, and executive committee member of multiple CCF technical committees including Distributed Computing and Systems, Big Data Expert Committee, System Software Committee, Database Committee, and Open Source Development Committee.</p> <p><strong>Research Areas:</strong> His main research fields include cloud computing and big data systems, intelligent computing systems. Currently focusing on LLM inference and training systems, cloud-native computing systems, intelligent data management, etc. He has published over 70 papers in top-tier international academic journals and conferences including USENIX ATC, EuroSys, SIGMOD, VLDB, ICDE, KDD, WWW, INFOCOM, VLDBJ, IEEE TPDS, TON, TKDE, published 3 academic monographs, and holds 18 authorized invention patents.</p> <p><strong>Research Projects:</strong> He leads National Natural Science Foundation projects (Youth/General), sub-projects of National Key R&amp;D Program, China Postdoctoral Science Foundation Special Grant, as well as industry collaboration projects with Huawei/Alibaba/Tencent/Ant Group/China Mobile/Sinopec. 
His research results have been adopted by leading enterprises and internationally renowned open source systems including Apache Spark and Alluxio, and he initiated the Fluid open source project under Cloud Native Computing Foundation.</p> <p><strong>Awards and Honors:</strong> He has received Jiangsu Province Science and Technology First Prize, Alibaba DAMO Academy Young Fellow Award (2023), IEEE TCSC Award for Excellence in Scalable Computing (Early Career, 2022, 5 recipients globally per year), CCF Distributed Computing and Systems Committee Young Innovation Pioneer (2 recipients nationally per year), IEEE HPCC Best Paper Award, CCF Big Data Conference Best Application Paper Award, Alibaba Outstanding Academic Collaboration Project Award, Huawei &ldquo;Challenge Problems&rdquo; Spark Award, Tencent Cloud Most Valuable Professional Award, ZTE Industry-University-Research Outstanding Collaboration Project Award, China Open Source Innovation Competition First Prize, Nanjing University May 4th Youth Medal, Jinling Young Scholar, CloudSort Track Champion in International Computer Sorting Benchmark Competition, ACM Nanjing Chapter Academic Rising Star Award, Jiangsu Computer Society Young Scientist Award/Outstanding Science and Technology Worker, China Academy of Information and Communications Technology OSCAR Peak Open Source Figure.</p> Large Language Model Research Group https://cs.nju.edu.cn/lm/en/research/llm/ Fri, 01 Jan 1030 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/research/llm/ <h2 id="large-language-models-a-new-era-of-artificial-intelligence-in-language-understanding-and-generation">Large Language Models: A New Era of Artificial Intelligence in Language Understanding and Generation</h2> <p><strong>Large Language Models (LLMs)</strong> are one of the most groundbreaking technologies in the field of artificial intelligence in recent years. Trained on massive datasets, they are capable of understanding and generating natural language, demonstrating near-human-level performance in tasks such as text generation, translation, and question answering. 
This marks the dawn of a new era in artificial intelligence for language understanding and generation.</p> <h3 id="core-technological-breakthroughs">Core Technological Breakthroughs</h3> <p>The success of large language models is attributed to breakthroughs in the following key technologies:</p> <p><strong>In summary, large language models are profoundly transforming how we interact with machines and bringing unprecedented opportunities to various industries.</strong> With continuous technological advancements and expanding application scenarios, they will continue to drive progress in the field of artificial intelligence for language understanding and generation, creating more value for human society.</p> Multimodal Large Model Research Group https://cs.nju.edu.cn/lm/en/research/multimodal/ Sat, 01 Jan 1020 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/research/multimodal/ <p>Multimodal Large Model Research Group is committed to promoting the research and application of multimodal large models, exploring key technologies in multimodal information fusion, interaction, and reasoning, and driving the application of multimodal large models in visual, speech, text, and other multimodal data, thereby providing technical support for the development of multimodal intelligent technologies.</p> Embodied Decision Large Model Research Group https://cs.nju.edu.cn/lm/en/research/embodied/ Wed, 01 Jan 1017 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/research/embodied/ <p>Embodied Decision Large Model Research Group focuses on cutting-edge research in embodied intelligence, aiming to build a generalizable embodied agent through study in representation learning, policy learning, and hierarchical planning and execution.</p> Large Model Knowledge Enhancement Research Group https://cs.nju.edu.cn/lm/en/research/llm+knowledge/ Sun, 01 Jan 1015 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/research/llm+knowledge/ <p>The LLM + Knowledge research group has been engaged in long-term research on large language model knowledge enhancement, controllable generation, and domain-specific construction.</p> Large Model Learning Algorithms and Platform Research Group https://cs.nju.edu.cn/lm/en/research/learning-algorithm-and-platform/ Mon, 01 Jan 1010 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/research/learning-algorithm-and-platform/ <p>The Large Model Learning Algorithms and Platform Research Group focuses on the construction of systems based on large models, large-scale training/inference deployment, and the application of large models. The group conducts research to address key challenges in efficient training, deployment, and the integration of domain knowledge into large models. In terms of applications, the group has a strong focus on reasoning tasks such as Automated Theorem Proving (ATP). 
In undergraduate education, the group offers courses on large model development, training students to build large models from scratch.</p> Large Model Systems and Platforms Research Group https://cs.nju.edu.cn/lm/en/research/software-and-system/ Mon, 01 Jan 1010 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/research/software-and-system/ <h2 id="large-model-systems-and-platforms-the-core-engine-driving-the-scalable-application-of-artificial-intelligence">Large Model Systems and Platforms: The Core Engine Driving the Scalable Application of Artificial Intelligence</h2> <p>With the rapid development of large model technology, efficiently training, deploying, and managing these massive models has become a critical challenge. <strong>Large model systems and platforms</strong> have emerged to address this need, providing the infrastructure and toolchains necessary for the development and application of large-scale artificial intelligence models. They serve as the core engine driving the scalable application of AI.</p> <h3 id="core-features-and-capabilities">Core Features and Capabilities</h3> <p>Large model systems and platforms typically offer the following core functionalities (a minimal data-parallel training sketch follows the list):</p> <ol> <li><strong>Distributed Training</strong>: supports distributed training for massive datasets and ultra-large models; provides efficient parallel computing and communication optimization, such as data parallelism, model parallelism, and pipeline parallelism. Representative examples: Megatron-LM, DeepSpeed.</li> <li><strong>Efficient Inference</strong>: optimizes inference for large models to reduce latency and resource consumption; supports model compression, quantization, and acceleration techniques. Representative examples: TensorRT, ONNX Runtime.</li> <li><strong>Model Management and Deployment</strong>: offers version control, monitoring, and updating capabilities for models; supports deployment across multiple environments, including cloud, edge, and devices. Representative examples: MLflow, Kubeflow.</li> <li><strong>Developer Tools and Ecosystem</strong>: provides user-friendly APIs, SDKs, and visualization tools; builds open developer communities and ecosystems. Representative examples: Hugging Face, OpenAI API.</li> </ol> 
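<p>As a minimal illustration of the data-parallel pattern listed under Distributed Training, the sketch below uses PyTorch&rsquo;s DistributedDataParallel with a placeholder linear model and synthetic data; it is not drawn from any specific platform above, and the model, data, and hyperparameters are assumptions made only for demonstration.</p> <pre><code class="language-python">
# Minimal data-parallel training sketch (PyTorch DDP).
# Launch with: torchrun --nproc_per_node=NUM_GPUS ddp_sketch.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
    torch.cuda.set_device(local_rank)

    # Placeholder model and synthetic data stand in for a real large model.
    model = DDP(torch.nn.Linear(1024, 1024).cuda(local_rank),
                device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(4096, 1024), torch.randn(4096, 1024))
    sampler = DistributedSampler(dataset)          # shards data across ranks
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                   # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()        # DDP all-reduces gradients
            optimizer.step()
        if dist.get_rank() == 0:
            print(f"epoch {epoch} finished")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
</code></pre> <p>Frameworks such as DeepSpeed and Megatron-LM build on the same launch-and-shard idea, adding model and pipeline parallelism and memory optimizations for models that do not fit on a single device.</p>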
<h3 id="representative-platforms-and-systems">Representative Platforms and Systems</h3> <p>The following are some notable large model systems and platforms (a brief usage sketch follows the list):</p> <ul> <li><strong>Hugging Face</strong>: offers a rich collection of pre-trained models and datasets, supporting model training, fine-tuning, and deployment.</li> <li><strong>OpenAI API</strong>: provides powerful interfaces for large model services, enabling tasks such as text generation and code generation.</li> <li><strong>DeepSpeed</strong>: developed by Microsoft, focuses on distributed training and optimization for large-scale models.</li> <li><strong>Colossal-AI</strong>: delivers efficient solutions for parallel training and inference, supporting ultra-large models.</li> </ul> 
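<p>To show how such platforms are typically consumed, here is a small inference sketch using the Hugging Face <code>transformers</code> pipeline API; the choice of the small public <code>gpt2</code> checkpoint and the prompt are assumptions for illustration only, not recommendations of the research group.</p> <pre><code class="language-python">
# Minimal text-generation sketch with the Hugging Face transformers library.
# Assumes: pip install transformers torch; "gpt2" is used only as a
# lightweight public stand-in for a real large model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

outputs = generator(
    "Large model systems and platforms are",
    max_new_tokens=30,        # cap the length of the generated continuation
    num_return_sequences=1,
)
print(outputs[0]["generated_text"])
</code></pre> <p>Deployment tools such as MLflow or Kubeflow can then package this kind of inference code behind a versioned, monitored service endpoint.</p>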
<h3 id="future-development-trends">Future Development Trends</h3> <p>The future development of large model systems and platforms will focus on the following directions:</p> <ol> <li><strong>Performance Optimization</strong>: further improves training and inference efficiency while reducing resource consumption.</li> <li><strong>Usability Enhancement</strong>: simplifies development processes and lowers the barrier to entry.</li> <li><strong>Ecosystem Expansion</strong>: builds a more open and thriving developer ecosystem.</li> <li><strong>Security and Trustworthiness</strong>: strengthens model security and explainability to ensure reliable applications.</li> </ol> <hr> <p><strong>In summary, large model systems and platforms are the critical enablers for the practical application of large model technology.</strong> With continuous technological advancements and ecosystem improvements, they will provide stronger momentum for the scalable application of artificial intelligence, driving intelligent transformation across industries.</p> Scientific Large Model Research Group https://cs.nju.edu.cn/lm/en/research/science/ Tue, 01 Jan 1005 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/research/science/ <p>The Scientific Large Model Research Group is dedicated to advancing interdisciplinary research in drug development, materials innovation, and energy optimization through state-of-the-art computational simulations.</p> Medical Imaging Large Model Research Group https://cs.nju.edu.cn/lm/en/research/medical-image/ Sun, 01 Jan 1004 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/research/medical-image/ <p>The Nanjing University Medical Imaging Large Model Research Group has been deeply engaged in the field of intelligent medical imaging, focusing on efficient training and low-cost fine-tuning of large models to explore cutting-edge applications in medical image segmentation, auxiliary diagnosis, and precision treatment. Facing the challenge of high annotation costs, the group is dedicated to methods such as sparse supervision, efficient data utilization, and pseudo-label optimization to reduce reliance on large-scale manual annotations while enhancing model generalization and robustness.</p> Edge Large Model System Research Group https://cs.nju.edu.cn/lm/en/research/edge/ Sat, 01 Jan 1003 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/research/edge/ <p>The Edge Large Model System Research Group focuses on frontier optimization techniques for large model systems. Centered on building a high-precision, low-latency, and scalable large model service framework, our research covers operator optimization, adaptive parameter tuning, and multimodal task scheduling.</p> Large Language Model System Research Group https://cs.nju.edu.cn/lm/en/research/system-optim/ Wed, 01 Dec 1002 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/research/system-optim/ <p>The NASA research group of Nanjing University, in collaboration with renowned institutions such as Pengcheng Laboratory and Huawei Technologies Co., Ltd., has conducted comprehensive and in-depth research on key topics including large model training/inference performance and power consumption. The research achievements have not only been published at top conferences in the field of computer architecture, but have also been successfully deployed in relevant enterprises, making positive contributions to bridging the gap between theory and practice.</p> Cloud Large Model System Research Group https://cs.nju.edu.cn/lm/en/research/cloud-system/ Mon, 01 Nov 1002 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/research/cloud-system/ <p>The Cloud Large Model System Research Group is dedicated to exploring system-level performance optimization technologies for large model training, inference, and deployment in cloud environments. 
The team&rsquo;s key research directions include: storage management optimization for cloud-based large models, efficient distribution and loading mechanisms, training process optimization strategies, and inference performance optimization technologies.</p> Controllable Generation Research Group https://cs.nju.edu.cn/lm/en/research/controllable-generation/ Fri, 01 Oct 1002 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/research/controllable-generation/ <p>The Controllable Generation Research Group has long been engaged in research on the generation capabilities of large language models and multimodal models. Currently, the group focuses on controllable generation techniques for large models to improve their outputs along specific attributes. Its research centers on methods such as intervention and guidance of large models, conditional control in multimodal large models, and localization of neurons or activations within large models to steer their generation.</p> Chen Tian https://cs.nju.edu.cn/lm/en/authors/chen-tian/ Mon, 01 Jan 0001 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/authors/chen-tian/ Chengying Huan https://cs.nju.edu.cn/lm/en/authors/chengying-huan/ Mon, 01 Jan 0001 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/authors/chengying-huan/ Guihai Chen https://cs.nju.edu.cn/lm/en/authors/guihai-chen/ Mon, 01 Jan 0001 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/authors/guihai-chen/ Haipeng Dai https://cs.nju.edu.cn/lm/en/authors/haipeng-dai/ Mon, 01 Jan 0001 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/authors/haipeng-dai/ Jingwei Xu https://cs.nju.edu.cn/lm/en/authors/jingwei-xu/ Mon, 01 Jan 0001 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/authors/jingwei-xu/ Limin Wang https://cs.nju.edu.cn/lm/en/authors/limin-wang/ Mon, 01 Jan 0001 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/authors/limin-wang/ Meng Li https://cs.nju.edu.cn/lm/en/authors/meng-li/ Mon, 01 Jan 0001 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/authors/meng-li/ Qing Gu https://cs.nju.edu.cn/lm/en/authors/qing-gu/ Mon, 01 Jan 0001 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/authors/qing-gu/ Shujian Huang https://cs.nju.edu.cn/lm/en/authors/shujian-huang/ Mon, 01 Jan 0001 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/authors/shujian-huang/ Tianfan Fu https://cs.nju.edu.cn/lm/en/authors/tianfan-fu/ Mon, 01 Jan 0001 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/authors/tianfan-fu/ Tong Lu https://cs.nju.edu.cn/lm/en/authors/tong-lu/ Mon, 01 Jan 0001 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/authors/tong-lu/ Wei Hu https://cs.nju.edu.cn/lm/en/authors/wei-hu/ Mon, 01 Jan 0001 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/authors/wei-hu/ Wujun Li https://cs.nju.edu.cn/lm/en/authors/wujun-li/ Mon, 01 Jan 0001 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/authors/wujun-li/ Yinghuan Shi https://cs.nju.edu.cn/lm/en/authors/yinghuan-shi/ Mon, 01 Jan 0001 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/authors/yinghuan-shi/ Yuan Yao https://cs.nju.edu.cn/lm/en/authors/yuan-yao/ Mon, 01 Jan 0001 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/authors/yuan-yao/ Zequn Sun https://cs.nju.edu.cn/lm/en/authors/zequn-sun/ Mon, 01 Jan 0001 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/authors/zequn-sun/ Zhibin Wang https://cs.nju.edu.cn/lm/en/authors/zhibin-wang/ Mon, 01 Jan 0001 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/authors/zhibin-wang/ Zhiwei Jiang https://cs.nju.edu.cn/lm/en/authors/zhiwei-jiang/ Mon, 01 Jan 0001 00:00:00 +0000 https://cs.nju.edu.cn/lm/en/authors/zhiwei-jiang/