Forum Review | Our institute successfully hosted the China-Southeast Asia Kerry Young Scientist Forum & Guanlan Large Model Summit (I)

Published by: Tang Jingling    Date: 2025-11-11    Views: 30

On November 7, 2025, the China-Southeast Asia Kerry Young Scientist Forum and Guanlan Large Model Summit was successfully held at the Suzhou Campus of Nanjing University. Young scholars from around the world, representing institutions such as the National University of Singapore, Nanyang Technological University, the University of Oxford, the Hong Kong University of Science and Technology, the University of Macau, and Microsoft Research Asia, engaged in in-depth discussions on the theoretical frontiers, engineering practices, and industrial applications of large models, and together explored pathways for regional collaborative innovation and high-quality development.

The forum was attended and addressed by Ms. Jiang Tian, Assistant President of Nanjing University and Executive Deputy Secretary of the CPC Working Committee of its Suzhou Campus. It was chaired by faculty members from the School of Intelligent Science and Technology at Nanjing University, including Associate Professor Ji Wei and Assistant Professors Liu Jiaheng, Wang Boyan, and Zhang Zhen.



Jiang Tian extended a warm welcome, on behalf of the Suzhou Campus of Nanjing University, to the domestic and international experts and scholars. She noted that large model technology, as the core engine and fundamental infrastructure of next-generation AI, is reshaping national strategies, scientific research, industrial applications, and everyday life, and she highlighted Nanjing University's commitment to the deep integration of intelligent science with other disciplines through cutting-edge exploration in areas such as perceptual intelligence, embodied intelligence, and cognitive decision-making.

She stated that this forum, leveraging the State Key Laboratory for Novel Software Technology, aims to build an integrated industry-university-research-application ecosystem. Building on generative AI as represented by large models, it seeks to drive profound changes in fields such as natural language processing, knowledge acquisition, and cross-modal understanding and generation, helping to establish the Suzhou Campus as a hub for emerging engineering disciplines.

She emphasized that the forum has gathered leading scholars and young scientists from China, Singapore, the UK, Australia, and other countries and regions to jointly discuss the future trajectory of large models and multimodal intelligence, focusing on research frontiers, trustworthiness, security, and engineering implementation, thereby connecting young scholars from China and Southeast Asia and fostering cross-regional collaborative innovation and high-quality development.

She expressed the hope that this forum would serve as a starting point for international exchange and cooperation among young scholars from China and Southeast Asia, propelling in-depth dialogue and practical collaboration in artificial intelligence between China, Southeast Asia, and the wider world, and building together a more open, inclusive, and intelligent future.




Expert Reports


Bingsheng He, Professor (National University of Singapore): Towards Large Reasoning Models as Judge


Bingsheng He addressed the emerging trend of the LLM as a reviewer, pointing to 2025 as a potential tipping point for agentic AI. He described the evolution of models from mere question answerers into judges and reviewers in scenarios such as code evaluation and academic peer review.

He systematically catalogued the biases that influence large models' decision-making, including the Bandwagon, Authority, Position, and Distraction effects. Through bias-injection benchmarks, his team discovered that even chain-of-thought reasoning models exhibit spurious-reasoning bias when confronted with false evidence, leading to a significant drop in robustness.

To tackle these issues of bias and robustness, his team's research includes mitigation techniques such as bias self-reflection prompting and reinforcement learning strategies with group-wise rewards, which have substantially enhanced robustness and efficiency in complex reasoning scenarios.

Looking ahead, he advocated hardware-software co-design to achieve a million-fold improvement in energy efficiency and called for building a more transparent and sustainable academic ecosystem to advance responsible AI review and open collaboration.
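As a concrete illustration of how one such bias can be probed, the sketch below measures Position bias by asking a judge model to compare the same pair of answers in both orders; the `judge` callable and its return convention are illustrative assumptions, not the team's actual benchmark.

```python
from collections import Counter

def probe_position_bias(judge, question, answer_a, answer_b, n_trials=20):
    """Estimate an LLM judge's position bias by swapping candidate order.

    `judge(question, first, second)` is a hypothetical callable returning
    "first" or "second"; wrap any real judge API to match this signature.
    """
    votes = Counter()
    for _ in range(n_trials):
        # Original order: a verdict of "first" is a vote for answer A.
        votes["A" if judge(question, answer_a, answer_b) == "first" else "B"] += 1
        # Swapped order: a verdict of "first" is now a vote for answer B.
        votes["B" if judge(question, answer_b, answer_a) == "first" else "A"] += 1
    total = sum(votes.values())
    # An unbiased judge gives roughly the same verdict in both orders;
    # a large gap after order-swapping signals Position bias.
    return {k: v / total for k, v in votes.items()}
```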


Wei Lu, Professor (Nanyang Technological University): Small Language Models: From Pre-training to Post-training


Wei Lu presented a vision centered on an affordable, reproducible, and scalable small-model approach. He shared his team's experience pre-training the 1.1-billion-parameter TinyLlama from scratch on a limited budget (S$50,000), with its training configuration, loss curves, and open license all made public, resulting in monthly community downloads in the millions.

On post-training, he argued that a structured reasoning process is superior to merely lengthening the chain of thought (CoT). His method uses tabular, multi-dimensional prompts to guide the model through a step-by-step, verifiable reasoning process. Empirical evidence shows that for small models, sufficient high-quality instruction tuning is a prerequisite for successful reinforcement learning; without it, models easily fall into producing verbose, formulaic responses.

He also discussed the lower limits of model scale and the feasibility of distillation and compression for small models, advocating the use of small models to compete with large ones and exploring the minimal model size required for competent reasoning, in pursuit of solutions that are superior in energy efficiency and reproducibility.
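The tabular prompting idea lends itself to a simple illustration. The sketch below shows what a structured, step-wise prompt of this kind might look like; the column names and layout are assumptions for illustration, not the team's published template.

```python
# A minimal sketch of a tabular, verifiable reasoning prompt. The exact
# columns and wording are illustrative assumptions.
STRUCTURED_PROMPT = """\
Solve the problem by filling in the table row by row.
Each row must be independently verifiable before you move on.

| Step | Facts used | Operation | Intermediate result |
|------|------------|-----------|---------------------|

Problem: {problem}
Final answer (only after the table is complete):"""

def build_prompt(problem: str) -> str:
    return STRUCTURED_PROMPT.format(problem=problem)

print(build_prompt("A train travels 120 km in 1.5 h. What is its average speed?"))
```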



Jindong Gu, Researcher (University of Oxford / Google): Responsible Visual Generative AI



Jindong Gu addressed safety and reliability concerns in large-scale diffusion models, proposing a comprehensive detection-constraint-tracing framework. The approach applies scalable safety alignment and content filtering before and during generation to suppress harmful content. By learning interpretable semantic directions in the latent space, it enables fine-grained control to suppress or enhance sensitive attributes such as violence and discrimination while avoiding entanglement with unrelated features. In addition, techniques such as classifier guidance and representation reconstruction allow generated content to be traced to its source without access to the model weights.

The presentation also examined methods and countermeasures for multimodal pragmatic jailbreaks, as well as risks to the broader model ecosystem such as text watermarking. Together, these works form a coordinated technical strategy for the healthy development of the entire AIGC ecosystem, providing a systematic path toward a safer, more trustworthy, and responsible visual content generation environment.
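To make the latent-direction idea concrete, the sketch below rescales a flattened diffusion latent along a single learned direction; it assumes the direction vector was learned offline (e.g., from contrastive pairs of generations with and without the sensitive attribute) and is a simplification, not the presenter's exact method.

```python
import torch

def edit_along_direction(latent: torch.Tensor,
                         direction: torch.Tensor,
                         strength: float) -> torch.Tensor:
    """Rescale a latent's component along one semantic direction.

    `latent` and `direction` are 1-D (flattened) tensors; `direction` is
    a hypothetical stand-in for a learned attribute axis. strength=1
    leaves the latent unchanged, strength=0 removes the attribute
    component, and strength>1 amplifies it, while features orthogonal
    to the direction stay untouched.
    """
    unit = direction / direction.norm()
    coeff = torch.dot(latent, unit)          # current attribute strength
    return latent + (strength - 1.0) * coeff * unit

z = torch.randn(4096)                        # toy flattened latent
d = torch.randn(4096)                        # stand-in for a learned direction
z_safe = edit_along_direction(z, d, strength=0.0)  # suppress the attribute
```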


Long Chen, Assistant Professor (The Hong Kong University of Science and Technology): The Interplay of Understanding and Generation in Multimodal AI




Long Chen highlighted the significant gap that remains between high-level semantic understanding and low-level spatial reasoning in current unified multimodal models. His team has made progress on three fronts. First, by attaching a plug-and-play segmentation head to a frozen large multimodal model, they leverage spatial cues from attention maps to add pixel-level understanding at minimal computational overhead. Second, for text-guided image editing, they proposed FlowCycle, which uses flow matching and cycle-consistency constraints to preserve background details while employing source-consistency editing to keep the main subject in harmony with its environment. Third, they revealed the critical role of noise in diffusion-model classification, demonstrating that optimized noise matching can significantly improve discriminative robustness.

The team also introduced GIR-Bench, a reasoning-centric benchmark designed to systematically evaluate the alignment between understanding and generation capabilities in unified models. Its customized evaluation pipeline avoids the subjective bias of using large models as judges, and it shows clearly that a significant gap between these two capabilities persists even in the most advanced current models.
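The attention-as-spatial-cue idea can be sketched in a few lines: below, a patch-level cross-attention map from a frozen model is normalized, upsampled, and thresholded into a coarse mask. The real plug-and-play head is a trained module, so this is only a schematic of the underlying intuition.

```python
import torch
import torch.nn.functional as F

def attention_to_mask(attn: torch.Tensor,
                      image_size: tuple[int, int],
                      threshold: float = 0.5) -> torch.Tensor:
    """Turn a patch-level attention map (h, w) into a coarse pixel mask.

    `attn` holds the attention from one text token to the image patches
    of a frozen multimodal model; higher values mark relevant regions.
    """
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)
    # Upsample the patch grid to full image resolution.
    mask = F.interpolate(attn[None, None], size=image_size,
                         mode="bilinear", align_corners=False)[0, 0]
    return (mask > threshold).float()

mask = attention_to_mask(torch.rand(24, 24), image_size=(384, 384))
```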



Yang Deng, Assistant Professor (Singapore Management University): Towards Human-centered Proactive Conversational Agents



Addressing the limitations of current conversational AI, such as passivity and lack of controllability, Yang Deng proposed a framework for building human-centered proactive conversational agents. The framework rests on three pillars: Intelligence (proactive planning), Adaptability (personalized interaction), and Robustness (trustworthiness and safety).

At the Intelligence level, he modeled multi-turn dialogue as a Markov Decision Process (MDP) and designed a "small model plans, large model executes" architecture. After initialization via supervised learning, the system uses an LLM as both reward model and user simulator, creating a reinforcement learning loop for continuous policy optimization, as sketched below.

For Adaptability, to personalize responses to different users, the team incorporated psychological frameworks such as the Big Five personality traits and decision-making styles, constructing diverse virtual users for multi-agent simulation and thereby improving the model's generalization and the reliability of its evaluation.

For Robustness, tackling overconfidence and hallucination, the research focused on defining the knowledge boundary, categorizing it into four distinct types. By combining methods such as refusal fine-tuning and self-reflection, the model learns to recognize the limits of its own knowledge.

Experiments confirm that agents built within this framework maintain personality consistency in generated dialogues; moreover, when endowed with a personality, their conversational strategies show greater empathy and explorativeness.
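A minimal sketch of the planner-executor loop, with the dialogue history as the MDP state, follows; `small_planner`, `large_executor`, and `user_simulator` are hypothetical callables standing in for the models described in the talk.

```python
def run_dialogue(small_planner, large_executor, user_simulator,
                 opening: str, max_turns: int = 8) -> list[dict]:
    """Roll out one simulated conversation under the planner-executor split."""
    history = [{"role": "user", "content": opening}]       # MDP state
    for _ in range(max_turns):
        strategy = small_planner(history)                  # policy picks an action
        reply = large_executor(history, strategy)          # LLM realizes the action
        history.append({"role": "assistant", "content": reply,
                        "strategy": strategy})
        user_turn = user_simulator(history)                # environment transition
        if user_turn is None:                              # simulated user ends chat
            break
        history.append({"role": "user", "content": user_turn})
    return history                                         # trajectory for RL updates
```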



Zhedong Zheng, Assistant Professor (University of Macau): Cognitive Biases in Large Multimodal Models: Unveiling Challenges and Solutions


Zhedong Zheng drew on classic phenomena from psychology and neuroscience to systematically analyze cognitive biases in multimodal models. In scenarios resembling the Stroop Effect, models are easily misled by overlaid text; the Weber-Fechner Law is mirrored in recognition thresholds that depend on familiarity, leading to weaker cross-cultural identification; and insights from split-brain experiments suggest that models may exhibit structural biases analogous to those of the two cerebral hemispheres.

To enhance reliability, he proposed quantifying uncertainty by repeatedly sampling the model's responses to the same question and then clustering the answers or computing their entropy: greater dispersion indicates higher uncertainty and can trigger a cautionary mechanism. The metric requires no access to the model's internal weights, is easy to integrate, and can serve as a black-box measure for tasks such as re-ranking in text-image retrieval (e.g., through term masking and combinatorial comparison).

He further emphasized that a closed loop of detection, measurement, and intervention can be established through data augmentation, Mixture of Experts (MoE), and adversarial questioning, thereby improving model robustness and interpretability.
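Because this uncertainty measure needs only repeated black-box queries, it is easy to reproduce. The sketch below computes the entropy of sampled answers, assuming a hypothetical `sample_fn` that wraps any chat API and returns one normalized (or cluster-assigned) answer per call.

```python
import math
from collections import Counter

def blackbox_uncertainty(sample_fn, question: str, n_samples: int = 10) -> float:
    """Entropy (in bits) of repeated answers to the same question.

    0.0 means the model always agrees with itself; larger values mean
    more dispersion, which can trigger a cautionary mechanism or serve
    as a re-ranking signal in text-image retrieval.
    """
    counts = Counter(sample_fn(question) for _ in range(n_samples))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```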


Kaitao Song, Researcher (Microsoft Research Asia): The Beauty of Model Orchestration



Kaitao Song presented a paradigm shift from enhancing single models to constructing flexible model systems, outlining two key approaches.

The first, Chain-of-Model, employs a large language model as a controller that parses user intent and plans tasks, retrieves and selects suitable components from a model/tool library, executes them on local or remote endpoints, and finally aggregates and compares the results, forming a traceable multi-step dependency graph. This orchestration paradigm aligns with the published HuggingGPT framework and emphasizes open connectivity with community resources.

The second, Channel Model, divides the representation dimensions into multiple channel chains. Within the attention/Transformer mechanism, multi-head attention is allocated per chain to keep information flow closed, supporting independent chain-level training, on-demand activation, and progressive expansion. Intra-chain normalization and chain-specific computation objectives prevent cross-chain information confusion, so computational pathways of varying scales can be derived from the same model, enabling parameter reuse and elastic inference. The approach has been prototyped and validated in both linear-attention and Transformer architectures, demonstrating the potential to achieve competitive results with fewer parameters or higher dimensions under equivalent configurations.
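The controller loop behind this HuggingGPT-style orchestration can be sketched as follows; `llm_plan` and the registry contents are illustrative assumptions rather than the published implementation.

```python
def orchestrate(llm_plan, registry: dict, request: str) -> list[dict]:
    """Plan with an LLM, execute with specialist models, keep a trace."""
    plan = llm_plan(request)           # e.g. [{"task": "caption", "args": {...}}, ...]
    results = []
    for step in plan:                  # steps form a simple dependency chain
        tool = registry[step["task"]]  # select a suitable model/tool
        output = tool(**step["args"], context=results)   # local or remote endpoint
        results.append({"task": step["task"], "output": output})
    return results                     # traceable record of every step
```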


Hao Fei, Senior Researcher (National University of Singapore): Toward Unified and Advanced Multimodal Generalist



Hao Fei proposed a dual-path framework for developing multimodal large models: Unification, which aims to cover more modalities and task paradigms (understanding, generation, editing, etc.) within a single architecture; and Evolution, which pursues continuous breakthroughs toward higher dimensions, longer sequences, and more complex reasoning. Representative works include NExT-GPT, a unified interface for any-to-any modality conversion; Vitron, which integrates pixel-level understanding, generation, and editing within a single framework; and JavisDiT, which employs a Diffusion Transformer for synchronized audio-visual generation. Methodologically, he introduced semantically equivalent tokenization to strengthen the underlying alignment between vision and language.

To evaluate generality systematically, he proposed the General-Level/General-Bench framework. Centered on task breadth plus synergistic capability, it defines a five-level paradigm ranging from specialist models to full cross-modal synergy, using metrics such as surpassing specialist models and the harmonic mean to quantify synergistic gains. Large-scale evaluations indicate that no current model has reached the highest level.

Looking forward, he argued the field should pursue both breadth and depth: expanding the coverage of tasks and modalities while developing more effective unified architectures and cross-paradigm synergy mechanisms.
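The role of the harmonic mean in this evaluation is easy to see in isolation: it punishes imbalance across capabilities, so a model that excels at generation but fails at understanding scores far below one that is merely decent at both. The snippet below illustrates this property only; the actual General-Bench scoring pipeline is more involved.

```python
from statistics import harmonic_mean

def synergy_score(task_scores: dict[str, float]) -> float:
    """Harmonic mean of per-task scores; rewards balanced, synergistic models."""
    return harmonic_mean(task_scores.values())

print(synergy_score({"understanding": 0.90, "generation": 0.90}))  # 0.90
print(synergy_score({"understanding": 0.98, "generation": 0.30}))  # ~0.46
```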

The attending scholars unanimously agreed that the development of large models is transitioning from "what can be done" to performing tasks more reliably, compliantly, and efficiently, with unified paradigms, trustworthiness and safety, and software-hardware co-design as the core directions of future exploration.




