Recent Research Achievement Published in Nature Methods: the Hundred-Billion-Parameter Protein Language Model xTrimoPGLM, Co-developed by Assistant Professor Wang Boyan of the Reasoning and Learning Group, School of Artificial Intelligence, Nanjing University

The research on xTrimoPGLM, a protein language pre-trained model with hundreds of billions of parameters co-developed by Assistant Professor Wang Boyan from the Reasoning and Learning Group at the School of Artificial Intelligence, Nanjing University, was recently published in the journal Nature Methods.
xTrimoPGLM is a unified pre-training framework and foundation model designed for a wide range of protein-related tasks, including understanding, reasoning, generation, and discovery. Unlike existing protein language models that employ only an encoder (e.g., ESM) or only a causal decoder (e.g., ProGen), xTrimoPGLM adopts the General Language Model (GLM) as its backbone architecture, combining bidirectional attention with an autoregressive objective. To strengthen the model's comprehension ability, the research team applies a Masked Language Model (MLM) objective in the bidirectional prefix region, while the GLM objective optimizes its generative capacity.
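The sketch below illustrates this two-objective setup: a prefix-style attention mask that is bidirectional inside the prefix and causal afterwards, plus a loss that sums an MLM term over masked prefix positions with a GLM term over span-infilling positions. This is a minimal toy reconstruction; the shapes, vocabulary size, and equal loss weighting are illustrative assumptions, not the published configuration.

```python
import torch
import torch.nn.functional as F

def prefix_attention_mask(prefix_len: int, total_len: int) -> torch.Tensor:
    """Bidirectional attention inside the prefix, causal attention after it.

    Returns a [total_len, total_len] boolean mask where True = may attend.
    """
    mask = torch.tril(torch.ones(total_len, total_len, dtype=torch.bool))
    mask[:prefix_len, :prefix_len] = True  # prefix tokens fully see each other
    return mask

def unified_loss(logits, targets, mlm_positions, glm_positions):
    """Sum an MLM term (masked prefix tokens) and a GLM term (infilled spans).

    logits: [seq, vocab] model outputs; targets: [seq] residue ids.
    The equal weighting of the two terms is an illustrative assumption.
    """
    mlm = F.cross_entropy(logits[mlm_positions], targets[mlm_positions])
    glm = F.cross_entropy(logits[glm_positions], targets[glm_positions])
    return mlm + glm

# Toy example: a 10-token sequence with a 6-token bidirectional prefix.
seq_len, prefix_len, vocab = 10, 6, 33
attn_mask = prefix_attention_mask(prefix_len, seq_len)   # would feed a transformer
logits = torch.randn(seq_len, vocab)                     # stand-in model output
targets = torch.randint(0, vocab, (seq_len,))
mlm_pos = torch.zeros(seq_len, dtype=torch.bool); mlm_pos[2] = True
glm_pos = torch.zeros(seq_len, dtype=torch.bool); glm_pos[prefix_len:] = True
print(unified_loss(logits, targets, mlm_pos, glm_pos))
```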
Furthermore, the team compiled a large-scale pre-training dataset comprising approximately 940 million unique protein sequences, totaling around 200 billion residues. The model was trained on a cluster of 96 NVIDIA DGX machines (each equipped with 8×A100 GPUs), yielding a 100-billion-parameter model trained on over one trillion tokens and establishing it as the largest and most comprehensive Protein Language Model (PLM) to date.
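As a quick sanity check of the quoted scale, training on over one trillion tokens drawn from a corpus of roughly 200 billion residues corresponds to about five passes over the data:

```python
# Back-of-the-envelope check of the training scale quoted above.
residues_in_corpus = 200e9   # ~200 billion residues in the corpus
tokens_trained_on = 1e12     # over one trillion training tokens
print(f"approx. passes over the corpus: {tokens_trained_on / residues_in_corpus:.1f}")
# -> approx. passes over the corpus: 5.0
```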
As a foundational PLM, xTrimoPGLM demonstrates outstanding performance on protein understanding tasks. Extensive and rigorous experiments using linear probing and low-rank adaptation (LoRA) fine-tuning show that xTrimoPGLM significantly surpasses previous state-of-the-art methods on 15 of 18 diverse tasks spanning protein structure, interactions, function, and developability (Fig. 1A). The research team also showed that xTrimoPGLM achieves lower perplexity (PPL) than reference models on two out-of-distribution (OOD) protein sets (Fig. 2B). These results are consistent with scaling laws: larger models generally yield better performance (Fig. 2C and Fig. 1B).
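For context, linear probing trains only a small classification head on top of the frozen language model. Below is a minimal sketch of that protocol; the stand-in backbone, pooling, and head choices are assumptions for illustration, not the evaluation code used in the paper.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Freeze the backbone; train only a linear head on pooled embeddings."""
    def __init__(self, plm: nn.Module, hidden_dim: int, num_classes: int):
        super().__init__()
        self.plm = plm
        for p in self.plm.parameters():
            p.requires_grad = False          # backbone stays frozen
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):
        with torch.no_grad():
            h = self.plm(tokens)             # [batch, seq, hidden]
        return self.head(h.mean(dim=1))      # mean-pool residues, then classify

plm = nn.Embedding(33, 128)                  # toy stand-in for the real PLM
probe = LinearProbe(plm, hidden_dim=128, num_classes=2)
out = probe(torch.randint(0, 33, (4, 50)))   # -> logits of shape [4, 2]
```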

Figure 1. Performance of xTrimoPGLM on protein understanding tasks

Figure 2. The framework of xTrimoPGLM and its training performance
Based on xTrimoPGLM, the research team developed a high-performance protein structure prediction tool. Inspired by the ESMFold method, this tool integrates protein folding information with the protein language model to optimize structure prediction training. The team named it xTrimoPGLM-Fold (abbreviated as xT-Fold). It achieves strong TM-scores on the CAMEO and CASP15 protein benchmark datasets. Furthermore, xT-Fold was optimized with 4-bit quantization, improving both its performance and efficiency and establishing it as an efficient, foundational structure prediction technique within the PLM framework. On the CASP15 dataset, xT-Fold achieves a 5% higher TM-score than ESMFold while also delivering faster inference.
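For readers unfamiliar with 4-bit quantization, the snippet below shows one common way to load transformer weights in 4 bits using the Hugging Face `transformers` and `bitsandbytes` libraries. The checkpoint name is a placeholder, and this is not the released xT-Fold pipeline; it only illustrates the general NF4 weight-quantization technique.

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4 bits
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16  # dequantize to bf16 for matmuls
)
model = AutoModel.from_pretrained(
    "example/xt-fold-checkpoint",          # hypothetical model id
    quantization_config=quant_config,
    device_map="auto",
)
```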
Additionally, to address two challenges faced by existing pre-trained models, namely the difficulty of adapting to complex, structured function prediction and the underutilization of valuable multimodal information due to insufficient fusion, Assistant Professor Wang Boyan and his team, guided by cognitive-structure principles, proposed ProtGO as a solution.
ProtGO is a general-purpose protein function prediction model. It effectively leverages knowledge from different modalities. Through cognitive integration, it combines information embedded in protein sequences, descriptive text, biological taxonomy, and label graph structures related to Gene Ontology (GO), uncovering shared functional and evolutionary associations to form a unified knowledge representation (as shown in Figure 3). It stands as one of the most effective protein-oriented GO function prediction models available today.
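The following sketch conveys the flavor of such multimodal fusion: each modality embedding (sequence, text, taxonomy, GO label graph) is projected into a shared space and fused by attention into one unified representation. The dimensions, module names, and attention-based fusion here are illustrative assumptions, not the published ProtGO architecture.

```python
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    """Project each modality to a shared space, then fuse with attention."""
    def __init__(self, dims: dict, shared_dim: int = 256):
        super().__init__()
        # one projection per modality (sequence, text, taxonomy, GO graph)
        self.proj = nn.ModuleDict({k: nn.Linear(d, shared_dim) for k, d in dims.items()})
        self.attn = nn.MultiheadAttention(shared_dim, num_heads=4, batch_first=True)

    def forward(self, feats: dict):
        # stack projected modality tokens: [batch, n_modalities, shared_dim]
        tokens = torch.stack([self.proj[k](v) for k, v in feats.items()], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)  # modalities exchange info
        return fused.mean(dim=1)                      # unified representation

fusion = ModalityFusion({"sequence": 1280, "text": 768, "taxonomy": 64, "go_graph": 128})
out = fusion({
    "sequence": torch.randn(2, 1280),
    "text": torch.randn(2, 768),
    "taxonomy": torch.randn(2, 64),
    "go_graph": torch.randn(2, 128),
})
print(out.shape)  # torch.Size([2, 256])
```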

Figure 3. Framework of ProtGO: A General Multimodal Protein Function Prediction Model
By simulating the part-whole hierarchy of cognitive structure, ProtGO achieves an adaptable and lightweight architecture through the gradual, thorough integration of the other modalities. This enables it to work with Protein Language Models (PLMs) and biological language models of varying parameter counts and architectures, helping them adapt rapidly to Gene Ontology (GO)-based protein function prediction tasks.
Extensive experiments demonstrate that PLMs enhanced by ProtGO achieve an 8% to 27% improvement in the maximum F1 measure (Fmax) over their original counterparts. ProtGO thus boosts the accuracy of a variety of PLMs on GO prediction tasks, significantly surpassing classic GO prediction models (as shown in Figure 4). This method has been published in the journal Bioinformatics.
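The Fmax metric reported above is the protein-centric maximum F-measure used in CAFA-style evaluation: F1 is computed at each decision threshold and the best value is kept. A simplified sketch follows (without the information-content weighting used in some CAFA variants):

```python
import numpy as np

def fmax(y_true: np.ndarray, y_score: np.ndarray, steps: int = 101) -> float:
    """Protein-centric maximum F-measure.

    y_true, y_score: [proteins, GO terms]; y_true is binary.
    Precision averages only over proteins with at least one predicted term.
    """
    best = 0.0
    for t in np.linspace(0.0, 1.0, steps):
        pred = y_score >= t
        has_pred = pred.any(axis=1)              # proteins with >= 1 call
        if not has_pred.any():
            continue
        tp = (pred & (y_true > 0)).sum(axis=1)
        prec = (tp[has_pred] / pred[has_pred].sum(axis=1)).mean()
        rec = (tp / np.maximum(y_true.sum(axis=1), 1)).mean()
        if prec + rec > 0:
            best = max(best, 2 * prec * rec / (prec + rec))
    return best

# toy usage
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_score = np.array([[0.9, 0.2, 0.6], [0.1, 0.8, 0.3]])
print(fmax(y_true, y_score))
```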

Figure 4. Results on the Gene Ontology prediction task across the Fmax, F1, AUPR, and MCC metrics
