The Softmax function is a core component of deep learning and machine learning, widely used in multi-class classification, attention mechanisms, recommendation systems, and reinforcement learning. Its value lies in amplifying input differences through exponentiation and normalizing the outputs so that they form a valid probability distribution, enabling fine-grained classification and decision-making. However, hardware implementations of Softmax face multiple challenges, including computational complexity and memory-access bottlenecks, numerical-stability issues, and obstacles to parallelization and synchronization. These challenges reduce the deployment efficiency of network models and increase resource consumption, limiting their practical value.
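For reference, the Softmax function and the standard max-subtraction trick used to avoid overflow can be sketched as follows (a minimal NumPy illustration of the numerical-stability issue mentioned above, not tied to any of the hardware designs discussed below):

```python
import numpy as np

def softmax(x):
    """Numerically stable Softmax over a 1-D score vector.

    Subtracting the maximum before exponentiation leaves the result
    unchanged (the common factor cancels in the ratio) but keeps every
    exponent <= 0, preventing overflow in exp(x_i) for large inputs.
    """
    z = x - np.max(x)          # shift so the largest exponent is 0
    e = np.exp(z)              # amplify differences exponentially
    return e / np.sum(e)       # normalize to a probability distribution

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))         # ~[0.659 0.242 0.099], sums to 1
```

The max search, the exponentials, and the global sum each introduce the data dependencies and nonlinear operators that make Softmax costly in hardware.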
To address these issues, the team led by Associate Professor Wang Yuxuan at the School of Intelligence Science and Technology, Nanjing University, studied the challenges the Softmax function encounters in hardware implementation and proposed corresponding methods. The related work was published in IEEE Transactions on Circuits and Systems I: Regular Papers (TCAS-I).
Work One: A tiny and efficient Softmax architecture with parallelism and sparsity adaptability
Many hardware acceleration architectures for Transformers adopt highly parallel computing and sparsity-aware processing. However, existing Softmax hardware architectures do not adapt efficiently to these computing paradigms, so Softmax becomes the main memory-access and computing bottleneck in most existing accelerators. To address this, the research team proposed TEA-SPS, an efficient Softmax hardware architecture with parallelism and sparsity adaptability. The architecture first applies CPSS, the reconfigurable parallel Softmax algorithm with sparse masks proposed in the paper, to integrate parallelism and sparsity, and then applies SPIE, the dedicated segmented-information extractor also proposed in the paper, to efficiently optimize the nonlinear operators in the algorithm. The resulting sparse Softmax implementation adapts efficiently to Transformer accelerators with different throughput requirements while offering high energy efficiency and broad compatibility. This work has been accepted by IEEE TCAS-I (TEA-SPS: A Tiny and Efficient Architecture for Softmax With Parallelism and Sparsity Adaptability, IEEE Transactions on Circuits and Systems I: Regular Papers, accepted for publication).
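The details of CPSS and SPIE are given in the paper; purely as intuition for why sparse masks matter, a masked Softmax of the kind sparsity-aware accelerators need can be sketched as follows (a hypothetical NumPy illustration, not the authors' CPSS algorithm):

```python
import numpy as np

def masked_softmax(x, mask):
    """Softmax restricted to positions where mask is True.

    Pruned (masked-out) scores are excluded from both the max search
    and the normalizing sum, so no probability mass is wasted on
    entries a sparsity-aware accelerator has already skipped.
    Illustrative only -- not the CPSS algorithm from the paper.
    """
    z = np.where(mask, x, -np.inf)       # drop pruned positions
    z = z - np.max(z)                    # stable shift over kept entries
    e = np.where(mask, np.exp(z), 0.0)   # pruned entries contribute 0
    return e / np.sum(e)

x = np.array([3.0, 1.0, -2.0, 0.5], dtype=np.float32)
mask = np.array([True, True, False, True])
print(masked_softmax(x, mask))           # pruned slot gets probability 0
```

In a parallel accelerator, each lane would process one chunk of the masked vector, which is why the Softmax unit must match both the throughput and the sparsity pattern of the surrounding architecture.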

Figure 1: Schematic diagram of the CPSS algorithm

Figure 2: Hardware architecture of TEA-SPS
Paper link: https://ieeexplore.ieee.org/document/11184336
Work Two: A high-precision Softmax approximation method and its efficient hardware implementation
Some existing works approximate the Softmax function with methods such as Base-2 Softmax to improve hardware efficiency. However, frequent use of Softmax accumulates these approximation errors, seriously degrading the model's inference accuracy. Such methods also typically require retraining the original network, which not only consumes additional computing resources but also risks insufficient compatibility, overfitting, or underfitting. To address this, the research team proposed the MBS approximation algorithm, which approximates the conventional Softmax with a mixed exponential function using bases 2 and 4, combining hardware friendliness with high-precision computation. Compared with the Base-2 Softmax method, MBS can be applied directly to pre-trained Transformer networks without additional training and achieves higher computational accuracy, significantly reducing software overhead and improving system compatibility. On the hardware side, the study designed an MBS approximate-computing architecture with high parallelism and low resource occupation, improving computational accuracy while maintaining low area and power consumption. This work has been accepted by IEEE TCAS-I (MBS: A High-Precision Approximation Method for Softmax and Efficient Hardware Implementation, IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 72, no. 7, pp. 3366-3375, July 2025).
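The exact MBS formulation is given in the paper. For intuition only, the Base-2 baseline replaces e^z with the shift-friendly 2^z, and combining base-2 and base-4 terms can track e^z = 2^(1.4427 z) more closely; the sketch below uses the illustrative assumption 2^z * 4^(z/4) = 2^(1.5z), which is not the paper's MBS construction:

```python
import numpy as np

def softmax_ref(x):
    """Reference Softmax with the exact base-e exponential."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def softmax_base2(x):
    """Base-2 baseline: e^z replaced by 2^z (pure shift hardware)."""
    z = x - x.max()
    e = np.exp2(z)
    return e / e.sum()

def softmax_mixed24(x):
    """Hypothetical mixed-base sketch: 2^z * 4^(z/4) = 2^(1.5 z),
    a shift-friendly stand-in for e^z = 2^(1.4427 z).
    Illustrative only; the paper's MBS construction differs."""
    z = x - x.max()
    e = np.exp2(z) * np.exp2(z / 2.0)   # 4^(z/4) = 2^(z/2)
    return e / e.sum()

x = np.array([2.0, 1.0, 0.1])
for f in (softmax_ref, softmax_base2, softmax_mixed24):
    print(f.__name__, np.abs(f(x) - softmax_ref(x)).max())
```

On this toy input, the pure base-2 variant deviates from the reference by roughly 0.09 in the worst position, while the mixed-base sketch stays within about 0.01, illustrating how mixing bases can recover accuracy while keeping exponentials shift-based.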

Figure 3: Schematic diagram of the MBS algorithm

Figure 4: MBS hardware architecture diagram
Paper link: https://ieeexplore.ieee.org/document/10966265
