Information Technologies
A. Aldarf "Compact Empirical GPU Scaling Framework for Neural Network Transformer Inference"

Abstract.

Transformer models achieve state-of-the-art results across a wide range of NLP tasks, yet deploying them remains challenging because of high inference latency and hardware cost. We empirically characterize transformer encoder inference across GPU generations using more than one thousand controlled measurements of DistilBERT, BERT-base, and BigBird-RoBERTa on NVIDIA T4, A10G, and L40S. Our analysis shows that architectural efficiency patterns are consistent across models and that the throughput-performance structure is smooth, low-dimensional, and stable across hardware generations. Exploiting these regularities, we develop a two-point cross-GPU scaling model that predicts the full throughput versus sequence-length curve from only one compute-dominated and one memory-dominated measurement. Prediction error decreases with batch size, from 12-13% MAPE at batch size 16 to 6% at batch size 256. The framework supports throughput forecasting, model selection, hardware capacity planning, and reduced benchmarking effort on new accelerators.
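To make the two-point idea concrete, the sketch below shows one plausible roofline-style reading of such a model: a memory-dominated measurement calibrates an effective bandwidth, a compute-dominated measurement calibrates an effective compute rate, and predicted throughput at any sequence length is limited by the slower of the two. The FLOP and byte estimates, model constants, and calibration values are illustrative assumptions, not the formulation used in the paper.

# Hypothetical two-point, roofline-style throughput sketch (not the paper's exact model).
from dataclasses import dataclass

@dataclass
class CalibrationPoint:
    seq_len: int        # sequence length of the measurement
    throughput: float   # measured throughput, sequences per second

def flops_per_seq(seq_len: int, hidden: int = 768, layers: int = 6) -> float:
    """Rough FLOP estimate per sequence for a transformer encoder; constants are illustrative."""
    dense = 12 * layers * seq_len * hidden ** 2   # QKV, output, and FFN projections
    attn = 2 * layers * seq_len ** 2 * hidden     # QK^T and attention-weighted V
    return 2.0 * (dense + attn)                   # multiply-add counted as 2 FLOPs

def bytes_per_seq(seq_len: int, hidden: int = 768, layers: int = 6) -> float:
    """Rough activation-traffic estimate per sequence (illustrative)."""
    return 4.0 * layers * seq_len * hidden * 8    # fp32 activations, a few passes per layer

def calibrate(mem_pt: CalibrationPoint, comp_pt: CalibrationPoint):
    """Fit effective bandwidth from the memory-dominated point and
    effective compute rate from the compute-dominated point."""
    eff_bw = bytes_per_seq(mem_pt.seq_len) * mem_pt.throughput
    eff_flops = flops_per_seq(comp_pt.seq_len) * comp_pt.throughput
    return eff_bw, eff_flops

def predict_throughput(seq_len: int, eff_bw: float, eff_flops: float) -> float:
    """Predicted throughput = 1 / max(memory time, compute time) per sequence."""
    t_mem = bytes_per_seq(seq_len) / eff_bw
    t_comp = flops_per_seq(seq_len) / eff_flops
    return 1.0 / max(t_mem, t_comp)

if __name__ == "__main__":
    # Two hypothetical measurements on a new GPU: a short, memory-dominated
    # sequence length and a long, compute-dominated one.
    eff_bw, eff_flops = calibrate(CalibrationPoint(64, 2400.0),
                                  CalibrationPoint(512, 310.0))
    for L in (64, 128, 256, 512):
        print(L, round(predict_throughput(L, eff_bw, eff_flops), 1))

Under these assumptions the two calibration points pin down the two roofline parameters, and every other point on the throughput-sequence-length curve follows from the min-of-two-limits rule; the actual framework in the paper may use a different parameterization.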

Keywords: 

transformer inference, GPU scaling, performance modeling, throughput prediction, compute–memory balance, benchmarking, deep learning systems, encoder models, NLP.

DOI: 10.14357/20790279260107
 

EDN: CSIJPK

Pp. 67-76.

References

1. Kalyan K.S., Rajasekharan A., Sangeetha S. AMMUS: A survey of transformer-based pretrained models in natural language processing. arXiv [Preprint]. 2021. https://doi.org/10.48550/arXiv.2108.05542
2. Santilli A., Severino S., Postolache E., Maiorca V., Mancusi M., Marin R., et al. Accelerating transformer inference for translation via parallel decoding. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. 2023;1:12336-12355.
3. Tithi J.J., Wu H., Abuhatzera A., Petrini F. Scaling intelligence: Designing data centers for next-gen language models. arXiv [Preprint]. 2025. https://doi.org/10.48550/arXiv.2506.15006
4. Joshy A., Sundar S. Analyzing the performance of sentiment analysis using BERT, DistilBERT, and RoBERTa. In: 2022 IEEE International Power and Renewable Energy Conference (IPRECON). 2022:1-6. https://doi.org/10.1109/IPRECON55716.2022.10059542
5. Marchenko O., Vrublevskyi V. Comparison of transformer-based deep learning methods for the paraphrase identification task. In: Information Technology and Implementation (IT&I-2023). 2023;3624:447-455. Available from: https://ceurws.org/Vol-3624/Short_5.pdf [Accessed 03 October 2025].
6. Kasoju A., Vishwakarma T.C. Optimizing transformer models for low-latency inference: Techniques, architectures, and code implementations. International Journal of Science and Research. 2025;14(4):857-866. https://doi.org/10.21275/SR25409073105 
7. Zhai Y., Zhang Y., Liu S., Chu X., Peng J., Ji J., et al. TLP: A deep learning-based cost model for tensor program tuning. In: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 2023;2:833-845. https://doi.org/10.1145/3575693.3575737 
8. Bahri Y., Dyer E., Kaplan J., Lee J., Sharma U. Explaining neural scaling laws. Proceedings of the National Academy of Sciences. 2024;121(27):e2311878121. https://doi.org/10.1073/pnas.2311878121 
9. Wu Y., Sun Z., Li S., Welleck S., Yang Y. Inference scaling laws: An empirical analysis of compute-optimal inference for LLM problem-solving. In: The Thirteenth International Conference on Learning Representations (ICLR). 2025. https://openreview.net/forum?id=VNckp7JEHn
10. Cho B.Y., Jung J., Erez M. Accelerating bandwidth-bound deep learning inference with main-memory accelerators. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC21). 2021:1-14. Available from: https://dl.acm.org/doi/10.1145/3458817.3476146 [Accessed 03 October 2025].
11. Du J., Jiang J., Zheng J., Zhang H., Huang D., Lu Y. Improving computation and memory efficiency for real-world transformer inference on GPUs. ACM Transactions on Architecture and Code Optimization. 2023;20(4):46. https://doi.org/10.1145/3617689 
12. Ridnik T., Lawen H., Noy A., Ben Baruch E., Sharir G., Friedman I. TResNet: High performance GPU-dedicated architecture. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). 2021:1400-1409. https://doi.org/10.1109/WACV48630.2021.00144
