Development of a computing backend for Llama.cpp based on Rockchip NPU with support for low-bit computations

Egor Antonyants, Ivan Kashtanov, Rostislav Voloshin

Abstract


This article addresses energy-efficient inference of large language models (LLMs) on low-power devices. It examines the high computational cost of text generation and surveys existing hardware acceleration methods. Particular attention is paid to using the Neural Processing Unit (NPU) of the Rockchip platform to accelerate resource-intensive tensor operations without significant quality loss.

As part of this research, a specialized computational backend for the Llama.cpp framework was developed and integrated into the modular GGML architecture. The backend offloads matrix multiplication to the NPU and supports computation in FP16, INT8, and INT4 formats. It implements weight preprocessing that adapts weights to the hardware format, smoothing of activation outliers using orthogonal Hadamard transforms, and an execution pipeline with parallelized computation. The solution was tested experimentally and compared against conventional execution on the central processing unit (CPU). Results on models ranging from 350 million to 8 billion parameters show that input context processing is more than 3 times faster than with the CPU backend. Offloading computation to the specialized accelerator reduced system energy consumption by more than a factor of 2, with minimal accuracy degradation for models exceeding 1 billion parameters. The backend makes deploying modern LLMs on passively cooled, power-constrained devices more realistic.
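
The abstract does not describe how the Hadamard-based outlier smoothing is implemented, so the following is only a minimal Python sketch of the general idea, not the authors' code. It assumes a power-of-two vector length and uses hypothetical names (`hadamard_rotate`, `activations`, `weights`) and made-up toy values. It illustrates two properties that make such a rotation attractive before low-bit quantization: rotating both operands with the same orthonormal Hadamard matrix leaves their dot product unchanged, and the rotation spreads a single large activation across all coordinates, which lowers the peak value a quantizer has to cover.

```python
import math


def hadamard_rotate(vec):
    """Multiply vec by the orthonormal Hadamard matrix (fast Walsh-Hadamard transform)."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        # Butterfly step: combine pairs of elements that are h positions apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # normalization makes the transform orthogonal
    return [v * scale for v in out]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def peak(vec):
    return max(abs(v) for v in vec)


# Toy data (hypothetical): one activation is a large outlier.
activations = [0.1, -0.2, 0.15, 50.0, 0.05, -0.1, 0.2, -0.15]
weights = [0.3, -0.5, 0.2, 0.1, -0.4, 0.6, -0.1, 0.25]

rot_a = hadamard_rotate(activations)
rot_w = hadamard_rotate(weights)

exact = dot(activations, weights)
rotated_exact = dot(rot_a, rot_w)

print("dot product before / after rotation:", exact, rotated_exact)
print("peak activation before / after rotation:", peak(activations), peak(rot_a))
```

Because the rotation preserves the vector norm while lowering the peak, the ratio of peak to average magnitude shrinks. Whether this reduces the final quantization error depends on the data distribution and on the quantization scheme, which this sketch does not model.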


Full Text:

PDF (Russian)






ISSN: 2307-8162