Study of the collapse of language models in medical applications during recursive and cross-training on artificial data
Abstract
The purpose of this article is to study the phenomenon of collapse of language models under recursive and cross-training approaches to training next-generation models for ultrasound diagnosis and treatment of the thyroid gland. In the recursive approach, each new model is trained exclusively on data generated by the previous version of the model, which makes it possible to examine the accumulation of systematic errors and the degradation of data diversity. In the cross-training approach, data generated by one model is used to train another model, which reduces the impact of accumulated errors and preserves a wider range of information. In the experiments, Mistral and LLaMA models are trained, and shifts in the data distribution are analyzed using the Kullback–Leibler (KL) divergence, which quantifies the difference between the original data distribution and the distribution of model-generated data. The results show that recursive training causes a significant reduction in the diversity of the generated text, especially for the LLaMA model, whereas cross-training is more resistant to collapse and maintains more stable data diversity. Architectural differences between the models, such as attention optimizations and the ability to model long-range dependencies, are considered as factors affecting training outcomes. The influence of the two training regimes on the ability of medical language models to preserve the variety and quality of the generated text is analyzed.
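The KL-divergence comparison described above can be illustrated with a minimal sketch. This is not the authors' implementation (the full text is not reproduced here): the unigram modeling, add-one smoothing, and the toy thyroid-ultrasound strings below are illustrative assumptions used only to show how the divergence between an original corpus and model-generated text might be estimated.

```python
from collections import Counter
import math

def unigram_dist(text, vocab):
    """Add-one-smoothed unigram distribution over a fixed shared vocabulary."""
    counts = Counter(text.split())
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def kl_divergence(p_text, q_text):
    """KL(P || Q): how far the generated distribution Q drifts from the original P."""
    vocab = set(p_text.split()) | set(q_text.split())
    p = unigram_dist(p_text, vocab)
    q = unigram_dist(q_text, vocab)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)

# Hypothetical example: a collapsed model repeats high-frequency tokens,
# so its output distribution diverges from the original report text.
original = "hypoechoic nodule with irregular margins and microcalcifications"
collapsed = "hypoechoic nodule nodule nodule with margins"
print(kl_divergence(original, collapsed))
```

A larger divergence between generations would signal the loss of diversity that the article associates with recursive training; identical distributions give a divergence of zero.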
ISSN: 2307-8162