Unsupervised anomaly detection on cybersecurity data streams: a case with BETH dataset

Evgeniy Eremin

Abstract

The importance of cybersecurity grows from year to year, and the number of events generated by information security tools increases as IT infrastructure develops. At the same time, the cyber threat landscape keeps changing, so monitoring must cover both known attack indicators and those for which information security products of various classes do not yet have signature rules. Detecting anomalies in large cybersecurity data streams is a complex task that, if properly addressed, allows a timely response to atypical and previously unknown cyber threats. The use of offline algorithms may be limited by training time and retraining frequency, whereas stream learning algorithms can provide near-real-time data processing. This article examines the results of ten algorithms from three Python stream machine-learning libraries on the BETH dataset of cybersecurity events, which contains information about the creation, cloning, and destruction of operating system processes collected using eBPF. The ROC-AUC metric and the total processing time are presented for these algorithms. Several combinations of features and event orderings are considered. In conclusion, the most promising algorithms are noted and possible directions for further research are outlined.
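The evaluation idea described in the abstract (score each event as it arrives, then update the model, and report ROC-AUC together with total processing time) can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration: the `RunningZScore` detector is a toy stand-in and not one of the ten algorithms studied, the stream is synthetic rather than BETH data, and the names `evaluate_stream` and `roc_auc` are hypothetical helpers, not part of any library.

```python
import math
import random
import time


class RunningZScore:
    """Toy online detector (illustrative only): distance from the running mean."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford's method)

    def score_one(self, x):
        # Higher score means more anomalous; no estimate before two samples.
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) / std if std > 0 else 0.0

    def learn_one(self, x):
        # Incremental update of mean and variance with a single new value.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)


def roc_auc(labels, scores):
    """ROC-AUC via the rank-sum formulation, using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def evaluate_stream(detector, stream):
    """Score-then-learn loop; labels are used only for evaluation."""
    labels, scores = [], []
    start = time.perf_counter()
    for x, y in stream:
        scores.append(detector.score_one(x))  # score before the model sees x
        detector.learn_one(x)                 # then update the model
        labels.append(y)
    elapsed = time.perf_counter() - start
    return roc_auc(labels, scores), elapsed


# Synthetic stream (assumption): about 2% of values come from a shifted distribution.
rng = random.Random(0)
stream = []
for _ in range(2000):
    is_anomaly = rng.random() < 0.02
    value = rng.gauss(8.0, 1.0) if is_anomaly else rng.gauss(0.0, 1.0)
    stream.append((value, int(is_anomaly)))

auc, seconds = evaluate_stream(RunningZScore(), stream)
print(f"ROC-AUC: {auc:.3f}, total processing time: {seconds:.4f} s")
```

The detector exposes `score_one` and `learn_one` so that the same loop could, in principle, wrap a detector from a stream-learning library with a similar per-event interface; whether a given library matches this interface should be checked against its documentation.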

ISSN: 2307-8162

International Journal of Open Information Technologies