Simulation Modeling of a Fault-Tolerant Computing Cluster with Two-Level Load Balancing and Container Virtualization
Full Text: PDF (Russian)
References
Maenhaut P. J., Volckaert B., Ongenae V., De Turck F. Resource Management in a Containerized Cloud: Status and Challenges // Journal of Network and Systems Management. 2020. Vol. 28(2). P. 197–246. DOI: 10.1007/s10922-019-09504-0.
Kelton W., Sadowski R., Zupick N. Simulation with Arena. New York: McGraw-Hill Education, 2015. 640 p.
Law A. M. Simulation Modeling and Analysis. New York: McGraw-Hill Education, 2014. 784 p.
Kleinrock L. Queueing Systems. Vol. 1. New York: Wiley, 1975. 432 p.
Kumar D., Ravi V. A survey on fault tolerance in cloud computing // Journal of Cloud Computing. 2018. Vol. 7(1).
Matloff N. Introduction to Discrete-Event Simulation and the SimPy Language. Davis: University of California, 2008.
Avizienis A., Laprie J.-C., Randell B., Landwehr C. Basic concepts and taxonomy of dependable and secure computing // IEEE Trans. Dependable and Secure Computing. 2004. Vol. 1(1). P. 11–33.
Schroeder B., Gibson G. A large-scale study of failures in high-performance computing systems // IEEE Trans. Dependable and Secure Computing. 2010. Vol. 7(4). P. 337–350.
Bogatyrev V. A., Bogatyrev A. V., Bogatyrev S. V. Reliability assessment of cluster execution of real-time requests // Izv. Vyssh. Uchebn. Zaved. Priborostroenie. 2014. Vol. 57. No. 4. P. 46–48.
Bogatyrev V. A. Combinatorial-probabilistic assessment of reliability and fault tolerance of cluster systems // Pribory i Sistemy. Upravlenie, Kontrol', Diagnostika. 2006. No. 6. P. 21–26.
Bogatyrev V. A., Derkach A. N., Bogatyrev S. V. Timeliness of the Reserved Maintenance by Duplicated Computers of Heterogeneous Delay-Critical Stream // CEUR Workshop Proceedings. ISTMC, 2019. P. 26–36.
Hwang J. et al. IASO: A Framework for Mitigating the Impact of Fail-Slow in Distributed Storage Services // USENIX ATC '19. 2019.
Koutras M. A. Markov regenerative process model for performability evaluation of a computer cluster system // Reliability Engineering & System Safety. 2023.
Fung V. K., Bogatyrev V. A., Do M. K. Simulation model of a computing cluster with container virtualization // Vestnik Komp'yuternykh i Informatsionnykh Tekhnologiy. 2025. Vol. 22. No. 8. P. 3–12. DOI: 10.14489/vkit.2025.08.pp.003-012.
Dean J., Barroso L. The tail at scale // Communications of the ACM. 2013. Vol. 56(2). P. 74–80.
Fung V. K., Bogatyrev V. A. Experimental study of cluster performance with container virtualization // Izv. Vyssh. Uchebn. Zaved. Priborostroenie. 2024. Vol. 67. No. 8. P. 647–656. DOI: 10.17586/0021-3454-2024-67-8-647-656.
Fung V. K., Bogatyrev V. A., Karmanovsky N. S., Le V. H. Probabilistic-temporal characteristics of a computer system with container virtualization // Nauchno-Tekhnicheskiy Vestnik IT, Mekhaniki i Optiki. 2024. Vol. 24. No. 2. P. 249–255. DOI: 10.17586/2226-1494-2024-24-2-249-255.
Zhang T., Sharma U., Kapritsos M. Performal: Formal Verification of Latency Properties for Distributed Systems // Proc. ACM on Programming Languages. 2023. Vol. 7. Art. 121. P. 1–26. DOI: 10.1145/3591249.
Zhao K., Goyal P., Alizadeh M., Anderson T. E. Scalable Tail Latency Estimation for Data Center Networks // 20th USENIX NSDI. Boston, MA, 2023. P. 685–702.
Ledmi A., Bendjenna H., Hemam S. M. Fault Tolerance in Distributed Systems: A Survey // Proc. 3rd Intl. Conf. Pattern Analysis and Intelligent Systems (PAIS). IEEE, 2018. P. 1–5. DOI: 10.1109/PAIS.2018.8598484.
Gunawi H. S., Suminto R. O., Sears R. et al. Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems // ACM Transactions on Storage. 2018. Vol. 14(3). Art. 23. P. 1–26. DOI: 10.1145/3242086.
Lu R., Xu E., Zhang Y. et al. Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems // Proc. 21st USENIX Conference on File and Storage Technologies (FAST '23). Santa Clara: USENIX Association, 2023. P. 49–64.
Lou C., Jing Y., Huang P. Demystifying and Checking Silent Semantic Violations in Large Distributed Systems // Proc. 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI '22). Carlsbad: USENIX Association, 2022. P. 91–107.
Tirmazi M., Barker A., Deng N. et al. Borg: the Next Generation // Proc. 15th European Conference on Computer Systems (EuroSys '20). Heraklion: ACM, 2020. Art. 30. DOI: 10.1145/3342195.3387517.
Beyer B., Murphy N. R., Rensin D. K., Kawahara K., Thorne S. The Site Reliability Workbook: Practical Ways to Implement SRE. Sebastopol: O'Reilly Media, 2018. ISBN 978-1-491-92521-7.
ISSN: 2307-8162