Exploring Fine-grained Task Parallelism on Simultaneous Multithreading Cores

Denis Los, Igor Petushkov

Abstract


Latency-critical, high-performance applications are now parallelized even on power-constrained client systems to improve performance. However, an important scenario — fine-grained tasking on simultaneous multithreading (SMT) CPU cores in such systems — has not been well studied in previous work. In this paper, we therefore conduct a performance analysis of state-of-the-art shared-memory parallel programming frameworks on SMT cores using real-world fine-grained application kernels. We introduce Relic, a specialized and simple software-only parallel programming framework that enables extremely fine-grained tasking on SMT cores. Using the Relic framework, we increase performance speedups over serial implementations of the benchmark kernels by 19.1% compared to LLVM OpenMP, by 31.0% compared to GNU OpenMP, by 20.2% compared to Intel OpenMP, by 33.2% compared to X-OpenMP, by 30.1% compared to oneTBB, by 23.0% compared to Taskflow, and by 21.4% compared to OpenCilk.


References


D. M. Tullsen, S. J. Eggers, and H. M. Levy, "Simultaneous multithreading: maximizing on-chip parallelism," in Proc. 22nd Annual International Symposium on Computer Architecture, Santa Margherita Ligure, Italy, 1995, pp. 392-403.

D. T. Marr et al., “Hyper-Threading technology architecture and microarchitecture,” Intel Technology Journal, vol. 6, no. 1, pp. 4-15, 2002.

D. Koufaty and D. T. Marr, "Hyperthreading technology in the netburst microarchitecture," IEEE Micro, vol. 23, no. 2, pp. 56-65, March-April 2003, DOI: 10.1109/MM.2003.1196115.

Y. Zhai, X. Zhang, S. Eranian, L. Tang, and J. Mars, “HaPPy: hyperthread-aware power profiling dynamically,” in Proc. of the 2014 USENIX Conference on USENIX Annual Technical Conference, Philadelphia, PA, USA, 2014, pp. 211-218.

T. Leng, R. Ali, J. Hsieh, V. Mashayekhi, and R. Rooholamini, “An empirical study of hyper-threading in high performance computing clusters,” Linux HPC Revolution, Article ID 45, 2002.

L. Pons et al., “Effect of hyper-threading in latency-critical multithreaded cloud applications and utilization analysis of the major system resources,” Future Gener. Comput. Syst., vol. 131, pp. 194-208, June 2022.

N. Tuck and D. M. Tullsen, "Initial observations of the simultaneous multithreading Pentium 4 processor," in 2003 12th International Conference on Parallel Architectures and Compilation Techniques, New Orleans, LA, USA, 2003, pp. 26-34, DOI: 10.1109/PACT.2003.1237999.

D. M. Tullsen, J. L. Lo, S. J. Eggers, and H. M. Levy, "Supporting fine-grained synchronization on a simultaneous multithreading processor," in Proc. Fifth International Symposium on High-Performance Computer Architecture, Orlando, FL, USA, 1999, pp. 54-58, DOI: 10.1109/HPCA.1999.744326.

X. Qian, B. Sahelices, and J. Torrellas, "BulkSMT: Designing SMT processors for atomic-block execution," in IEEE International Symposium on High-Performance Comp Architecture, New Orleans, LA, USA, 2012, pp. 1-12, DOI: 10.1109/HPCA.2012.6168952.

N. Anastopoulos and N. Koziris, "Facilitating efficient synchronization of asymmetric threads on hyper-threaded processors," in 2008 IEEE International Symposium on Parallel and Distributed Processing, Miami, FL, USA, 2008, pp. 1-8, DOI: 10.1109/IPDPS.2008.4536358.

J. L. Kihm and D. A. Connors, "Implementation of fine-grained cache monitoring for improved SMT scheduling," in IEEE International Conference on Computer Design: VLSI in Computers and Processors, 2004. ICCD 2004. Proceedings., San Jose, CA, USA, 2004, pp. 326-331, DOI: 10.1109/ICCD.2004.1347941.

L. Dagum and R. Menon, "OpenMP: an industry standard API for shared-memory programming," IEEE Computational Science and Engineering, vol. 5, no. 1, pp. 46-55, Jan.-March 1998, DOI: 10.1109/99.660313.

E. Ayguade et al., "The design of OpenMP tasks," IEEE Transactions on Parallel and Distributed Systems, vol. 20, no. 3, pp. 404-418, March 2009, DOI: 10.1109/TPDS.2008.105.

LLVM OpenMP Project, "Support for the OpenMP language," 2024. [Online]. Available: https://openmp.llvm.org.

GCC Team, "GOMP: an OpenMP implementation for GCC," 2024. [Online]. Available: https://gcc.gnu.org/projects/gomp.

P. Nookala, K. Chard, I. Raicu, “X-OpenMP — eXtreme fine-grained tasking using lock-less work stealing,” Future Generation Computer Systems, vol. 159, pp. 444-458, 2024, DOI: 10.1016/j.future.2024.05.019.

S. Iwasaki, A. Amer, K. Taura, S. Seo and P. Balaji, "BOLT: optimizing OpenMP parallel regions with user-level threads," in 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT), Seattle, WA, USA, 2019, pp. 29-42, DOI: 10.1109/PACT.2019.00011.

A. Kukanov and M. J. Voss, "The foundations for scalable multi-core software in Intel Threading Building Blocks," Intel Technology Journal, vol. 11, no. 4, p. 309, 2007.

T. -W. Huang, Y. Lin, C. -X. Lin, G. Guo, and M. D. F. Wong, "Cpp-Taskflow: a general-purpose parallel task programming system at scale," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 40, no. 8, pp. 1687-1700, Aug. 2021, DOI: 10.1109/TCAD.2020.3025075.

M. Aldinucci, M. Danelutto, P. Kilpatrick, and M. Torquati, "Fastflow: high-level and efficient streaming on multicore," in Programming Multicore and Many-Core Computing Systems, Wiley-Blackwell, 2017, pp. 261-280, DOI: 10.1002/9781119332015.ch13.

T. B. Schardl and I-T. A. Lee, "OpenCilk: a modular and extensible software infrastructure for fast task-parallel code," in Proc. of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, Montreal, QC, Canada, 2023, pp. 189-203, DOI: 10.1145/3572848.3577509.

Barcelona Supercomputing Center, OmpSs-2 Specification, 2024. [Online]. Available: https://pm.bsc.es/ftp/ompss-2/doc/spec.

L.V. Kale and S. Krishnan, “CHARM++: a portable concurrent object oriented system based on C++”, in Proc. of the Eighth Annual Conference on Object-Oriented Programming Systems, Languages, and Applications, Washington, D.C., USA, 1993, pp. 91-108, DOI: 10.1145/165854.165874.

A. Podobas, M. Brorsson, and K.-F. Faxén, "A comparison of some recent task-based parallel programming models," in Proc. of the 3rd Workshop on Programmability Issues for Multi-Core Computers, Pisa, Italy, 2010.

G.W. Price, D. K. Lowenthal, “A comparative analysis of fine-grain threads packages,” Journal of Parallel and Distributed Computing, vol. 63, no. 11, pp. 1050-1063, 2003.

K. Wheeler, D. Stark, and R. Murphy, “A comparative critical analysis of modern task-parallel runtimes,” Sandia National Laboratories, Albuquerque, New Mexico, USA, SAND2012-10594, Dec. 2012.

A. Podobas, M. Brorsson, and K.-F. Faxen, “A comparative performance study of common and popular task-centric programming frameworks,” Concurr. Comput.: Pract. Exper., vol. 27, no. 1, pp. 1-28, Jan. 2015, DOI: 10.1002/cpe.3186.

G. Zeng, “Performance analysis of parallel programming models for C++,” J. Phys.: Conf. Ser., vol. 2646, 2023, DOI: 10.1088/1742-6596/2646/1/012027.

E. Ajkunic, H. Fatkic, E. Omerovic, K. Talic, and N. Nosovic, "A comparison of five parallel programming models for C++," in 2012 Proceedings of the 35th International Convention MIPRO, Opatija, Croatia, 2012, pp. 1780-1784.

A. Leist, A. Gilman, “A comparative analysis of parallel programming models for C++,” in Proc. of The Ninth International Multi-Conference on Computing in the Global Information Technology, Seville, Spain, 2014, pp. 121-127.

C. D. Krieger, M. M. Strout, J. Roelofs, and A. Bajwa, "Executing optimized irregular applications using task graphs within existing parallel models," in 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, Salt Lake City, UT, USA, 2012, pp. 261-268, DOI: 10.1109/SC.Companion.2012.43.

L.M. Sanchez, J. Fernandez, R. Sotomayor, S. Escolar, J.D. Garcia, “A comparative study and evaluation of parallel programming models for shared-memory parallel architectures,” New Gener. Comput., vol. 31, pp. 139–161, 2013, DOI: 10.1007/s00354-013-0301-5.

S. Salehian, J. Liu, and Y. Yan, "Comparison of threading programming models," in 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Lake Buena Vista, FL, USA, 2017, pp. 766-774, DOI: 10.1109/IPDPSW.2017.141.

W. Heirman, T. E. Carlson, K. Van Craeynest, I. Hur, A. Jaleel, and L. Eeckhout, "Automatic SMT threading for OpenMP applications on the Intel Xeon Phi co-processor," in Proc. of the 4th International Workshop on Runtime and Operating Systems for Supercomputers, Munich, Germany, 2014, Article 7, DOI: 10.1145/2612262.2612268.

X. Tian, Y.-K. Chen, M. Girkar, S. Ge, R. Lienhart and S. Shah, "Exploring the use of Hyper-Threading technology for multimedia applications with Intel OpenMP compiler," in Proc. International Parallel and Distributed Processing Symposium, Nice, France, 2003, DOI: 10.1109/IPDPS.2003.1213118.

Y.-K. Chen, M. Holliman, E. Debes, S. Zheltov, A. Knyazev, S. Bratanov, R. Belenov, and I. Santos, “Media applications on Hyper-Threading technology,” Intel Technology Journal, vol. 6, no. 1, pp. 47-57, 2002.

Y.-K. Chen, M. Holliman, and E. Debes, "Video applications on hyper-threading technology," in Proc. IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, 2002, pp. 193-196, vol. 2, DOI: 10.1109/ICME.2002.1035546.

R. Schöne, D. Hackenberg, and D. Molka, “Simultaneous multithreading on x86_64 systems: an energy efficiency evaluation,” in Proc. of the 4th Workshop on Power-Aware Computing and Systems, Cascais, Portugal, 2011, Article 10, DOI: 10.1145/2039252.2039262.

E. Athanasaki, N. Anastopoulos, K. Kourtis, N. Koziris, “Exploring the performance limits of simultaneous multithreading for memory intensive applications,” The Journal of Supercomputing, vol. 44, pp. 64-97, 2008, DOI: 10.1007/s11227-007-0149-x.

E. Athanasaki, N. Anastopoulos, K. Kourtis, N. Koziris, “Exploring the capacity of a modern SMT architecture to deliver high scientific application performance,” in Proc. of the 2006 International Conference on High Performance Computing and Communications, Munich, Germany, 2006, pp. 180-189, DOI: 10.1007/11847366_19.

R. E. Grant and A. Afsahi, "A Comprehensive Analysis of OpenMP Applications on Dual-Core Intel Xeon SMPs," in 2007 IEEE International Parallel and Distributed Processing Symposium, Long Beach, CA, USA, 2007, pp. 1-8, DOI: 10.1109/IPDPS.2007.370682.

S. Ivanikovas and G. Dzemyda, "Evaluation of the hyper-threading technology for heat conduction-type problems," Mathematical Modelling and Analysis, vol. 12, no. 4, pp. 459-468, Dec. 2007.

M. Curtis-Maury, X. Ding, C.D. Antonopoulos, D.S. Nikolopoulos, “An evaluation of OpenMP on current and emerging multithreaded/multicore processors,” in Proc. of the First International Workshop on OpenMP Shared Memory Parallel Programming, Eugene, OR, USA, 2005, pp. 133-144.

H. Jin, M. Frumkin, and J. Yan, “The OpenMP implementation of NAS Parallel Benchmarks and its performance,” NASA Ames Research Center, Technical Report, Oct. 1999.

V. Aslot, M. J. Domeika, R. Eigenmann, G. Gaertner, W. B. Jones, and B. Parady, "SPEComp: a new benchmark suite for measuring parallel computer performance," in Proc. of the International Workshop on OpenMP Applications and Tools: OpenMP Shared Memory Parallel Programming, London, UK, 2001, pp. 1-10.

J. L. Henning, “SPEC CPU2006 benchmark descriptions,” SIGARCH Comput. Archit. News, vol. 34, no. 4, pp. 1-17, Sep. 2006, DOI: 10.1145/1186736.1186737.

J. D. Collins et al., "Speculative precomputation: long-range prefetching of delinquent loads," in Proc. 28th Annual International Symposium on Computer Architecture, Gothenburg, Sweden, 2001, pp. 14-25, DOI: 10.1109/ISCA.2001.937427.

A. Gontmakher, A. Mendelson, A. Schuster and G. Shklover, "Speculative synchronization and thread management for fine granularity threads," in The Twelfth International Symposium on High-Performance Computer Architecture, 2006., Austin, TX, USA, 2006, pp. 278-287, DOI: 10.1109/HPCA.2006.1598136.

J. Redstone, S. Eggers and H. Levy, "Mini-threads: increasing TLP on small-scale SMT processors," in The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings., Anaheim, CA, USA, 2003, pp. 19-30, DOI: 10.1109/HPCA.2003.1183521.

M. Abeydeera, S. Subramanian, M. C. Jeffrey, J. Emer and D. Sanchez, "SAM: Optimizing Multithreaded Cores for Speculative Parallelism," in 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), Portland, OR, USA, 2017, pp. 64-78, DOI: 10.1109/PACT.2017.37.

K.-F. Faxén, "Wool - a work stealing library," SIGARCH Comput. Archit. News, vol. 36, no. 5, pp. 93-100, Dec. 2008, DOI: 10.1145/1556444.1556457.

R. Rangan et al., “Speculative Decoupled Software Pipelining,” in Proc. of the 16th International Conference on Parallel Architecture and Compilation Techniques, Brasov, Romania, 2007, pp. 49-59.

M. C. Jeffrey, S. Subramanian, C. Yan, J. Emer and D. Sanchez, "A scalable architecture for ordered parallelism," in Proc. of the 48th Annual IEEE/ACM International Symposium on Microarchitecture, Waikiki, HI, USA, 2015, pp. 228-241, DOI: 10.1145/2830772.2830777.

M. C. Jeffrey, S. Subramanian, C. Yan, J. Emer and D. Sanchez, "Unlocking Ordered Parallelism with the Swarm Architecture," IEEE Micro, vol. 36, no. 3, pp. 105-117, May-June 2016, DOI: 10.1109/MM.2016.12.

S. Kumar, C. J. Hughes, and A. Nguyen, “Carbon: architectural support for fine-grained parallelism on chip multiprocessors,” in Proc. of the 34th Annual International Symposium on Computer Architecture, San Diego, California, USA, 2007, pp. 162-173, DOI: 10.1145/1250662.1250683.

S. Saini, A. Naraikin, R. Biswas, D. Barkai and T. Sandstrom, "Early performance evaluation of a "Nehalem" cluster using scientific and engineering applications," in Proc. of the Conference on High Performance Computing Networking, Storage and Analysis, Portland, OR, USA, 2009, pp. 1-12, DOI: 10.1145/1654059.1654084.

S. Beamer, K. Asanović, D. Patterson, “The GAP benchmark suite,” arXiv:1508.03619 [cs.DC], 2015.

Y. Shiloach and U. Vishkin, "An O(log n) parallel connectivity algorithm," Journal of Algorithms, vol. 3, no. 1, pp. 57-67, 1982.

RapidJSON library, 2024. [Online]. Available: https://rapidjson.org.

JSON Example, 2024. [Online]. Available: https://json.org/example.html.

L. Lamport, “Specifying Concurrent Program Modules,” ACM Trans. Program. Lang. Syst., vol. 5, no. 2, pp. 190-222, 1983, DOI: 10.1145/69624.357207.

P. P. C. Lee, T. Bu, and G. Chandranmenon, "A lock-free, cache-efficient multi-core synchronization mechanism for line-rate network traffic monitoring," in 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), Atlanta, GA, USA, 2010, pp. 1-12, DOI: 10.1109/IPDPS.2010.5470368.

J. Giacomoni, T. Moseley, and M. Vachharajani, “FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue,” in Proc. of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Salt Lake City, UT, USA, 2008, pp. 43-52, DOI: 10.1145/1345206.1345215.

J. Wang, K. Zhang, X. Tang, and B. Hua, "B-Queue: efficient and practical queuing for fast core-to-core communication," International Journal of Parallel Programming, vol. 41, pp. 137-159, 2013, DOI: 10.1007/s10766-012-0213-x.

Boost.Lockfree, 2024, [Online]. Available: https://www.boost.org/doc/libs/1_85_0/doc/html/lockfree.html.

R. Marotta et al., "Mutable locks: Combining the best of spin and sleep locks," Concurrency and Computation: Practice and Experience, vol. 32, no. 22, 2020, DOI: 10.1002/cpe.5858.





ISSN: 2307-8162