Hiroshi Inoue

Contact Information

Ph.D., Research Staff Member
IBM Research - Tokyo


Professional Associations

ACM SIGPLAN  |  IEEE Computer Society  |  Information Processing Society of Japan (IPSJ)

"SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures"
Hiroshi Inoue and Kenjiro Taura
PVLDB 8(11), pp. 1274-1285, presented at the 41st International Conference on Very Large Data Bases (VLDB 2015)

Full text [PDF at Publisher web page]: p1274-inoue.pdf
Slides [PDF]: VLDB2015_SIMDsort_slides.pdf
Poster [PDF]: SIMDSorting_A0.pdf
Casual explanation (in Japanese): blog post

This paper describes our new algorithm for sorting an array of structures by efficiently exploiting the SIMD instructions and the cache memory of today's processors. Recently, multiway mergesort implemented with SIMD instructions is used as a high performance in-memory sorting algorithm for sorting integer values. For sorting an array of structures with SIMD instructions, a frequently used approach is to first pack the key and the index for each record into an integer value, sort the key-index pairs using SIMD instructions, and then rearrange the records based on the sorted key-index pairs. This approach can efficiently exploit SIMD instructions because it sorts the key-index pairs while packed into integer values and hence it can use existing high-performance sorting implementations of the SIMD-based multiway mergesort for integers. However, this approach has frequent cache misses in the final rearranging phase due to its random and scattered memory accesses so that this phase limits both the single-thread performance and the scalability with multiple cores. Our new approach is also based on the multiway mergesort, but it can avoid costly random accesses for rearranging the records while it still efficiently exploits the SIMD instructions. Our results showed that our new approach achieved up to 2.1x better single-thread performance than the key-index approach implemented with SIMD instructions when sorting 512M 16-byte records on one core. Our approach also yielded better performance when we used multiple cores. Compared to an optimized radix sort, our vectorized mergesort achieved better performance when the each record is large. Also our algorithm yielded higher scalability with multiple cores than the radix sort.