I’m learning how to use SIMD (Single Instruction, Multiple Data) for parallel data processing.
Suppose I have a large dataset (e.g., an array of 1 million floats), and I want to process it efficiently using SIMD instructions (like SSE, AVX, or NEON).
My question is: How do I decide how much data a processor or core should process in a SIMD program?
Do I just split the data evenly by the number of logical cores or SIMD lanes? Or are there other things to consider?
Additional context:

• I know that SIMD operates on vectors (e.g., 128-bit or 256-bit registers), so I can process 4 or 8 floats at a time depending on the instruction set.
• I’m not sure how this interacts with CPU cores or hardware threads.
• Should I also consider cache size, memory bandwidth, or NUMA?
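To make the first bullet concrete, here is roughly the loop structure I'm picturing: process the array a full vector at a time, then handle the leftover tail with a scalar loop. This is a sketch that assumes an 8-float (256-bit AVX) width and relies on compiler auto-vectorization rather than intrinsics:

```c
#include <stddef.h>

/* Scale an array in place, striding 8 floats per step (assuming a
   256-bit AVX register; use 4 for SSE/NEON), with a scalar loop for
   the remainder that doesn't fill a whole vector. */
void scale(float *a, size_t n, float k)
{
    size_t i = 0;
    size_t vec_end = n - (n % 8);        /* last index covered by whole vectors */
    for (; i < vec_end; i += 8)
        for (size_t j = 0; j < 8; j++)   /* the compiler can map this to one SIMD multiply */
            a[i + j] *= k;
    for (; i < n; i++)                   /* scalar tail: the remaining n % 8 elements */
        a[i] *= k;
}
```

Is this tail-handling pattern the right mental model, or do real SIMD programs deal with the remainder differently (masking, padding, etc.)?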
What are the best practices or heuristics to decide the data chunk size when designing a SIMD-parallel program?
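For concreteness, here is the naive scheme I would try first, in case it helps to critique: split the array evenly by thread count, but round each chunk up to a multiple of the vector width so every thread except possibly the last works on whole vectors. The width of 8 is an assumption (256-bit AVX):

```c
#include <stddef.h>

/* Floats per vector register: 8 assumes 256-bit AVX; use 4 for
   128-bit SSE/NEON. */
#define SIMD_WIDTH 8

/* Compute the half-open range [*start, *end) that worker `t` of
   `nthreads` should process, rounding the chunk size up to a
   multiple of SIMD_WIDTH. */
void chunk_range(size_t n, int nthreads, int t, size_t *start, size_t *end)
{
    size_t chunk = (n + (size_t)nthreads - 1) / (size_t)nthreads;  /* ceil(n / nthreads) */
    chunk = (chunk + SIMD_WIDTH - 1) / SIMD_WIDTH * SIMD_WIDTH;    /* round up to vector width */
    *start = (size_t)t * chunk;
    if (*start > n) *start = n;   /* trailing threads may get nothing */
    *end = *start + chunk;
    if (*end > n) *end = n;       /* last thread absorbs the shorter tail */
}
```

With n = 1,000,000 and 6 threads, this gives threads 0–4 chunks of 166,672 floats each and thread 5 the remaining 166,640. Is this even split a reasonable starting point, or should chunk size instead be driven by cache size or memory bandwidth as hinted above?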