In C++ it's very convenient to store array data in a std::vector from the standard library. On modern CPUs you can take advantage of vectorized instructions that operate on multiple data elements at the same time. But how do you combine a std::vector with SSE code? For example, you want to sum up all elements of a float vector of arbitrary length (in C++ this corresponds to std::accumulate(v.begin(), v.end(), 0.0f)). This article will show you how to access the elements of the vector using SSE intrinsics and accumulate all elements into a single value.

First, you need to get a pointer to the data of the std::vector v in order to load four elements at a time using the 128-bit _mm_loadu_ps intrinsic. Note that it is generally not possible to use an aligned load here, since std::vector does not guarantee 16-byte-aligned storage. Now we use the loop-unrolling technique presented in an earlier article to sum up 4 elements in parallel using a single _mm_add_ps instruction. When there are still elements left after all unrolled iterations of the loop, we add them up using the single-element _mm_add_ss. Since we accumulate the float values in four bins, we need to add those up as well before we can return the total sum over all elements of the vector v. In this example we use the SSE3 horizontal add intrinsic _mm_hadd_ps.

The code follows these steps. First, copy the length of v and a pointer to the data onto the local stack: const float* p = (N > 0) ? &v.front() : NULL. Next, run the unrolled loop that adds up 4 elements at a time: mmSum = _mm_add_ps(mmSum, _mm_loadu_ps(p + i)). Then add up single values until all elements are covered: mmSum = _mm_add_ss(mmSum, _mm_load_ss(p + i)). Finally, add up the four float values from mmSum into a single value and return it.
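Putting the steps together, a minimal sketch of the complete function might look like the following. The function name sse_sum, the SSE3_TARGET macro, the loop bounds, and the final _mm_cvtss_f32 extraction are assumptions made for illustration. Only the commented steps and the intrinsic calls inside them come from the text above.

```cpp
#include <cstddef>
#include <vector>
#include <pmmintrin.h>  // SSE3 (_mm_hadd_ps); also pulls in the SSE/SSE2 headers

// Assumption: GCC and Clang need SSE3 enabled to use _mm_hadd_ps. This
// attribute enables it for one function, so no -msse3 flag is required.
#if defined(__GNUC__)
#define SSE3_TARGET __attribute__((target("sse3")))
#else
#define SSE3_TARGET
#endif

// Hypothetical helper: returns the sum of all elements of v.
SSE3_TARGET float sse_sum(const std::vector<float>& v)
{
    // copy the length of v and a pointer to the data onto the local stack
    const std::size_t N = v.size();
    const float* p = (N > 0) ? &v.front() : NULL;

    __m128 mmSum = _mm_setzero_ps();  // four partial sums, all starting at 0
    std::size_t i = 0;

    // unrolled loop that adds up 4 elements at a time
    for (; i + 4 <= N; i += 4)
        mmSum = _mm_add_ps(mmSum, _mm_loadu_ps(p + i));

    // add up single values until all elements are covered
    // (_mm_add_ss only touches the lowest lane of mmSum)
    for (; i < N; ++i)
        mmSum = _mm_add_ss(mmSum, _mm_load_ss(p + i));

    // add up the four float values from mmSum into a single value and return
    mmSum = _mm_hadd_ps(mmSum, mmSum);  // (a+b, c+d, a+b, c+d)
    mmSum = _mm_hadd_ps(mmSum, mmSum);  // (a+b+c+d, ...)
    return _mm_cvtss_f32(mmSum);        // extract the lowest lane
}
```

Calling sse_sum(v) is then a drop-in replacement for std::accumulate(v.begin(), v.end(), 0.0f). Because the elements are added in a different order, the result can differ from std::accumulate in the last bits for inputs that are not exactly representable.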