My Journey To Understanding and Using SHMEM:
Lesson 6: Reduce Functions

Recall the dilemma we were in during lessons 4 and 5. Each PE held a value that had to be sent to the root PE, which then added it to its running count. If you have touched MPI, this situation may remind you of the reduce functions. Well, thankfully, we have something similar in SHMEM.
- shmem_<TYPENAME>_<OP>_to_all( TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync )
Phew. To be honest, there is a lot to discuss with the above, so let's take it little by little.
- arg0 - TYPE *dest: pointer to the destination array
- arg1 - TYPE *source: pointer to the source array
- arg2 - int nreduce: number of elements in array to be reduced
- arg3 - int PE_start: the lowest PE number in the set
- arg4 - int logPE_stride: log_2 of the stride between the PE numbers in the set
- arg5 - int PE_size: the number of PEs in the set
You have also seen pSync before; however, we have not yet seen pWrk. So, let's take a few moments to understand the pointer pWrk.
- arg6 - TYPE *pWrk: pointer to a symmetric work array of type TYPE
Because this array has to hold intermediate values while they are being reduced, its size must be at least
\[\max( \text{nreduce}/2 + 1, \text{ SHMEM_REDUCE_MIN_WRKDATA_SIZE})\]
Unlike pSync, which requires you to set each element of the array before use, pWrk can just be left alone until the reduction function is called; the library only uses it as scratch space.
The last argument is pSync. (Again, recall its use in shmem_barrier.)
- long *pSync: pointer to a symmetric array of type long and of size SHMEM_REDUCE_SYNC_SIZE. Each element must be set to SHMEM_SYNC_VALUE.
Approximating pi revisited:
Let's redo the Monte Carlo simulation of pi, and this time let's use shmem_long_sum_to_all.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <inttypes.h>
#include <shmem.h>

#define NUM 100000

double circle(double x){
    return 1 - pow(x,2);
}

int main(){
    int me, npes;

    //create needed arrays for the reduction call
    //pWrk and pSync must live in symmetric memory, hence static
    //SHMEM_REDUCE_MIN_WRKDATA_SIZE suffices since nreduce/2+1 = 1
    static long pWrk[SHMEM_REDUCE_MIN_WRKDATA_SIZE];
    static long pSync[SHMEM_REDUCE_SYNC_SIZE];
    for (int i = 0; i < SHMEM_REDUCE_SYNC_SIZE; i++){
        pSync[i] = SHMEM_SYNC_VALUE;
    }

    shmem_init();
    me = shmem_my_pe();
    npes = shmem_n_pes();

    static long count = 0;
    double f_x;

    //seed the randomizer differently on each PE
    srand(time(0)+me);

    for (int i = 0; i < NUM; i++){
        //generate point (x,y) in the first quadrant
        //note: double is a 64 bit floating point
        double x = (double) rand() / (double) RAND_MAX;
        double y = (double) rand() / (double) RAND_MAX;
        f_x = circle(x);
        if (pow(y,2) <= f_x){
            count += 1;
        }
    }
    printf("%d: count %ld\n", me, count);

    static long total = 0; //needs to be in symmetric memory
    //note that we are only reducing one element
    //  starting with PE 0
    //  striding by 1 --> log_2(1) = 0
    shmem_long_sum_to_all(&total, &count, 1, 0, 0, npes, pWrk, pSync);

    if (me == 0){
        printf("%d: count total: %ld\n", me, total);
        printf("%d: ratio: %f\n", me, (double)total/((double)NUM * npes));
        printf("%d: est of pi: %f\n", me, (double)total/((double)NUM * npes)*4);
    }

    shmem_finalize();
    return 0;
}
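If you want to run this yourself, OpenSHMEM installations typically ship a compiler wrapper and a launcher; the names below (oshcc, oshrun) are the OpenSHMEM reference names, and your implementation's wrappers may differ:

```shell
# compile with the OpenSHMEM wrapper, linking the math library
oshcc pi.c -o pi -lm
# launch on 4 PEs
oshrun -np 4 ./pi
```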
Things to note:
- there was no need for a barrier. Why? Because when a PE returns from the reduction call, its destination array has already been updated. In addition, pSync is restored to the values it held before the call (pWrk is only scratch space, so its contents do not matter), so both can be reused later.
- every PE has the total number of counts, not just the root PE; this reduction was done across all the PEs (like MPI's allreduce). You may then ask, as I did, whether there is a routine like MPI's plain reduce (all to one). The answer is no. However, you can use the atomics if you would like.
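To make that last point concrete, here is a sketch of an all-to-one sum using an atomic add into a symmetric counter on PE 0. It is not a drop-in replacement for the reduction: unlike shmem_long_sum_to_all, this needs an explicit barrier before PE 0 may read the result, and only PE 0 ends up with the total.

```c
#include <stdio.h>
#include <shmem.h>

int main(){
    shmem_init();
    int me = shmem_my_pe();

    static long count;      // symmetric: this PE's local contribution
    static long total = 0;  // symmetric: only PE 0's copy is the target
    count = me + 1;         // stand-in for a real Monte Carlo count

    // every PE atomically adds its count into PE 0's total
    shmem_long_atomic_add(&total, count, 0);

    // unlike the reduction, we must synchronize before reading
    shmem_barrier_all();

    if (me == 0){
        printf("0: total %ld\n", total);
    }

    shmem_finalize();
    return 0;
}
```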
Exercises:
- Split the PEs into two groups (even and odd) and have the even PEs estimate pi with a smaller NUM than the odd PEs. Print off the numbers and compare.
- Look up a method to use Monte Carlo to approximate e. (It is pretty sweet.) While still estimating pi, have each PE provide an additional count for e and pass both counts on at the same time (so the arrays are of size 2).
- Implement a matrix-vector multiplication where each PE receives one row of the matrix. Note: you can use
- void *shmem_malloc(size_t size)
- void shmem_free(void *ptr)
