My Journey To Understanding and Using SHMEM:
Lesson 6: Reduce Functions

Recall the dilemma we were in during lessons 4 and 5. Each PE held a value that had to be sent to the root PE, which then added it to its running count. If you have touched MPI, this situation may remind you of the reduce functions. Well, thankfully, we have something similar in SHMEM.
- shmem_<TYPENAME>_<OP>_to_all( TYPE *dest, const TYPE *source, int nreduce, int PE_start, int logPE_stride, int PE_size, TYPE *pWrk, long *pSync )
Phew. To be honest, there is a lot to discuss with the above, so let's take it little by little.
- arg0 - TYPE *dest: pointer to the destination array
- arg1 - TYPE *source: pointer to the source array
- arg2 - int nreduce: number of elements in array to be reduced
- arg3 - int PE_start: the lowest PE number in the set
- arg4 - int logPE_stride: log_2 of the stride between the PE numbers in the set
- arg5 - int PE_size: the number of PEs in the set
You have also seen pSync before; however, we have not yet seen pWrk. So, let's take a few moments to understand the pointer pWrk.
- arg6 - TYPE *pWrk: pointer to a symmetric work array of type TYPE
Because this array has to hold intermediate values while they are being reduced, its size must be at least
\[\max( \text{nreduce}/2 + 1, \text{ SHMEM_REDUCE_MIN_WRKDATA_SIZE})\]
Unlike pSync, which requires you to set each element of the array before use, pWrk can just be left alone until the reduction function is called; the library only uses it as scratch space.
The last argument is pSync. (Again, recall its use in shmem_barrier.)
- long *pSync: pointer to a symmetric array of type long and of size SHMEM_REDUCE_SYNC_SIZE. Each element must be set to SHMEM_SYNC_VALUE.
Approximating pi revisited:
Let's redo the Monte Carlo simulation of pi, and this time let's use shmem_long_sum_to_all.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <inttypes.h>
#include <shmem.h>

#define NUM 100000

double circle(double x){
    return 1 - pow(x,2);
}

int main(){
    int me, npes;

    //create needed arrays for the reduction call
    //pWrk and pSync must live in symmetric memory, hence static
    //SHMEM_REDUCE_MIN_WRKDATA_SIZE suffices since nreduce/2+1 = 1
    static long pWrk[SHMEM_REDUCE_MIN_WRKDATA_SIZE];
    static long pSync[SHMEM_REDUCE_SYNC_SIZE];
    for (int i = 0; i < SHMEM_REDUCE_SYNC_SIZE; i++){
        pSync[i] = SHMEM_SYNC_VALUE;
    }

    shmem_init();
    me = shmem_my_pe();
    npes = shmem_n_pes();

    static long count = 0;
    double f_x;

    //seed the randomizer differently on each PE
    srand(time(0)+me);

    for (int i = 0; i < NUM; i++){
        //generate point (x,y) in the first quadrant
        //note: double is a 64 bit floating point
        double x = (double) rand() / (double) RAND_MAX;
        double y = (double) rand() / (double) RAND_MAX;
        f_x = circle(x);
        if (pow(y,2) <= f_x){
            count += 1;
        }
    }
    printf("%d: count %ld\n", me, count);

    static long total = 0; //needs to be in symmetric memory
    //note that we are only reducing one element
    //  starting with PE 0
    //  striding by 1 --> log_2(1) = 0
    shmem_long_sum_to_all(&total, &count, 1, 0, 0, npes, pWrk, pSync);

    if (me == 0){
        printf("%d: count total: %ld\n", me, total);
        printf("%d: ratio: %f\n", me, (double)total/((double)NUM * npes));
        printf("%d: est of pi: %f\n", me, (double)total/((double)NUM * npes)*4);
    }

    shmem_finalize();
    return 0;
}
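If you want to run this yourself, OpenSHMEM installations typically ship a compiler wrapper and a launcher; the names below (oshcc, oshrun) are the OpenSHMEM reference names, and your implementation's wrappers may differ:

```shell
# compile with the OpenSHMEM wrapper, linking the math library
oshcc pi.c -o pi -lm
# launch on 4 PEs
oshrun -np 4 ./pi
```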
Things to note:
- there was no need for a barrier. Why? Because when a PE returns from the reduction call, its destination array has already been updated. In addition, pSync is restored to the values it held before the call (pWrk is only scratch space, so its contents do not matter), so both can be reused later.
- every PE has the total number of counts, not just the root PE; this reduction was done across all the PEs (like MPI's allreduce). You may then ask, as I did, whether there is a routine like MPI's plain reduce (all to one). The answer is no. However, you can use the atomics if you would like.
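To make that last point concrete, here is a sketch of an all-to-one sum using an atomic add into a symmetric counter on PE 0. It is not a drop-in replacement for the reduction: unlike shmem_long_sum_to_all, this needs an explicit barrier before PE 0 may read the result, and only PE 0 ends up with the total.

```c
#include <stdio.h>
#include <shmem.h>

int main(){
    shmem_init();
    int me = shmem_my_pe();

    static long count;      // symmetric: this PE's local contribution
    static long total = 0;  // symmetric: only PE 0's copy is the target
    count = me + 1;         // stand-in for a real Monte Carlo count

    // every PE atomically adds its count into PE 0's total
    shmem_long_atomic_add(&total, count, 0);

    // unlike the reduction, we must synchronize before reading
    shmem_barrier_all();

    if (me == 0){
        printf("0: total %ld\n", total);
    }

    shmem_finalize();
    return 0;
}
```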
Exercises:
- Split the PEs into two groups (even and odd) and have the even PEs estimate pi with a smaller NUM than the odd PEs. Print off the numbers and compare.
- Look up a method to use Monte Carlo to approximate e. (It is pretty sweet.) While still estimating pi, have each PE provide an additional count for e and pass both counts on at the same time (so the arrays are of size 2).
- Implement a matrix-vector multiplication where each PE receives one row of the matrix. Note: you can use
- void *shmem_malloc(size_t size)
- void shmem_free(void *ptr)
