SHMEMmer Wanna Be

My Journey To Understanding and Using SHMEM:

Lesson 6: Reduce Functions



Recall the dilemma we were in during lesson 4 and 5. We had on each processor a value that had to be sent to the root node, which would then be added to the count value. If you have touched MPI, this situation may remind you of the reduce function. Well, thankfully, we have something similar in SHMEM. where OP can be sum, min, and, or, xor, max, min.
Phew. To be honest, there is a lot to discuss with the above, so lets take this little by little. So far so good. We have dealt with the above before, and we have even come across the next 3 arguments as well while learning functions like shmem_barrier. Recall that shmem_barrier allowed for synchronization amongst a subset of PEs, rather than the whole set. The subset was defined by three traits: the lowest PE number in the set, the stride between the PE numbers in the set, and the number of PEs in the set. Similarly, we have the next three arguments: For, example, the set {1,5,9} would give PE_start=1, logPE_stride=2, PE_size=3.

You have also seen the seventh argument pSync before; however, we have not yet seen pWrk. So, lets take a few moments to understand the pointer *pWrk. pWrk is a pointer to an array of the same type as the destination and source. The reason being is that the array to which pWrk points is where the reduction takes place. That is, the values of source from each PE is placed in this array and then combined using the designated operator.

Because this array then has to hold the values while they are being reduced, the size of the array must be at least
\[\max( \text{nreduce}/2 + 1, \text{ SHMEM_REDUCE_MIN_WRKDATA_SIZE})\]

Unlike pSync which requires you to set each value of the array, pWrk can just be left alone until the reduction functions are called.

The last argument is pSync. (Again, recall its use in shmem_barrier.) So, what does this mean for us. Before a reduction call, it is on the programmer to create two arrays: pWrk and pSync.

Approximating pi revisited:

Lets redo the monte carlo simulation of pi, and this time lets use shmem_long_sum_to_all.


#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <inttypes.h>

#include <shmem.h>

#define NUM 100000

double circle(double x){
	return 1 - pow(x,2);
}

int main(){
    int me, npes;

    //create needed arrays for the reduction call
    //can automatically do ShMEM_REDUCE_MIN_WRKDATA_SIZE since nreduce/2+1=1
    long pWrk[SHMEM_REDUCE_MIN_WRKDATA_SIZE];
    long pSync[SHMEM_REDUCE_SYNC_SIZE];
    for (int i = 0; i < SHMEM_REDUCE_SYNC_SIZE; i++){
        pSync[i] = SHMEM_SYNC_VALUE;
    }

    shmem_init();
    me = shmem_my_pe();
    npes = shmem_n_pes();

    static long count = 0;
    double f_x;

    //seed the randomizer
    srand(time(0)+me);
    for (int i = 0; i < NUM; i++){
        //generate point(x,y) in first quadrant
        //note: double is a 64 bit floating point
        double x = (double) rand() / (double) RAND_MAX;
        double y = (double) rand() / (double) RAND_MAX;
        f_x = circle(x);
        if (pow(y,2) <= f_x){
            count += 1;
        }
    }

    printf("%d: count %ld\n", me, count);

    static long total = 0; //needs to be in symmetric memory
    
    //note that we are only reducing one element
    //                      starting with PE 0
    //                      striding by 1 --> log_2(1) = 0
    shmem_long_sum_to_all(&total, &count, 1, 0, 0, npes, pWrk, pSync);   
 
    if (me == 0){
        printf("%d: count total: %ld\n", me, total);
        printf("%d: ratio: %f\n", me, (double)total/((double)NUM * npes));
        printf("%d: est of pi: %f\n", me, (double)total/((double)NUM * npes)*4);
    }
    return 0;
}
Things to note: Exercises:
  1. Split the PEs into two groups (even and odd) and have the even estimate pi with a smaller NUM than the odd. Print off the numbers and compare.
  2. Look up a method to use Monte Carlo to approximate e. (It is pretty sweet.) While still estimating pi, have each PE provide an additional count for e and pass that on at the same time (so, array is of size 2)
  3. Implement a matrix-vector multiplication, where each PE receives one row of the matrix
  4. Note: you can use