CSE 525: Parallel Computing

School of Computer Science and Engineering

California State University, San Bernardino

Last modified: Mon, June 10, 2019

Implement the reduction algorithm in CUDA for up to 1024 threads and an array of up to 1024 elements. Since the maximum block size on the lab computers in JBH-359 is 1024, your program should work for any of the following execution configurations:

1 block of 1024 threads

2 blocks of 512 threads each

256 blocks of 4 threads each

Refer to Figure 5.15.

Hand in a printout of your program, complete with a `main()`, and sample runs.

Solution to Homework 4, rectangular matrix multiplication RecMatMultTiled.cu.

Wednesday, June 5

1. Eric Blasko

2. Luis Alvarez

3. Andrew Price

Monday, June 10

1. Maira Castaneda Cervante

2. Jaejun Ka

Modify the tiled square matrix multiplication SquareMatMultTiled.cu to perform tiled rectangular matrix multiplication.

Refer to Figure 4.20 and the two paragraphs preceding it on page 96.

Shared memory is not initialized; therefore lines 9 and 10 in Figure 4.20 must be changed to

    if ((Row < Width) && (ph*TILE_WIDTH + tx) < Width)
        Mds[ty][tx] = M[Row*Width + ph*TILE_WIDTH + tx];
    else
        Mds[ty][tx] = 0;
    if ((ph*TILE_WIDTH + ty) < Width && Col < Width)
        Nds[ty][tx] = N[(ph*TILE_WIDTH + ty)*Width + Col];
    else
        Nds[ty][tx] = 0;

so that threads that don't have a global memory value to load (in the M and N matrices) initialize their corresponding shared memory to 0.

Exercises 3 and 4 Page 68.

Exercises 6 and 7 Page 100.

Exercise 9 Page 101.

SquareMatMultTiled.cu CUDA tiled square matrix multiplication.

SquareMatMult.cu CUDA square matrix multiplication.

SquareMatMultSeq.c Sequential version of square matrix multiplication.

rowOrder.cu illustrates thread assignments to a 12 col x 8 row matrix.

Implement the all-reduction algorithm in MPI. In all-reduction, every process ends up with the result of the reduction; therefore there is no root process.

Give your own allreduce() function with the following prototype:

void allreduce(int *local, int *sum, MPI_Comm comm)

allreduce() is performed on integers and the reduction operator is summation.

Give your own main() to test the correctness of your allreduce() function. Assume the number of processes is a power of 2.

query.cu finds and reports on the properties of GPU cards.

Evolution of CUDA simple add program:

simpleVecAdd1.cu

simpleVecAdd2.cu

simpleVecAdd3.cu

simpleVecAdd4.cu

Evolution of Sieve of Eratosthenes program:

sieve.c

sieve1.c

sieve2.c

sieve3.c

Implement the reduction algorithm in MPI. The number of processes p is NOT necessarily a power of 2. The root of the reduction can be any process (0 through p-1).

Give your own reduce() function with the following prototype:

void reduce(int *local, int *sum, int root, MPI_Comm comm)

The reduction is performed on integers and the reduction operator is summation.

Give your own main() to test the correctness of your reduce() function.

Three versions of Satisfiability problem:

sat1.c

sat2.c

sat3.c

hello.c is a simple MPI Program.

To run an MPI program on a network of workstations, MPI processes must be able to run on a remote machine without a password prompt. Perform the following steps to be able to log in on a machine in JB358/JB359 without entering a password.

1. Login on jbh3-1.

2. Remove .ssh subdirectory:

$ rm -rf .ssh

3. Generate RSA key pairs:

$ ssh-keygen -t rsa

Hit the Return key three times to accept the default file name and empty passphrase.

4. Change protection mode for the newly created directory .ssh:

$ chmod 755 .ssh

5. Create authorized_keys file in .ssh:

$ cd .ssh

$ cp id_rsa.pub authorized_keys

$ cd

You are done! Now you need to log in on each machine in JB358/JB359 once to add that machine's name to the list of known_hosts. After a machine is added to this list, subsequent logins do not require a password--life is good!

These instructions are taken from Open MPI web site.

Give a routing algorithm for a 2D-mesh network with wraparound edges. The mesh consists of n rows and n columns (square mesh). The rows and columns are numbered 0 to n-1. Your algorithm must be efficient, that is, it must take the shortest path to the destination.

The parameters to your algorithm are (D_{i}, D_{j})
which represent destination coordinates (row i and column j).
These parameters are part of the message header. As the message is passed
from one node to another, each node examines its own coordinates
and the destination coordinates to pass the message to one of its neighbors.

Your algorithm must be precise and concise. The closer you are to a program, the better.
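To illustrate the expected level of precision, here is one possible next-hop function in C (a sketch: the names `ring_step` and `next_hop`, and the choice to correct the row coordinate first, are assumptions, not requirements). Each node compares its own coordinates (i, j) with the destination (Di, Dj), computes the signed shortest displacement around each ring, and moves one hop in the shorter direction.

```c
/* Signed shortest displacement from a to b on a ring of size n:
 * result lies in [-(n/2), n/2]; positive means move in the
 * increasing direction, negative means use the wraparound edge. */
static int ring_step(int a, int b, int n)
{
    int d = ((b - a) % n + n) % n;      /* forward distance, 0..n-1 */
    return (d <= n - d) ? d : d - n;    /* pick the shorter direction */
}

/* One routing step on an n x n torus: node (i, j) forwards a message
 * destined for (Di, Dj). The neighbor's coordinates are returned
 * through *ni and *nj. Rows are corrected first; when both distances
 * are zero the message has arrived and is delivered locally. */
void next_hop(int i, int j, int Di, int Dj, int n, int *ni, int *nj)
{
    int di = ring_step(i, Di, n);
    int dj = ring_step(j, Dj, n);
    *ni = i; *nj = j;
    if (di != 0)
        *ni = (i + (di > 0 ? 1 : -1) + n) % n;
    else if (dj != 0)
        *nj = (j + (dj > 0 ? 1 : -1) + n) % n;
}
```

For example, on an 8 x 8 torus, a message at (0, 0) headed for (7, 3) first hops across the wraparound edge to row 7 (distance 1 backward beats 7 forward), then walks the column forward; every hop decreases the remaining ring distance by exactly 1, so the path is shortest.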

Presentation Evaluation Criteria

Syllabus