NPRG058: Advanced Programming in Parallel Environment
Assignment 2 - CUDA Histogram III (assignment specification)
Complete the histogram algorithm using the techniques you learned in the last lecture. Specifically, improve the data transfers between the host and the GPU beyond the baseline (a single memcpy in each direction). Implement an iterative approach where data transfers overlap with computations (optionally use pinned memory). As a second approach, choose either memory mapping or unified memory (with explicit prefetching).
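The overlap pattern typically uses two (or more) CUDA streams with alternating staging buffers, so that the copy of chunk k+1 runs concurrently with the kernel processing chunk k. A minimal sketch of the idea follows; the kernel shown is a trivial one-update-per-thread histogram, and names like `histogramKernel` and `NUM_BINS` are illustrative, not part of the provided framework:

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstring>

#define NUM_BINS 256  // assumed default histogram size

__global__ void histogramKernel(const unsigned char *data, size_t n, unsigned *bins)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);
}

void overlappedHistogram(const unsigned char *input, size_t total,
                         unsigned *dBins, size_t chunkSize, int blockSize)
{
    cudaStream_t streams[2];
    unsigned char *hStage[2], *dChunk[2];
    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&streams[i]);
        // Pinned (page-locked) staging buffer enables truly async copies.
        cudaHostAlloc(&hStage[i], chunkSize, cudaHostAllocDefault);
        cudaMalloc(&dChunk[i], chunkSize);
    }

    for (size_t offset = 0, c = 0; offset < total; offset += chunkSize, ++c) {
        int s = c % 2;                       // alternate the two buffers
        size_t len = std::min(chunkSize, total - offset);
        cudaStreamSynchronize(streams[s]);   // wait until this buffer is free
        memcpy(hStage[s], input + offset, len);  // stage into pinned memory
        cudaMemcpyAsync(dChunk[s], hStage[s], len,
                        cudaMemcpyHostToDevice, streams[s]);
        unsigned blocks = (unsigned)((len + blockSize - 1) / blockSize);
        histogramKernel<<<blocks, blockSize, 0, streams[s]>>>(dChunk[s], len, dBins);
    }

    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaFreeHost(hStage[i]);
        cudaFree(dChunk[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```

The alternation guarantees that while stream 0 computes, stream 1 can already be copying its chunk; the histogram bins can stay in global memory across all chunks because additions commute.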
Do not be disappointed that the data transfers take much longer than the actual kernel (in fact, the kernel takes much longer to execute when mapping/unified memory is used, since the transfer cost is amortized into it). It is also part of the experience to remember that data transfers are expensive, so we need to choose carefully which tasks are suitable for the GPU and which are not.
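With unified memory the "transfer" happens via page migration, so without prefetching it shows up as page faults inside the kernel. Explicit prefetching moves pages before the launch, which is what the assignment asks for. A minimal sketch with an illustrative kernel and dummy input (names and sizes are assumptions, not part of the provided framework):

```cuda
#include <cuda_runtime.h>
#include <cstddef>

#define NUM_BINS 256  // assumed default histogram size

__global__ void histogramKernel(const unsigned char *data, size_t n, unsigned *bins)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);
}

int main()
{
    const size_t total = 1 << 20;   // dummy input size for the sketch
    const int blockSize = 256;

    int dev;
    cudaGetDevice(&dev);

    unsigned char *data;
    unsigned *bins;
    cudaMallocManaged(&data, total);
    cudaMallocManaged(&bins, NUM_BINS * sizeof(unsigned));
    for (size_t i = 0; i < total; ++i) data[i] = (unsigned char)(i % 97);
    for (int b = 0; b < NUM_BINS; ++b) bins[b] = 0;

    // Prefetch pages to the GPU before the launch so the kernel does not
    // pay the page-fault cost inside its own execution time.
    cudaMemPrefetchAsync(data, total, dev);
    cudaMemPrefetchAsync(bins, NUM_BINS * sizeof(unsigned), dev);

    unsigned blocks = (unsigned)((total + blockSize - 1) / blockSize);
    histogramKernel<<<blocks, blockSize>>>(data, total, bins);

    // Prefetch the (small) result back before the host reads it.
    cudaMemPrefetchAsync(bins, NUM_BINS * sizeof(unsigned), cudaCpuDeviceId);
    cudaDeviceSynchronize();

    cudaFree(data);
    cudaFree(bins);
    return 0;
}
```

Note that with this variant the measured kernel time absorbs most of the migration cost, which is exactly the effect described above.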
A stretch goal is to utilize multiple GPUs in your solution. However, the final solution should use only 1 GPU by default (you may add a CLI argument --gpus that defaults to 1 when not present, for instance).
Complete Solution
The finalized solution should be submitted to your Git repository as an official assignment. Your solution needs to implement the following algorithms (please use the designated names in your CLI arguments):
- serial - CPU sequential baseline (already implemented in the given source code)
- naive - solution without any synchronization where one thread computes one histogram bin (should have been implemented in Episode I)
- atomic - simple atomic updates, no privatization, no shared memory (should have been implemented in Episode I)
- atomic_shm - atomic updates with a private copy in shared memory (no privatization in global memory; --itemsPerThread indicates how many input characters are processed by one thread, and --privCopies indicates how many histogram copies are placed in shared memory)
- overlap - same as atomic_shm, but the input data are transferred in chunks that overlap with computations (add CLI arguments: --chunkSize = number of characters in one chunk and --pinned = bool flag that indicates whether the chunk buffers have to be allocated as pinned memory)
- final - same as atomic_shm, but using either memory mapping or unified memory (your choice should be mentioned in the readme file in your repository)
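The atomic_shm idea can be sketched as follows: each block keeps --privCopies private histograms in dynamically allocated shared memory, each thread updates one of the copies (spreading atomic contention across them), and the copies are merged into global memory at the end. This is one plausible realization, not the required implementation; NUM_BINS and the kernel name are assumptions:

```cuda
#define NUM_BINS 256  // assumed default histogram size

__global__ void histogramShm(const unsigned char *data, size_t n,
                             unsigned *globalBins, int itemsPerThread,
                             int privCopies)
{
    extern __shared__ unsigned shBins[];   // privCopies * NUM_BINS counters
    for (int i = threadIdx.x; i < privCopies * NUM_BINS; i += blockDim.x)
        shBins[i] = 0;
    __syncthreads();

    // Each thread updates one of the private copies, so atomics on the
    // same bin are spread across privCopies locations.
    unsigned *myBins = shBins + (threadIdx.x % privCopies) * NUM_BINS;

    size_t start = ((size_t)blockIdx.x * blockDim.x + threadIdx.x)
                   * (size_t)itemsPerThread;
    for (int i = 0; i < itemsPerThread; ++i) {
        size_t idx = start + i;
        if (idx < n)
            atomicAdd(&myBins[data[idx]], 1u);
    }
    __syncthreads();

    // Merge the private copies and flush the block's result to global memory.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
        unsigned sum = 0;
        for (int c = 0; c < privCopies; ++c)
            sum += shBins[c * NUM_BINS + b];
        atomicAdd(&globalBins[b], sum);
    }
}
```

A matching launch would pass the dynamic shared-memory size explicitly, e.g. `histogramShm<<<blocks, blockSize, privCopies * NUM_BINS * sizeof(unsigned), stream>>>(...)`; note that privCopies is bounded by the shared memory available per block.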
All algorithms should also honor --blockSize, which sets the number of threads in each CUDA block.
Testing
To make our results directly comparable, the main measurements should be performed using the lorem_small.txt input file with the --repeatInput 32k option, which makes the input 2 GiB large. Keep the size of the histogram at the default values.
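As a quick sanity check of the target input size (this infers that lorem_small.txt is 64 KiB from the stated numbers; the actual file size is not given here):

```python
# 32k repeats of the input are stated to yield 2 GiB, which implies a
# 64 KiB source file (an inference, not a value given in the assignment).
repeats = 32 * 1024
file_size_bytes = 64 * 1024   # assumed size of lorem_small.txt
total = repeats * file_size_bytes
print(total, total == 2 * 1024**3)
```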
Experiment with the other parameters mentioned above (block sizes, number of private copies, ...). Update your CLI arguments so that the best values are set as defaults.
Measure all algorithms on volta01 and create a table (in the readme file) with the measured times for all algorithms mentioned above (report kernel time and preparation time separately).