diff --git a/README.md b/README.md
index 0e38ddb..ef14151 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,219 @@ CUDA Stream Compaction
**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
-* (TODO) YOUR NAME HERE
- * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Michael Rabbitz
+ * [LinkedIn](https://www.linkedin.com/in/mike-rabbitz)
+* Tested on: Windows 10, i7-9750H @ 2.60GHz 32GB, RTX 2060 6GB (Personal)
-### (TODO: Your README)
+## Part 1: Introduction
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+This project implements various Stream Compaction algorithms, with an emphasis on those utilizing the Scan algorithm, to highlight the importance of designing GPU hardware-optimized algorithms that leverage parallel computation for superior performance.
+### Stream Compaction
+Stream compaction involves filtering an input data set to produce a new collection that contains only the elements meeting a specified condition, while preserving their original order. This process reduces the size of the data set by removing unwanted elements, which is crucial for optimizing performance and memory usage in applications like path tracing, collision detection, etc.
+
+
+
+
+
+### Scan
+The Scan algorithm, also known as the all-prefix-sums operation, computes prefix sums for an array of data. In an exclusive scan, each element in the result is the sum of all preceding elements in the input array. Conversely, an inclusive scan includes the current element in the sum. Scan algorithms are fundamental in parallel computing for tasks like sorting, stream compaction, and building data structures, as they transform inherently sequential computations into parallel ones.
+
+
+
+
+
+## Part 2: Implementation Details
+
+**All implementations support arrays of arbitrary size**
+
+### Scan
+#### Implementations
+1. **CPU:** O(n) addition operations - sequential loop over array elements, accumulating a sum at each iteration
+2. **GPU Naive Algorithm:** O(n * log2(n)) addition operations - over log2(n) passes, for pass p starting at p = 1, compute the partial sums of n - 2p - 1 elements in parallel
+3. **GPU Work-Efficient Algorithm:** O(n) operations - performs scan into two phases: parallel upsweep (reduction) with n - 1 adds (O(n)), and parallel downsweep with n - 1 adds (O(n)) and n - 1 swaps (O(n))
+4. **GPU Naive Algorithm with Hardware Efficiency:** uses shared memory - divide the array into blocks, then scan each block in parallel over many SMs (with each block corresponding to a thread block). Utilize shared memory within each thread block to perform the scan and write the total sum of each block to a new array of block sums. Scan the array of block sums to compute an array of block increments, which are then added to each element in the corresponding scanned block from the initial scan (e.g. the zero-indexed block increment from the block increments array is added to each element in the zero-indexed scanned block of the divided input array).
+5. **GPU Work-Efficient Algorithm with Hardware Efficiency:** uses shared memory - same process as above
+6. **GPU using [Thrust CUDA library](https://nvidia.github.io/cccl/thrust):** wrapper function using thrust::exclusive_scan
+
+|Inclusive Scan - GPU Naive Algorithm|
+|:--:|
+||
+
+|Exclusive Scan - GPU Work-Efficient Algorithm|
+|:--:|
+|Upsweep
|
+|
|
+|Downsweep
|
+||
+
+|Inclusive Scan - GPU with Hardware Efficiency using shared memory|
+|:--:|
+||
+
+### Stream Compaction
+**For this project, Stream Compaction is performed on an array of randomized non-negative integers, where elements with a value of 0 are considered 'invalid' and those with positive values are considered 'valid'.**
+
+#### Stream Compaction with Scan is described in 3 steps:
+1. **Create Binary Map:** use the input array to generate a binary map array indicating the validity of each input element (0 for invalid, 1 for valid)
+2. **Scan:** perform Scan on the binary map to generate an array of indices that indicate the compacted array positions of valid input elements
+3. **Scatter:** for each index in the binary map that indicates a valid input element - use the index to retrieve the valid element data from the input array, and use the same index to retrieve the compacted array index from the Scan output array. Use these two retrieved values to place the valid element data into the compacted array
+
+
+
+
+
+#### Implementations
+1. **CPU without Scan:** sequential loop over n input elements while placing valid input data into the compacted array
+2. **CPU with Scan:** Create Binary Map with sequential loop over n input elements, Scan using CPU Scan, then Scatter with sequential loop over n elements
+3. **GPU with Work-Efficient Scan:** Create Binary Map over n input elements in one parallel pass, Scan using Work-Efficient Scan, then Scatter over n elements in one parallel pass
+4. **GPU with Work-Efficient & Hardware-Efficient Scan:** same as above line except using Work-Efficient & Hardware-Efficient Scan
+5. **GPU using [Thrust CUDA library](https://nvidia.github.io/cccl/thrust):** wrapper function using thrust::remove_if
+
+## Part 3: Performance Analysis
+### Optimal Block Size
+**For each GPU Scan implementation, find the block size that gives the minimal runtime on the GPU**
+
+**Each data point is the average of 100 runtime samples**
+
+
+
+
+| Block Size | Naive (ms) | Naive & Hardware-Eff (ms) | Work-Eff | Work-Eff & Hardware-Eff |
+| ---------- | ---------- | ------------------------- | -------- | ----------------------- |
+|32 |191.8 |24 |47.6 |13.1 |
+|64 |116.9 |14.4 |47.9 |9.7 |
+|128 |116.8 |13.8 |47.4 |9.3 |
+|256 |116.8 |14 |47.8 |9.6 |
+|512 |116.8 |14.4 |48.3 |10.5 |
+|1024 |118.1 |16.6 |50.4 |12.4 |
+
+Based on the results, it is clear that block size of 128 is optimal.
+
+### Runtime
+**Includes the CPU Scan implementation, all GPU Scan implementations, and the Thrust Scan**
+
+**Each data point is the average of 100 runtime samples**
+
+
+
+
+| Array Size | CPU (ms) | Naive (ms) | Naive & Hardware-Eff (ms) | Work-Eff | Work-Eff & Hardware-Eff | Thrust (ms) |
+| ------------ | -------- |------------ | -------------------------- | --------- | ------------------------ | ------------ |
+|210|0.0005 |0.0332 |0.0626 |0.0786 |0.0742 |0.0670 |
+|212|0.0018 |0.0390 |0.0570 |0.1122 |0.0766 |0.1143 |
+|214|0.0068 |0.0494 |0.0589 |0.1236 |0.0598 |0.0514 |
+|216|0.0241 |0.0756 |0.1145 |0.1271 |0.0776 |0.0793 |
+|218|0.1092 |0.1705 |0.2576 |0.1742 |0.1362 |0.1640 |
+|220|0.5457 |0.8490 |0.3955 |0.4744 |0.2322 |0.2773 |
+|222|2.8498 |2.9690 |0.7290 |1.4791 |0.4043 |0.3592 |
+|224|13.7399 |12.8068 |1.9275 |5.9673 |1.2438 |0.6996 |
+
+Based on the results, CPU Scan is the fastest up to an array size of 218.
+Beyond this size, all GPU Scans, except for Naive Scan, exhibit exponential speedups relative to CPU Scan.
+Work-Efficient Scan, although much faster than both CPU Scan and Naive Scan at large array sizes, begins to experience exponentially longer run times as we reach these sizes.
+The results of Naive & Hardware-Efficient Scan and Work-Efficient & Hardware-Efficient Scan demonstrate the importance of effectively utilizing shared memory (e.g. avoiding bank conflicts) and incorporating warp partitioning.
+As we approach the largest tested array sizes, Thrust clearly pulls ahead, but I am proud of how my Work-Efficient & Hardware-Efficient Scan competes closely with it!
+
+
+### Performance Bottlenecks
+
+The Work-Efficient Scan is clearly faster than the Naive Scan due to the reduced overall work involved, as described in the implementation details in the [Part 2](#part-2-implementation-details) Scan section.
+However, both implementations suffer from a significant number of read and write operations to and from global memory, resulting in poor performance compared to Hardware-Efficient implementations that utilize shared memory.
+
+Another inefficiency of the Naive Scan is the absence of warp partitioning best practices, which the other GPU Scan implementations adopt.
+Warp partitioning involves dividing threads from a block into warps, aiming to partition based on consecutive increasing thread indices.
+This approach minimizes divergent branches and allows warps to retire early, freeing up GPU resources for other available work.
+
+Lastly, while the Naive & Hardware-Efficient Scan and Work-Efficient & Hardware-Efficient Scan are optimized with shared memory and warp partitioning practices, they introduce a bottleneck known as bank conflicts that are specific to shared memory.
+Shared memory is divided into 32 banks, allowing each bank to service one address per cycle.
+Although bank conflicts were not an issue in the Naive & Hardware-Efficient Scan, they emerged during the later implementation stages of the Work-Efficient & Hardware-Efficient Scan.
+To address this, a padding element was added after every 32 shared memory elements, effectively alleviating these bank conflicts and leading to improved performance.
+
+
+
+### Sample Output
+This is a sample of the output used to test the correctness and timing of all Scan and Stream Compaction implementations.
+
+In this sample, array size is 220
+
+```
+****************
+** SCAN TESTS **
+****************
+ [ 12 47 23 8 49 44 11 3 48 22 33 4 12 ... 31 0 ]
+==== cpu scan, power-of-two ====
+ [ 0 12 59 82 90 139 183 194 197 245 267 300 304 ... 25666918 25666949 ]
+ elapsed time: 0.6517ms (std::chrono Measured)
+==== cpu scan, non-power-of-two ====
+ [ 0 12 59 82 90 139 183 194 197 245 267 300 304 ... 25666812 25666855 ]
+ elapsed time: 0.5208ms (std::chrono Measured)
+ passed
+==== naive scan, power-of-two, no shared memory ====
+ elapsed time: 1.07728ms (CUDA Measured)
+ passed
+==== naive scan, non-power-of-two, no shared memory ====
+ elapsed time: 0.843456ms (CUDA Measured)
+ passed
+==== naive scan, power-of-two, shared memory ====
+ elapsed time: 0.49872ms (CUDA Measured)
+ passed
+==== naive scan, non-power-of-two, shared memory ====
+ elapsed time: 0.477088ms (CUDA Measured)
+ passed
+==== work-efficient scan, power-of-two, no shared memory ====
+ elapsed time: 0.566848ms (CUDA Measured)
+ passed
+==== work-efficient scan, non-power-of-two, no shared memory ====
+ elapsed time: 0.468896ms (CUDA Measured)
+ passed
+==== work-efficient scan, power-of-two, shared memory ====
+ elapsed time: 0.216576ms (CUDA Measured)
+ passed
+==== work-efficient scan, non-power-of-two, shared memory ====
+ elapsed time: 0.2104ms (CUDA Measured)
+ passed
+==== thrust scan, power-of-two ====
+ elapsed time: 0.621248ms (CUDA Measured)
+ passed
+==== thrust scan, non-power-of-two ====
+ elapsed time: 0.183968ms (CUDA Measured)
+ passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+ [ 0 1 1 0 1 0 3 3 0 0 3 0 0 ... 1 0 ]
+==== cpu compact without scan, power-of-two ====
+ elapsed time: 2.4059ms (std::chrono Measured)
+ [ 1 1 1 3 3 3 1 1 1 3 1 3 1 ... 3 1 ]
+ passed
+==== cpu compact without scan, non-power-of-two ====
+ elapsed time: 2.4662ms (std::chrono Measured)
+ [ 1 1 1 3 3 3 1 1 1 3 1 3 1 ... 3 2 ]
+ passed
+==== cpu compact with scan, power-of-two ====
+ elapsed time: 4.6895ms (std::chrono Measured)
+ passed
+==== cpu compact with scan, non-power-of-two ====
+ elapsed time: 5.1045ms (std::chrono Measured)
+ passed
+==== work-efficient compact, power-of-two, no shared memory ====
+ elapsed time: 1.39776ms (CUDA Measured)
+ passed
+==== work-efficient compact, non-power-of-two, no shared memory ====
+ elapsed time: 0.817856ms (CUDA Measured)
+ passed
+==== work-efficient compact, power-of-two, shared memory ====
+ elapsed time: 0.602912ms (CUDA Measured)
+ passed
+==== work-efficient compact, non-power-of-two, shared memory ====
+ elapsed time: 0.588288ms (CUDA Measured)
+ passed
+==== thrust compact, power-of-two ====
+ elapsed time: 0.28656ms (CUDA Measured)
+ passed
+==== thrust compact, non-power-of-two ====
+ elapsed time: 0.283488ms (CUDA Measured)
+ passed
+```
diff --git a/img/compaction_with_scan.PNG b/img/compaction_with_scan.PNG
new file mode 100644
index 0000000..a6f780f
Binary files /dev/null and b/img/compaction_with_scan.PNG differ
diff --git a/img/gpu_excluisve_efficient_downsweep.PNG b/img/gpu_excluisve_efficient_downsweep.PNG
new file mode 100644
index 0000000..268ff26
Binary files /dev/null and b/img/gpu_excluisve_efficient_downsweep.PNG differ
diff --git a/img/gpu_exclusive_efficient_upsweep.PNG b/img/gpu_exclusive_efficient_upsweep.PNG
new file mode 100644
index 0000000..7344fab
Binary files /dev/null and b/img/gpu_exclusive_efficient_upsweep.PNG differ
diff --git a/img/gpu_inclusive_naive.PNG b/img/gpu_inclusive_naive.PNG
new file mode 100644
index 0000000..3f3f874
Binary files /dev/null and b/img/gpu_inclusive_naive.PNG differ
diff --git a/img/runtime_vs_arraysize.png b/img/runtime_vs_arraysize.png
new file mode 100644
index 0000000..d8e69c8
Binary files /dev/null and b/img/runtime_vs_arraysize.png differ
diff --git a/img/runtime_vs_blocksize.png b/img/runtime_vs_blocksize.png
new file mode 100644
index 0000000..94a00e0
Binary files /dev/null and b/img/runtime_vs_blocksize.png differ
diff --git a/img/scan_array_arb_size.PNG b/img/scan_array_arb_size.PNG
new file mode 100644
index 0000000..c6daf54
Binary files /dev/null and b/img/scan_array_arb_size.PNG differ
diff --git a/img/scan_visual.PNG b/img/scan_visual.PNG
new file mode 100644
index 0000000..811775a
Binary files /dev/null and b/img/scan_visual.PNG differ
diff --git a/img/stream_compaction_visual.PNG b/img/stream_compaction_visual.PNG
new file mode 100644
index 0000000..26544d4
Binary files /dev/null and b/img/stream_compaction_visual.PNG differ
diff --git a/src/main.cpp b/src/main.cpp
index 896ac2b..ec5a484 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -13,12 +13,33 @@
#include
#include "testing_helpers.hpp"
+const bool runXNumTestsAndGetAverage = false;
+const int numTests = 10;
+int testCount = 0;
+float testResults[numTests];
+
const int SIZE = 1 << 8; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
int *c = new int[SIZE];
+void printAverageOfTestsAndResetTestCount(bool ranOnGPU)
+{
+ float average = 0.f;
+ for (int i = 0; i < numTests; ++i)
+ {
+ average += testResults[i];
+ }
+ average /= numTests;
+
+ std::cout << "Average Runtime of " << numTests << " runs: ";
+ printElapsedTime(average, ranOnGPU ? "(CUDA Measured)" : "(std::chrono Measured)");
+ std::cout << std::endl;
+
+ testCount = 0;
+}
+
int main(int argc, char* argv[]) {
// Scan tests
@@ -34,66 +55,275 @@ int main(int argc, char* argv[]) {
// initialize b using StreamCompaction::CPU::scan you implement
// We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
// At first all cases passed because b && c are all zeroes.
- zeroArray(SIZE, b);
- printDesc("cpu scan, power-of-two");
- StreamCompaction::CPU::scan(SIZE, b, a);
- printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
- printArray(SIZE, b, true);
- zeroArray(SIZE, c);
- printDesc("cpu scan, non-power-of-two");
- StreamCompaction::CPU::scan(NPOT, c, a);
- printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
- printArray(NPOT, b, true);
- printCmpResult(NPOT, b, c);
+ do
+ {
+ zeroArray(SIZE, b);
+ if (testCount == 0)
+ {
+ printDesc("cpu scan, power-of-two");
+ }
+ StreamCompaction::CPU::scan(SIZE, b, a);
+ testResults[testCount] = StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation();
+ if (testCount == 0)
+ {
+ printArray(SIZE, b, true);
+ }
+ printElapsedTime(testResults[testCount], "(std::chrono Measured)");
+ } while (runXNumTestsAndGetAverage && ++testCount < numTests);
- zeroArray(SIZE, c);
- printDesc("naive scan, power-of-two");
- StreamCompaction::Naive::scan(SIZE, c, a);
- printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(SIZE, c, true);
- printCmpResult(SIZE, b, c);
-
- /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
- onesArray(SIZE, c);
- printDesc("1s array for finding bugs");
- StreamCompaction::Naive::scan(SIZE, c, a);
- printArray(SIZE, c, true); */
+ if (runXNumTestsAndGetAverage)
+ {
+ printAverageOfTestsAndResetTestCount(false);
+ }
- zeroArray(SIZE, c);
- printDesc("naive scan, non-power-of-two");
- StreamCompaction::Naive::scan(NPOT, c, a);
- printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(SIZE, c, true);
- printCmpResult(NPOT, b, c);
+ do
+ {
+ zeroArray(SIZE, c);
+ if (testCount == 0)
+ {
+ printDesc("cpu scan, non-power-of-two");
+ }
+ StreamCompaction::CPU::scan(NPOT, c, a);
+ testResults[testCount] = StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation();
+ if (testCount == 0)
+ {
+ printArray(NPOT, c, true);
+ }
+ printElapsedTime(testResults[testCount], "(std::chrono Measured)");
+ printCmpResult(NPOT, b, c);
+ } while (runXNumTestsAndGetAverage && ++testCount < numTests);
- zeroArray(SIZE, c);
- printDesc("work-efficient scan, power-of-two");
- StreamCompaction::Efficient::scan(SIZE, c, a);
- printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(SIZE, c, true);
- printCmpResult(SIZE, b, c);
+ if (runXNumTestsAndGetAverage)
+ {
+ printAverageOfTestsAndResetTestCount(false);
+ }
- zeroArray(SIZE, c);
- printDesc("work-efficient scan, non-power-of-two");
- StreamCompaction::Efficient::scan(NPOT, c, a);
- printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(NPOT, c, true);
- printCmpResult(NPOT, b, c);
+ do
+ {
+ zeroArray(SIZE, c);
+ if (testCount == 0)
+ {
+ printDesc("naive scan, power-of-two, no shared memory");
+ }
+ StreamCompaction::Naive::scan(SIZE, c, a, false);
+ testResults[testCount] = StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation();
+ if (testCount == 0)
+ {
+ //printArray(SIZE, c, true);
+ }
+ printElapsedTime(testResults[testCount], "(CUDA Measured)");
+ printCmpResult(SIZE, b, c);
+ } while (runXNumTestsAndGetAverage && ++testCount < numTests);
- zeroArray(SIZE, c);
- printDesc("thrust scan, power-of-two");
- StreamCompaction::Thrust::scan(SIZE, c, a);
- printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(SIZE, c, true);
- printCmpResult(SIZE, b, c);
+ if (runXNumTestsAndGetAverage)
+ {
+ printAverageOfTestsAndResetTestCount(true);
+ }
- zeroArray(SIZE, c);
- printDesc("thrust scan, non-power-of-two");
- StreamCompaction::Thrust::scan(NPOT, c, a);
- printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(NPOT, c, true);
- printCmpResult(NPOT, b, c);
+ /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
+ onesArray(SIZE, c);
+ printDesc("1s array for finding bugs");
+ StreamCompaction::Naive::scan(SIZE, c, a);
+ printArray(SIZE, c, true); */
+
+ do
+ {
+ zeroArray(SIZE, c);
+ if (testCount == 0)
+ {
+ printDesc("naive scan, non-power-of-two, no shared memory");
+ }
+ StreamCompaction::Naive::scan(NPOT, c, a, false);
+ testResults[testCount] = StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation();
+ if (testCount == 0)
+ {
+ //printArray(NPOT, c, true);
+ }
+ printElapsedTime(testResults[testCount], "(CUDA Measured)");
+ printCmpResult(NPOT, b, c);
+ } while (runXNumTestsAndGetAverage && ++testCount < numTests);
+
+ if (runXNumTestsAndGetAverage)
+ {
+ printAverageOfTestsAndResetTestCount(true);
+ }
+
+ do
+ {
+ zeroArray(SIZE, c);
+ if (testCount == 0)
+ {
+ printDesc("naive scan, power-of-two, shared memory");
+ }
+ StreamCompaction::Naive::scan(SIZE, c, a, true);
+ testResults[testCount] = StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation();
+ if (testCount == 0)
+ {
+ //printArray(SIZE, c, true);
+ }
+ printElapsedTime(testResults[testCount], "(CUDA Measured)");
+ printCmpResult(SIZE, b, c);
+ } while (runXNumTestsAndGetAverage && ++testCount < numTests);
+
+ if (runXNumTestsAndGetAverage)
+ {
+ printAverageOfTestsAndResetTestCount(true);
+ }
+
+ do
+ {
+ zeroArray(SIZE, c);
+ if (testCount == 0)
+ {
+ printDesc("naive scan, non-power-of-two, shared memory");
+ }
+ StreamCompaction::Naive::scan(NPOT, c, a, true);
+ testResults[testCount] = StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation();
+ if (testCount == 0)
+ {
+ //printArray(NPOT, c, true);
+ }
+ printElapsedTime(testResults[testCount], "(CUDA Measured)");
+ printCmpResult(NPOT, b, c);
+ } while (runXNumTestsAndGetAverage && ++testCount < numTests);
+
+ if (runXNumTestsAndGetAverage)
+ {
+ printAverageOfTestsAndResetTestCount(true);
+ }
+
+ do
+ {
+ zeroArray(SIZE, c);
+ if (testCount == 0)
+ {
+ printDesc("work-efficient scan, power-of-two, no shared memory");
+ }
+ StreamCompaction::Efficient::scan(SIZE, c, a, false);
+ testResults[testCount] = StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation();
+ if (testCount == 0)
+ {
+ //printArray(SIZE, c, true);
+ }
+ printElapsedTime(testResults[testCount], "(CUDA Measured)");
+ printCmpResult(SIZE, b, c);
+ } while (runXNumTestsAndGetAverage && ++testCount < numTests);
+
+ if (runXNumTestsAndGetAverage)
+ {
+ printAverageOfTestsAndResetTestCount(true);
+ }
+
+ do
+ {
+ zeroArray(SIZE, c);
+ if (testCount == 0)
+ {
+ printDesc("work-efficient scan, non-power-of-two, no shared memory");
+ }
+ StreamCompaction::Efficient::scan(NPOT, c, a, false);
+ testResults[testCount] = StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation();
+ if (testCount == 0)
+ {
+ //printArray(NPOT, c, true);
+ }
+ printElapsedTime(testResults[testCount], "(CUDA Measured)");
+ printCmpResult(NPOT, b, c);
+ } while (runXNumTestsAndGetAverage && ++testCount < numTests);
+
+ if (runXNumTestsAndGetAverage)
+ {
+ printAverageOfTestsAndResetTestCount(true);
+ }
+
+ do
+ {
+ zeroArray(SIZE, c);
+ if (testCount == 0)
+ {
+ printDesc("work-efficient scan, power-of-two, shared memory");
+ }
+ StreamCompaction::Efficient::scan(SIZE, c, a, true);
+ testResults[testCount] = StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation();
+ if (testCount == 0)
+ {
+ //printArray(SIZE, c, true);
+ }
+ printElapsedTime(testResults[testCount], "(CUDA Measured)");
+ printCmpResult(SIZE, b, c);
+ } while (runXNumTestsAndGetAverage && ++testCount < numTests);
+
+ if (runXNumTestsAndGetAverage)
+ {
+ printAverageOfTestsAndResetTestCount(true);
+ }
+
+ do
+ {
+ zeroArray(SIZE, c);
+ if (testCount == 0)
+ {
+ printDesc("work-efficient scan, non-power-of-two, shared memory");
+ }
+ StreamCompaction::Efficient::scan(NPOT, c, a, true);
+ testResults[testCount] = StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation();
+ if (testCount == 0)
+ {
+ //printArray(NPOT, c, true);
+ }
+ printElapsedTime(testResults[testCount], "(CUDA Measured)");
+ printCmpResult(NPOT, b, c);
+ } while (runXNumTestsAndGetAverage && ++testCount < numTests);
+
+ if (runXNumTestsAndGetAverage)
+ {
+ printAverageOfTestsAndResetTestCount(true);
+ }
+
+ do
+ {
+ zeroArray(SIZE, c);
+ if (testCount == 0)
+ {
+ printDesc("thrust scan, power-of-two");
+ }
+ StreamCompaction::Thrust::scan(SIZE, c, a);
+ testResults[testCount] = StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation();
+ if (testCount == 0)
+ {
+ //printArray(SIZE, c, true);
+ }
+ printElapsedTime(testResults[testCount], "(CUDA Measured)");
+ printCmpResult(SIZE, b, c);
+ } while (runXNumTestsAndGetAverage && ++testCount < numTests);
+
+ if (runXNumTestsAndGetAverage)
+ {
+ printAverageOfTestsAndResetTestCount(true);
+ }
+
+ do
+ {
+ zeroArray(SIZE, c);
+ if (testCount == 0)
+ {
+ printDesc("thrust scan, non-power-of-two");
+ }
+ StreamCompaction::Thrust::scan(NPOT, c, a);
+ testResults[testCount] = StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation();
+ if (testCount == 0)
+ {
+ //printArray(NPOT, c, true);
+ }
+ printElapsedTime(testResults[testCount], "(CUDA Measured)");
+ printCmpResult(NPOT, b, c);
+ } while (runXNumTestsAndGetAverage && ++testCount < numTests);
+
+ if (runXNumTestsAndGetAverage)
+ {
+ printAverageOfTestsAndResetTestCount(true);
+ }
printf("\n");
printf("*****************************\n");
@@ -127,26 +357,61 @@ int main(int argc, char* argv[]) {
printCmpLenResult(count, expectedNPOT, b, c);
zeroArray(SIZE, c);
- printDesc("cpu compact with scan");
+ printDesc("cpu compact with scan, power-of-two");
count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
- printArray(count, c, true);
+ //printArray(count, c, true);
+ printCmpLenResult(count, expectedCount, b, c);
+
+ zeroArray(SIZE, c);
+ printDesc("cpu compact with scan, non-power-of-two");
+ count = StreamCompaction::CPU::compactWithScan(NPOT, c, a);
+ printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+ //printArray(count, c, true);
+ printCmpLenResult(count, expectedNPOT, b, c);
+
+ zeroArray(SIZE, c);
+ printDesc("work-efficient compact, power-of-two, no shared memory");
+ count = StreamCompaction::Efficient::compact(SIZE, c, a, false);
+ printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ //printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);
zeroArray(SIZE, c);
- printDesc("work-efficient compact, power-of-two");
- count = StreamCompaction::Efficient::compact(SIZE, c, a);
+ printDesc("work-efficient compact, non-power-of-two, no shared memory");
+ count = StreamCompaction::Efficient::compact(NPOT, c, a, false);
+ printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ //printArray(count, c, true);
+ printCmpLenResult(count, expectedNPOT, b, c);
+
+ zeroArray(SIZE, c);
+ printDesc("work-efficient compact, power-of-two, shared memory");
+ count = StreamCompaction::Efficient::compact(SIZE, c, a, true);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);
zeroArray(SIZE, c);
- printDesc("work-efficient compact, non-power-of-two");
- count = StreamCompaction::Efficient::compact(NPOT, c, a);
+ printDesc("work-efficient compact, non-power-of-two, shared memory");
+ count = StreamCompaction::Efficient::compact(NPOT, c, a, true);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);
+ zeroArray(SIZE, c);
+ printDesc("thrust compact, power-of-two");
+ count = StreamCompaction::Thrust::compact(SIZE, c, a);
+ printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ //printArray(count, c, true);
+ printCmpLenResult(count, expectedCount, b, c);
+
+ zeroArray(SIZE, c);
+ printDesc("thrust compact, non-power-of-two");
+ count = StreamCompaction::Thrust::compact(NPOT, c, a);
+ printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ //printArray(count, c, true);
+ printCmpLenResult(count, expectedNPOT, b, c);
+
system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index 2ed6d63..c2aa1b6 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -23,7 +23,14 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
- // TODO
+
+ int g_index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (g_index >= n)
+ {
+ return;
+ }
+
+ bools[g_index] = idata[g_index] == 0 ? 0 : 1;
}
/**
@@ -32,7 +39,16 @@ namespace StreamCompaction {
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
- // TODO
+
+ int g_index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+ // index out of bounds or not a "hit"
+ if ((g_index >= n) || (bools[g_index] == 0))
+ {
+ return;
+ }
+
+ odata[indices[g_index]] = idata[g_index];
}
}
diff --git a/stream_compaction/common.h b/stream_compaction/common.h
index d2c1fed..37ea45f 100644
--- a/stream_compaction/common.h
+++ b/stream_compaction/common.h
@@ -2,6 +2,7 @@
#include
#include
+#include
#include
#include
@@ -10,6 +11,8 @@
#include
#include
+#define blockSize 32
+
#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 719fa11..7b43002 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -17,9 +17,23 @@ namespace StreamCompaction {
* For performance analysis, this is supposed to be a simple for loop.
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
*/
+
+ void exclusivePrefixSum(const int n, const int* idata, int* odata)
+ {
+ int prefixSum = 0;
+
+ for (int i = 0; i < n; ++i)
+ {
+ odata[i] = prefixSum;
+ prefixSum += idata[i];
+ }
+ }
+
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
- // TODO
+
+ exclusivePrefixSum(n, idata, odata);
+
timer().endCpuTimer();
}
@@ -30,9 +44,20 @@ namespace StreamCompaction {
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
- // TODO
+
+ int compactedCount = 0;
+
+ for (int i = 0; i < n; ++i)
+ {
+ if (idata[i] != 0)
+ {
+ odata[compactedCount++] = idata[i];
+ }
+ }
+
timer().endCpuTimer();
- return -1;
+
+ return compactedCount;
}
/**
@@ -42,9 +67,43 @@ namespace StreamCompaction {
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
- // TODO
+
+ if (n <= 0)
+ {
+ timer().endCpuTimer();
+ return 0;
+ }
+
+ int* binaryMap = new int[n];
+ int* exclusivePrefixSumResult = new int[n];
+
+ // map the input to an array of 0s and 1s
+ for (int i = 0; i < n; ++i)
+ {
+ binaryMap[i] = idata[i] == 0 ? 0 : 1;
+ }
+
+ // scan the array of 0s and 1s
+ exclusivePrefixSum(n, binaryMap, exclusivePrefixSumResult);
+
+ // scatter
+ int compactedCount = 0;
+
+ for (int i = 0; i < n; ++i)
+ {
+ if (binaryMap[i] == 1)
+ {
+ odata[exclusivePrefixSumResult[i]] = idata[i];
+ ++compactedCount;
+ }
+ }
+
timer().endCpuTimer();
- return -1;
+
+ delete[] binaryMap;
+ delete[] exclusivePrefixSumResult;
+
+ return compactedCount;
}
}
}
diff --git a/stream_compaction/cpu.h b/stream_compaction/cpu.h
index 873c047..4e044a8 100644
--- a/stream_compaction/cpu.h
+++ b/stream_compaction/cpu.h
@@ -6,6 +6,8 @@ namespace StreamCompaction {
namespace CPU {
StreamCompaction::Common::PerformanceTimer& timer();
+ void exclusivePrefixSum(const int n, const int* idata, int* odata);
+
void scan(int n, int *odata, const int *idata);
int compactWithoutScan(int n, int *odata, const int *idata);
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346e..ccb33c5 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -3,6 +3,15 @@
#include "common.h"
#include "efficient.h"
+#define NUM_BANKS 32
+#define LOG_NUM_BANKS 5
+#ifdef ZERO_BANK_CONFLICTS
+#define CONFLICT_FREE_OFFSET(n) \
+ ((n) >> (LOG_NUM_BANKS) + (n) >> (2 * LOG_NUM_BANKS))
+#else
+#define CONFLICT_FREE_OFFSET(n) ((n) >> LOG_NUM_BANKS)
+#endif
+
namespace StreamCompaction {
namespace Efficient {
using StreamCompaction::Common::PerformanceTimer;
@@ -12,13 +21,296 @@ namespace StreamCompaction {
return timer;
}
+ __global__ void kernelReductionPass(const int reqThdsForPass, const int offset, int* data)
+ {
+ int g_index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+ if (g_index >= reqThdsForPass)
+ {
+ return;
+ }
+
+ int left_node_g_index = offset * (2 * g_index + 1) - 1;
+ int right_node_g_index = offset * (2 * g_index + 2) - 1;
+
+ data[right_node_g_index] += data[left_node_g_index];
+ }
+
+ __global__ void kernelDownSweepPass(const int reqThdsForPass, const int offset, int* data)
+ {
+ int g_index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+ if (g_index >= reqThdsForPass)
+ {
+ return;
+ }
+
+ int left_node_g_index = offset * (2 * g_index + 1) - 1;
+ int right_node_g_index = offset * (2 * g_index + 2) - 1;
+
+ int temp = data[left_node_g_index];
+ data[left_node_g_index] = data[right_node_g_index];
+ data[right_node_g_index] += temp;
+ }
+
+ __global__ void kernelEfficientExclusivePrefixSumByBlock(const int reqThdsPerBlock, int* data, int* blockSums)
+ {
+ // allocated on invocation
+ extern __shared__ int shared[];
+
+ int tx = threadIdx.x;
+
+ if (tx >= reqThdsPerBlock)
+ {
+ return;
+ }
+
+ int g_index = 2 * (blockIdx.x * blockDim.x) + tx;
+
+ int ai = tx;
+ int bi = tx + reqThdsPerBlock;
+ int g_ai = g_index;
+ int g_bi = g_index + reqThdsPerBlock;
+
+ int bankOffsetA = CONFLICT_FREE_OFFSET(ai);
+ int bankOffsetB = CONFLICT_FREE_OFFSET(bi);
+
+ // Load input into shared memory
+ shared[ai + bankOffsetA] = data[g_ai];
+ shared[bi + bankOffsetB] = data[g_bi];
+
+ int offset = 1;
+
+ // Upsweep (Reduction) phase
+ for (int reqThdsForPass = reqThdsPerBlock; reqThdsForPass > 0; reqThdsForPass >>= 1)
+ {
+ __syncthreads();
+
+ if (tx < reqThdsForPass)
+ {
+ int left_node_shrd_index = offset * (2 * tx + 1) - 1;
+ int right_node_shrd_index = offset * (2 * tx + 2) - 1;
+ left_node_shrd_index += CONFLICT_FREE_OFFSET(left_node_shrd_index);
+ right_node_shrd_index += CONFLICT_FREE_OFFSET(right_node_shrd_index);
+
+ shared[right_node_shrd_index] += shared[left_node_shrd_index];
+ }
+
+ offset *= 2;
+ }
+
+ // Write the block sum and then clear it for the Downsweep
+ if (tx == 0)
+ {
+ int n_minus_1 = 2 * reqThdsPerBlock - 1;
+ n_minus_1 += CONFLICT_FREE_OFFSET(n_minus_1);
+
+ blockSums[blockIdx.x] = shared[n_minus_1];
+ shared[n_minus_1] = 0;
+ }
+
+ // Downsweep phase
+ for (int reqThdsForPass = 1; reqThdsForPass <= reqThdsPerBlock; reqThdsForPass *= 2)
+ {
+ offset >>= 1;
+ __syncthreads();
+
+ if (tx < reqThdsForPass)
+ {
+ int left_node_shrd_index = offset * (2 * tx + 1) - 1;
+ int right_node_shrd_index = offset * (2 * tx + 2) - 1;
+ left_node_shrd_index += CONFLICT_FREE_OFFSET(left_node_shrd_index);
+ right_node_shrd_index += CONFLICT_FREE_OFFSET(right_node_shrd_index);
+
+ int temp = shared[left_node_shrd_index];
+ shared[left_node_shrd_index] = shared[right_node_shrd_index];
+ shared[right_node_shrd_index] += temp;
+ }
+ }
+
+ __syncthreads();
+
+ // Write the results back to global memory
+ data[g_ai] = shared[ai + bankOffsetA];
+ data[g_bi] = shared[bi + bankOffsetB];
+ }
+
+ __global__ void kernelAddBlockSumsToBlockData(const int* blockSums, int* data)
+ {
+ int g_index = 2 * ((blockIdx.x * blockDim.x) + threadIdx.x);
+
+ // ***** blockSize must be a power of 2 ***** //
+ // no reason to check bounds because this kernel is only called if we recursively scan,
+ // meaning the buffer length (of int* data) is going to be (2 * (blockSize)) * (2 to the power of n),
+ // where n is the depth of recursion. To reiterate the first line, n is always >= 1
+
+ data[g_index] += blockSums[blockIdx.x];
+ data[g_index + 1] += blockSums[blockIdx.x];
+ }
+
/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
- void scan(int n, int *odata, const int *idata) {
- timer().startGpuTimer();
- // TODO
- timer().endGpuTimer();
+ void scan(int n, int* odata, const int* idata, bool useSharedMemory)
+ {
+ if (useSharedMemory)
+ {
+ efficientExclusivePrefixSumSharedMemory(true, n, idata, odata);
+ }
+ else
+ {
+ efficientExclusivePrefixSum(true, n, idata, odata);
+ }
+ }
+
+ void efficientExclusivePrefixSum(const bool useGpuTimer, const int n, const int* idata, int* odata)
+ {
+ int totalPasses = ilog2ceil(n);
+ int bufferLength = 1 << totalPasses;
+
+ int* dev_data;
+
+ cudaMalloc((void**)&dev_data, sizeof(int) * bufferLength);
+ checkCUDAError("cudaMalloc dev_data failed!");
+
+ cudaMemcpy(dev_data, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+ checkCUDAError("memcpy idata to dev_data failed!");
+
+ if (bufferLength > n)
+ {
+ cudaMemset(dev_data + n, 0, sizeof(int) * (bufferLength - n));
+ checkCUDAError("cudaMemset elements n to bufferLength - 1 in dev_data to 0 failed!");
+ }
+
+ int blocksPerGrid = 0;
+ int offset = 1;
+
+ if (useGpuTimer) timer().startGpuTimer();
+
+
+ for (int reqThdsForPass = bufferLength >> 1; reqThdsForPass > 0; reqThdsForPass >>= 1)
+ {
+ blocksPerGrid = (reqThdsForPass + blockSize - 1) / blockSize;
+ kernelReductionPass<<>>(reqThdsForPass, offset, dev_data);
+ checkCUDAError("kernelReductionPass failed!");
+
+ offset *= 2;
+ }
+
+ cudaMemset(dev_data + bufferLength - 1, 0, sizeof(int));
+ checkCUDAError("cudaMemset last element in dev_data to 0 failed!");
+
+ for (int reqThdsForPass = 1; reqThdsForPass < bufferLength; reqThdsForPass *= 2)
+ {
+ offset >>= 1;
+
+ blocksPerGrid = (reqThdsForPass + blockSize - 1) / blockSize;
+ kernelDownSweepPass<<>>(reqThdsForPass, offset, dev_data);
+ checkCUDAError("kernelDownSweepPass failed!");
+ }
+
+
+ if (useGpuTimer) timer().endGpuTimer();
+
+ cudaMemcpy(odata, dev_data, sizeof(int) * n, cudaMemcpyDeviceToHost);
+ checkCUDAError("memcpy dev_data to odata failed!");
+
+ cudaFree(dev_data);
+ checkCUDAError("cudaFree failed!");
+ }
+
+ void efficientExclusivePrefixSumSharedMemory(const bool useGpuTimer, const int n, const int* idata, int* odata)
+ {
+ int totalPasses = ilog2ceil(n);
+ int bufferLength = 1 << totalPasses;
+ int reqThdsTotal = bufferLength >> 1;
+ int blocksPerGrid = (reqThdsTotal + blockSize - 1) / blockSize;
+
+ int* dev_data;
+ int* dev_sums;
+
+ cudaMalloc((void**)&dev_data, sizeof(int) * bufferLength);
+ checkCUDAError("cudaMalloc dev_data failed!");
+
+ cudaMalloc((void**)&dev_sums, sizeof(int) * blocksPerGrid);
+ checkCUDAError("cudaMalloc dev_sums failed!");
+
+ cudaMemcpy(dev_data, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+ checkCUDAError("memcpy idata to dev_data failed!");
+
+ if (bufferLength > n)
+ {
+ cudaMemset(dev_data + n, 0, sizeof(int) * (bufferLength - n));
+ checkCUDAError("cudaMemset elements n to bufferLength - 1 in dev_data to 0 failed!");
+ }
+
+ int reqThdsPerBlock = std::min(reqThdsTotal, blockSize);
+ int procElemsPerBlock = 2 * reqThdsPerBlock;
+ int sharedMemoryBytes = sizeof(int) * (procElemsPerBlock + (procElemsPerBlock / NUM_BANKS) + (procElemsPerBlock / (NUM_BANKS * NUM_BANKS)));
+
+ if (useGpuTimer) timer().startGpuTimer();
+
+
+ efficientExclusivePrefixSumAnyNumberOfBlocks(sharedMemoryBytes, reqThdsPerBlock, blocksPerGrid, dev_data, dev_sums);
+
+
+ if (useGpuTimer) timer().endGpuTimer();
+
+ cudaMemcpy(odata, dev_data, sizeof(int) * n, cudaMemcpyDeviceToHost);
+ checkCUDAError("memcpy dev_data to odata failed!");
+
+ cudaFree(dev_data);
+ cudaFree(dev_sums);
+ checkCUDAError("cudaFree failed!");
+ }
+
+ // iterative approach is possible
+ // for the sake of submitting this assignement on time, this will have to be explored at a later time
+ void efficientExclusivePrefixSumAnyNumberOfBlocks(const int sharedMemoryBytes, const int reqThdsPerBlock, const int blocksPerGrid, int* data, int* blockSums)
+ {
+ kernelEfficientExclusivePrefixSumByBlock<<>>(reqThdsPerBlock, data, blockSums);
+ checkCUDAError("kernelEfficientExclusivePrefixSumByBlock failed!");
+
+ if (blocksPerGrid > 1)
+ {
+ int n_blockSums = blocksPerGrid;
+
+ int totalPasses_blockSums = ilog2ceil(n_blockSums);
+ int bufferLength_blockSums = 1 << totalPasses_blockSums;
+ int reqThdsTotal_blockSums = bufferLength_blockSums >> 1;
+ int blocksPerGrid_blockSums = (reqThdsTotal_blockSums + blockSize - 1) / blockSize;
+
+ int* dev_sums;
+ int* dev_new_sums;
+
+ cudaMalloc((void**)&dev_sums, sizeof(int) * bufferLength_blockSums);
+ checkCUDAError("cudaMalloc dev_sums failed!");
+
+ cudaMalloc((void**)&dev_new_sums, sizeof(int) * blocksPerGrid_blockSums);
+ checkCUDAError("cudaMalloc dev_new_sums failed!");
+
+ cudaMemcpy(dev_sums, blockSums, sizeof(int) * n_blockSums, cudaMemcpyDeviceToDevice);
+ checkCUDAError("memcpy blockSums to dev_sums failed!");
+
+ if (bufferLength_blockSums > n_blockSums)
+ {
+ cudaMemset(dev_sums + n_blockSums, 0, sizeof(int) * (bufferLength_blockSums - n_blockSums));
+ checkCUDAError("cudaMemset elements n_blockSums to bufferLength_blockSums - 1 in dev_sums to 0 failed!");
+ }
+
+ int reqThdsPerBlock_blockSums = std::min(reqThdsTotal_blockSums, blockSize);
+ int procElemsPerBlock_blockSums = 2 * reqThdsPerBlock_blockSums;
+ int sharedMemoryBytes_blockSums = sizeof(int) * (procElemsPerBlock_blockSums + (procElemsPerBlock_blockSums / NUM_BANKS) + (procElemsPerBlock_blockSums / (NUM_BANKS * NUM_BANKS)));
+
+ efficientExclusivePrefixSumAnyNumberOfBlocks(sharedMemoryBytes_blockSums, reqThdsPerBlock_blockSums, blocksPerGrid_blockSums, dev_sums, dev_new_sums);
+
+ kernelAddBlockSumsToBlockData<<>>(dev_sums, data);
+ checkCUDAError("kernelAddBlockSumsToBlockData failed!");
+
+ cudaFree(dev_sums);
+ cudaFree(dev_new_sums);
+ checkCUDAError("cudaFree failed!");
+ }
}
/**
@@ -30,11 +322,70 @@ namespace StreamCompaction {
* @param idata The array of elements to compact.
* @returns The number of elements remaining after compaction.
*/
- int compact(int n, int *odata, const int *idata) {
+ int compact(int n, int* odata, const int* idata, bool useSharedMemory)
+ {
+ int* dev_idata;
+ int* dev_odata;
+ int* dev_binaryMap;
+ int* dev_exclusivePrefixSumResult;
+
+ cudaMalloc((void**)&dev_idata, sizeof(int) * n);
+ checkCUDAError("cudaMalloc dev_idata failed!");
+
+ cudaMalloc((void**)&dev_odata, sizeof(int) * n);
+ checkCUDAError("cudaMalloc dev_odata failed!");
+
+ cudaMalloc((void**)&dev_binaryMap, sizeof(int) * n);
+ checkCUDAError("cudaMalloc dev_binaryMap failed!");
+
+ cudaMalloc((void**)&dev_exclusivePrefixSumResult, sizeof(int) * n);
+ checkCUDAError("cudaMalloc dev_exclusivePrefixSumResult failed!");
+
+ cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+ checkCUDAError("memcpy idata to dev_idata failed!");
+
+ int blocksPerGrid = (n + blockSize - 1) / blockSize;
+
timer().startGpuTimer();
- // TODO
+
+
+ Common::kernMapToBoolean<<>>(n, dev_binaryMap, dev_idata);
+ checkCUDAError("kernMapToBoolean failed!");
+
+ if (useSharedMemory)
+ {
+ efficientExclusivePrefixSumSharedMemory(false, n, dev_binaryMap, dev_exclusivePrefixSumResult);
+ }
+ else
+ {
+ efficientExclusivePrefixSum(false, n, dev_binaryMap, dev_exclusivePrefixSumResult);
+ }
+
+ Common::kernScatter<<>>(n, dev_odata, dev_idata, dev_binaryMap, dev_exclusivePrefixSumResult);
+ checkCUDAError("kernScatter failed!");
+
+
timer().endGpuTimer();
- return -1;
+
+ int compactedCount;
+ cudaMemcpy(&compactedCount, dev_exclusivePrefixSumResult + (n - 1), sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("cudaMemcpy dev_exclusivePrefixSumResult last element to compactedCount failed!");
+
+ if (idata[n - 1] != 0)
+ {
+ ++compactedCount;
+ }
+
+ cudaMemcpy(odata, dev_odata, sizeof(int) * compactedCount, cudaMemcpyDeviceToHost);
+ checkCUDAError("memcpy dev_odata to odata failed!");
+
+ cudaFree(dev_idata);
+ cudaFree(dev_odata);
+ cudaFree(dev_binaryMap);
+ cudaFree(dev_exclusivePrefixSumResult);
+ checkCUDAError("cudaFree failed!");
+
+ return compactedCount;
}
}
}
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
index 803cb4f..0de9d06 100644
--- a/stream_compaction/efficient.h
+++ b/stream_compaction/efficient.h
@@ -6,8 +6,22 @@ namespace StreamCompaction {
namespace Efficient {
StreamCompaction::Common::PerformanceTimer& timer();
- void scan(int n, int *odata, const int *idata);
+ __global__ void kernelReductionPass(const int reqThdsForPass, const int offset, int* data);
- int compact(int n, int *odata, const int *idata);
+ __global__ void kernelDownSweepPass(const int reqThdsForPass, const int offset, int* data);
+
+ __global__ void kernelEfficientExclusivePrefixSumByBlock(const int reqThdsPerBlock, int* data, int* blockSums);
+
+ __global__ void kernelAddBlockSumsToBlockData(const int* blockSums, int* data);
+
+ void scan(int n, int* odata, const int* idata, bool useSharedMemory);
+
+ void efficientExclusivePrefixSum(const bool useGpuTimer, const int n, const int* idata, int* odata);
+
+ void efficientExclusivePrefixSumSharedMemory(const bool useGpuTimer, const int n, const int* idata, int* odata);
+
+ void efficientExclusivePrefixSumAnyNumberOfBlocks(const int sharedMemoryBytes, const int reqThdsPerBlock, const int blocksPerGrid, int* data, int* blockSums);
+
+ int compact(int n, int* odata, const int* idata, bool useSharedMemory);
}
}
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 4308876..2331673 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -11,15 +11,296 @@ namespace StreamCompaction {
static PerformanceTimer timer;
return timer;
}
- // TODO: __global__
+
+ __global__ void kernelInclusiveToExclusivePrefixSum(const int n, const int* idata, int* odata)
+ {
+ int g_index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (g_index >= n)
+ {
+ return;
+ }
+
+ if (g_index > 0)
+ {
+ odata[g_index] = idata[g_index - 1];
+ }
+ else if (g_index == 0)
+ {
+ odata[g_index] = 0;
+ }
+ }
+
+ __global__ void kernelNaiveInclusivePrefixSumPass(const int n, const int offset, const int* idata, int* odata)
+ {
+ int g_index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (g_index >= n)
+ {
+ return;
+ }
+
+ if (g_index >= offset)
+ {
+ odata[g_index] = idata[g_index] + idata[g_index - offset];
+ }
+ else
+ {
+ odata[g_index] = idata[g_index];
+ }
+ }
+
+ __global__ void kernelNaiveInclusivePrefixSumByBlock(const int n, const int* idata, int* odata)
+ {
+ // allocated on invocation
+ // double buffer for ping-ponging
+ extern __shared__ int shared[];
+
+ int g_index = blockIdx.x * blockDim.x + threadIdx.x;
+ if (g_index >= n)
+ {
+ return;
+ }
+
+ int tx = threadIdx.x;
+
+ // Load input into shared memory
+ // Only need to write to the first half since our first write will be to the second half
+ shared[tx] = idata[g_index];
+ __syncthreads();
+
+ // identify which half of double buffer is read-half and write-half
+ int writeBuffer = 0;
+ int readBuffer = 1;
+
+ for (int offset = 1; offset < blockSize; offset *= 2)
+ {
+ // swap double buffer indices
+ writeBuffer = 1 - writeBuffer;
+ readBuffer = 1 - writeBuffer;
+
+ if (tx >= offset)
+ {
+ shared[writeBuffer * blockSize + tx] = shared[readBuffer * blockSize + tx] + shared[readBuffer * blockSize + tx - offset];
+ }
+ else
+ {
+ shared[writeBuffer * blockSize + tx] = shared[readBuffer * blockSize + tx];
+ }
+ __syncthreads();
+ }
+
+ // write output
+ odata[g_index] = shared[writeBuffer * blockSize + tx];
+ }
+
+ __global__ void kernelNaiveExclusivePrefixSumByBlock(const int n, const int* idata, int* odata)
+ {
+ // allocated on invocation
+ // double buffer for ping-ponging
+ extern __shared__ int shared[];
+
+ int g_index = blockIdx.x * blockDim.x + threadIdx.x;
+ if (g_index >= n)
+ {
+ return;
+ }
+
+ int tx = threadIdx.x;
+
+ // Load input into shared memory
+ // Exclusive scan - shift all elements right by one and set first element to 0
+ // Only need to write to the first half since our first write will be to the second half
+ shared[tx] = (tx > 0) ? idata[g_index - 1] : 0;
+ __syncthreads();
+
+ // identify which half of double buffer is read-half and write-half
+ int writeBuffer = 0;
+ int readBuffer = 1;
+
+ for (int offset = 1; offset < blockSize; offset *= 2)
+ {
+ // swap double buffer indices
+ writeBuffer = 1 - writeBuffer;
+ readBuffer = 1 - writeBuffer;
+
+ if (tx >= offset)
+ {
+ shared[writeBuffer * blockSize + tx] = shared[readBuffer * blockSize + tx] + shared[readBuffer * blockSize + tx - offset];
+ }
+ else
+ {
+ shared[writeBuffer * blockSize + tx] = shared[readBuffer * blockSize + tx];
+ }
+ __syncthreads();
+ }
+
+ // write output
+ odata[g_index] = shared[writeBuffer * blockSize + tx];
+ }
+
+ __global__ void kernelExtractBlockSums(const int n, const int n_blockSums, const int* idata, int* odata)
+ {
+ int g_index = blockIdx.x * blockDim.x + threadIdx.x;
+ if (g_index >= n_blockSums)
+ {
+ return;
+ }
+
+ odata[g_index] = g_index == n_blockSums - 1 ? idata[n - 1] : idata[(g_index * blockSize) + blockSize - 1];
+ }
+
+ __global__ void kernelAddBlockSumsToBlockData(const int n, const int* blockSums, int* data)
+ {
+ int g_index = blockIdx.x * blockDim.x + threadIdx.x;
+ if (g_index >= n)
+ {
+ return;
+ }
+
+ data[g_index] += blockSums[blockIdx.x];
+ }
/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
- void scan(int n, int *odata, const int *idata) {
+ void scan(int n, int* odata, const int* idata, bool useSharedMemory)
+ {
+ if (useSharedMemory)
+ {
+ naiveExclusivePrefixSumSharedMemory(n, idata, odata);
+ }
+ else
+ {
+ naiveExclusivePrefixSum(n, idata, odata);
+ }
+ }
+
+ void naiveExclusivePrefixSum(const int n, const int* idata, int* odata)
+ {
+ int* dev_bufferA;
+ int* dev_bufferB;
+
+ cudaMalloc((void**)&dev_bufferA, sizeof(int) * n);
+ checkCUDAError("cudaMalloc dev_bufferA failed!");
+
+ cudaMalloc((void**)&dev_bufferB, sizeof(int) * n);
+ checkCUDAError("cudaMalloc dev_bufferB failed!");
+
+ cudaMemcpy(dev_bufferA, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+ checkCUDAError("memcpy idata to dev_bufferA failed!");
+
+ int blocksPerGrid = (n + blockSize - 1) / blockSize;
+
+ timer().startGpuTimer();
+
+
+ for (int offset = 1; offset < n; offset *= 2)
+ {
+ kernelNaiveInclusivePrefixSumPass<<>>(n, offset, dev_bufferA, dev_bufferB);
+ checkCUDAError("kernelNaiveInclusivePrefixSumPass failed!");
+
+ // set the input of the next iteration to the output of this iteration
+ std::swap(dev_bufferA, dev_bufferB);
+ }
+
+ kernelInclusiveToExclusivePrefixSum<<>>(n, dev_bufferA, dev_bufferB);
+ checkCUDAError("kernelInclusiveToExclusivePrefixSum failed!");
+
+
+ timer().endGpuTimer();
+
+ cudaMemcpy(odata, dev_bufferB, sizeof(int) * n, cudaMemcpyDeviceToHost);
+ checkCUDAError("memcpy dev_bufferB to odata failed!");
+
+ cudaFree(dev_bufferA);
+ cudaFree(dev_bufferB);
+ checkCUDAError("cudaFree failed!");
+ }
+
+ void naiveExclusivePrefixSumSharedMemory(const int n, const int* idata, int* odata)
+ {
+ int* dev_bufferA;
+ int* dev_bufferB;
+
+ cudaMalloc((void**)&dev_bufferA, sizeof(int) * n);
+ checkCUDAError("cudaMalloc dev_bufferA failed!");
+
+ cudaMalloc((void**)&dev_bufferB, sizeof(int) * n);
+ checkCUDAError("cudaMalloc dev_bufferB failed!");
+
+ cudaMemcpy(dev_bufferA, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+ checkCUDAError("memcpy idata to dev_bufferA failed!");
+
+ int blocksPerGrid = (n + blockSize - 1) / blockSize;
+
+ int sharedMemoryBytes = 2 * blockSize * sizeof(int);
+
timer().startGpuTimer();
- // TODO
+
+
+ naiveInclusivePrefixSumAnyNumberOfBlocks(sharedMemoryBytes, n, blocksPerGrid, dev_bufferA, dev_bufferB);
+
+ kernelInclusiveToExclusivePrefixSum<<>>(n, dev_bufferB, dev_bufferA);
+ checkCUDAError("kernelInclusiveToExclusivePrefixSum failed!");
+
+
timer().endGpuTimer();
+
+ cudaMemcpy(odata, dev_bufferA, sizeof(int) * n, cudaMemcpyDeviceToHost);
+ checkCUDAError("memcpy dev_bufferA to odata failed!");
+
+ cudaFree(dev_bufferA);
+ cudaFree(dev_bufferB);
+ checkCUDAError("cudaFree failed!");
+ }
+
+ // iterative approach is possible if the blockSums buffers are allocated carefully ahead of time, combined with clever indexing of them at each iteration
+ // for the sake of submitting this assignement on time, this will have to be explored at a later time
+ void naiveInclusivePrefixSumAnyNumberOfBlocks(const int sharedMemoryBytes, const int n, const int blocksPerGrid, int* idata, int* odata)
+ {
+ kernelNaiveInclusivePrefixSumByBlock<<>>(n, idata, odata);
+ checkCUDAError("kernelNaiveInclusivePrefixSumByBlock failed!");
+
+ if (blocksPerGrid > 1)
+ {
+ int n_blockSums = blocksPerGrid;
+
+ int* dev_bufferBlockSumsA;
+ int* dev_bufferBlockSumsB;
+
+ cudaMalloc((void**)&dev_bufferBlockSumsA, sizeof(int) * n_blockSums);
+ checkCUDAError("cudaMalloc dev_bufferBlockSumsA failed!");
+
+ cudaMalloc((void**)&dev_bufferBlockSumsB, sizeof(int) * n_blockSums);
+ checkCUDAError("cudaMalloc dev_bufferBlockSumsB failed!");
+
+ int blocksPerGrid_blockSums = (n_blockSums + blockSize - 1) / blockSize;
+
+ kernelExtractBlockSums<<>>(n, n_blockSums, odata, dev_bufferBlockSumsA);
+ checkCUDAError("kernelExtractBlockSums failed!");
+
+ if (blocksPerGrid_blockSums > 1)
+ {
+ naiveInclusivePrefixSumAnyNumberOfBlocks(sharedMemoryBytes, n_blockSums, blocksPerGrid_blockSums, dev_bufferBlockSumsA, dev_bufferBlockSumsB);
+
+ kernelInclusiveToExclusivePrefixSum<<>>(n_blockSums, dev_bufferBlockSumsB, dev_bufferBlockSumsA);
+ checkCUDAError("kernelInclusiveToExclusivePrefixSum failed!");
+
+ kernelAddBlockSumsToBlockData<<>>(n, dev_bufferBlockSumsA, odata);
+ checkCUDAError("kernelAddBlockSumsToBlockData failed!");
+ }
+ else
+ {
+ kernelNaiveExclusivePrefixSumByBlock<<>>(n_blockSums, dev_bufferBlockSumsA, dev_bufferBlockSumsB);
+ checkCUDAError("kernelNaiveExclusivePrefixSumByBlock failed!");
+
+ kernelAddBlockSumsToBlockData<<>>(n, dev_bufferBlockSumsB, odata);
+ checkCUDAError("kernelAddBlockSumsToBlockData failed!");
+ }
+
+ cudaFree(dev_bufferBlockSumsA);
+ cudaFree(dev_bufferBlockSumsB);
+ checkCUDAError("cudaFree failed!");
+ }
}
}
-}
+}
\ No newline at end of file
diff --git a/stream_compaction/naive.h b/stream_compaction/naive.h
index 37dcb06..c447ce2 100644
--- a/stream_compaction/naive.h
+++ b/stream_compaction/naive.h
@@ -6,6 +6,24 @@ namespace StreamCompaction {
namespace Naive {
StreamCompaction::Common::PerformanceTimer& timer();
- void scan(int n, int *odata, const int *idata);
+ __global__ void kernelInclusiveToExclusivePrefixSum(const int n, const int* idata, int* odata);
+
+ __global__ void kernelNaiveInclusivePrefixSumPass(const int n, const int offset, const int* idata, int* odata);
+
+ __global__ void kernelNaiveInclusivePrefixSumByBlock(const int n, const int* idata, int* odata);
+
+ __global__ void kernelNaiveExclusivePrefixSumByBlock(const int n, const int* idata, int* odata);
+
+ __global__ void kernelExtractBlockSums(const int n, const int n_blockSums, const int* idata, int* odata);
+
+ __global__ void kernelAddBlockSumsToBlockData(const int n, const int* blockSums, int* data);
+
+ void scan(int n, int* odata, const int* idata, bool useSharedMemory);
+
+ void naiveExclusivePrefixSum(const int n, const int* idata, int* odata);
+
+ void naiveExclusivePrefixSumSharedMemory(const int n, const int* idata, int* odata);
+
+ void naiveInclusivePrefixSumAnyNumberOfBlocks(const int sharedMemoryBytes, const int n, const int blocksPerGrid, int* idata, int* odata);
}
}
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e..f500fc0 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -3,9 +3,17 @@
#include
#include
#include
+#include
#include "common.h"
#include "thrust.h"
+struct is_zero
+{
+ __device__ bool operator()(const int& x) const {
+ return x == 0;
+ }
+};
+
namespace StreamCompaction {
namespace Thrust {
using StreamCompaction::Common::PerformanceTimer;
@@ -18,11 +26,69 @@ namespace StreamCompaction {
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
+
+ int* dev_idata;
+ int* dev_odata;
+
+ cudaMalloc((void**)&dev_idata, sizeof(int) * n);
+ checkCUDAError("cudaMalloc dev_idata failed!");
+
+ cudaMalloc((void**)&dev_odata, sizeof(int) * n);
+ checkCUDAError("cudaMalloc dev_odata failed!");
+
+ cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+ checkCUDAError("memcpy idata to dev_idata failed!");
+
+ thrust::device_ptr dev_thrust_idata(dev_idata);
+ thrust::device_ptr dev_thrust_odata(dev_odata);
+
timer().startGpuTimer();
- // TODO use `thrust::exclusive_scan`
- // example: for device_vectors dv_in and dv_out:
- // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+
+
+ thrust::exclusive_scan(dev_thrust_idata, dev_thrust_idata + n, dev_thrust_odata);
+
+
timer().endGpuTimer();
+
+ cudaMemcpy(odata, dev_odata, sizeof(int) * n, cudaMemcpyDeviceToHost);
+ checkCUDAError("memcpy dev_odata to odata failed!");
+
+ cudaFree(dev_idata);
+ cudaFree(dev_odata);
+ checkCUDAError("cudaFree failed!");
+ }
+
+ int compact(int n, int* odata, const int* idata)
+ {
+ int* dev_data;
+
+ cudaMalloc((void**)&dev_data, sizeof(int) * n);
+ checkCUDAError("cudaMalloc dev_data failed!");
+
+ cudaMemcpy(dev_data, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+ checkCUDAError("memcpy idata to dev_data failed!");
+
+ thrust::device_ptr dev_thrust_data(dev_data);
+
+ timer().startGpuTimer();
+
+
+ thrust::device_ptr new_end = thrust::remove_if(dev_thrust_data, dev_thrust_data + n, is_zero());
+
+
+ timer().endGpuTimer();
+
+ // Calculate the number of elements remaining after removal
+ int compactedCount = new_end - dev_thrust_data;
+
+ cudaMemcpy(odata, dev_data, sizeof(int) * compactedCount, cudaMemcpyDeviceToHost);
+ checkCUDAError("memcpy dev_data to odata failed!");
+
+ // Free device memory
+ cudaFree(dev_data);
+ checkCUDAError("cudaFree failed!");
+
+ return compactedCount;
}
}
}
diff --git a/stream_compaction/thrust.h b/stream_compaction/thrust.h
index fe98206..d5c82ec 100644
--- a/stream_compaction/thrust.h
+++ b/stream_compaction/thrust.h
@@ -7,5 +7,7 @@ namespace StreamCompaction {
StreamCompaction::Common::PerformanceTimer& timer();
void scan(int n, int *odata, const int *idata);
+
+ int compact(int n, int* odata, const int* idata);
}
}