sorting - Parallel radix sort, how would this implementation actually work? Are there some heuristics? -
sorting - Parallel radix sort, how would this implementation actually work? Are there some heuristics? -
i working on udacity quiz parallel programming course. pretty stuck on how should start on assignment because not sure if understand correctly.
for assignment (in code) given 2 arrays , array on values , array of positions. supposed sort array of values parallelized radix sort, along setting positions correctly too.
i understand radix sort , how works. don't understand how want implemented it. here template given start assignment
//udacity hw 4 //radix sorting #include "reference_calc.cpp" #include "utils.h" /* reddish eye removal =============== assignment implementing reddish eye removal. accomplished first creating score every pixel tells how reddish eye pixel. have done - receiving scores , need sort them in ascending order know pixels alter remove reddish eye. note: ascending order == smallest largest each score associated position, when sort scores, must move positions accordingly. implementing parallel radix sort cuda ========================================== basic thought build histogram on each pass of how many of each "digit" there are. scan histogram know set output of each digit. example, first 1 must come after 0s have know how many 0s there able start moving 1s right position. 1) histogram of number of occurrences of each digit 2) exclusive prefix sum of histogram 3) determine relative offset of each digit illustration [0 0 1 1 0 0 1] -> [0 1 0 1 2 3 2] 4) combine results of steps 2 & 3 determine final output location each element , move there lsb radix sort out-of-place sort , need ping-pong values between input , output buffers have provided. create sure final sorted results end in output buffer! hint: may need re-create @ end. */ void your_sort(unsigned int* const d_inputvals, unsigned int* const d_inputpos, unsigned int* const d_outputvals, unsigned int* const d_outputpos, const size_t numelems) { } i don't understand how 4 steps end sorting array.
so first step, supposed create histogram of "digits" (why in quotes..?). given input value n need create count of 0's , 1's histogram. so, should step 1 create array of histograms, 1 each input value?
and well, rest of steps breaks downwards pretty quickly. show me how these steps supposed implement radix sort?
the basic thought behind radix sort consider each element sorted digit digit, to the lowest degree important significant. each digit, move elements those digits in increasing order.
let's take simple example. let's sort 4 quantities, each of have 4 binary digits. let's take 1, 4, 7, , 14. we'll mix them , visualize binary representation:
element # 1 2 3 4 value: 7 14 4 1 binary: 0111 1110 0100 0001 first consider bit 0:
element # 1 2 3 4 value: 7 14 4 1 binary: 0111 1110 0100 0001 bit 0: 1 0 0 1 now radix sort algorithm says must move elements in such way (considering bit 0) zeroes on left, , ones on right. let's while preserving order of elements 0 bit , preserving order of elements 1 bit. this:
element # 2 3 1 4 value: 14 4 7 1 binary: 1110 0100 0111 0001 bit 0: 0 0 1 1 the first step of our radix sort complete. next step consider next (binary) digit:
element # 3 2 1 4 value: 4 14 7 1 binary: 0100 1110 0111 0001 bit 1: 0 1 1 0 once again, must move elements digit in question (bit 1) arranged in ascending order:
element # 3 4 2 1 value: 4 1 14 7 binary: 0100 0001 1110 0111 bit 1: 0 0 1 1 now must move next higher digit:
element # 3 4 2 1 value: 4 1 14 7 binary: 0100 0001 1110 0111 bit 2: 1 0 1 1 and move them again:
element # 4 3 2 1 value: 1 4 14 7 binary: 0001 0100 1110 0111 bit 2: 0 1 1 1 now move lastly (highest order) digit:
element # 4 3 2 1 value: 1 4 14 7 binary: 0001 0100 1110 0111 bit 3: 0 0 1 0 and create our final move:
element # 4 3 1 2 value: 1 4 7 14 binary: 0001 0100 0111 1110 bit 3: 0 0 0 1 and values sorted. seems clear, in description far we've glossed on details of things "how know elements move?" , "how know set them?" let's repeat our example, we'll utilize specific methods , sequence suggested in prompt, in order reply these questions. starting on bit 0:
element # 1 2 3 4 value: 7 14 4 1 binary: 0111 1110 0100 0001 bit 0: 1 0 0 1 first let's build histogram of number of 0 bits in bit 0 position, , number of 1 bits in bit 0 position:
bit 0: 1 0 0 1 0 bits 1 bits --------- -------- 1)histogram: 2 2 now let's exclusive prefix-sum on these histogram values:
0 bits 1 bits --------- -------- 1)histogram: 2 2 2)prefix sum: 0 2 an exclusive prefix-sum sum of preceding values. there no preceding values in first position, , in sec position preceding value 2 (the number of elements 0 bit in bit 0 position). now, independent operation, let's determine relative offset of each 0 bit amongst 0 bits, , each 1 bit amongst 1 bits:
bit 0: 1 0 0 1 3)offset: 0 0 1 1 this can done programmatically using exclusive prefix-sums again, considering 0-group , 1-group separately, , treating each position if has value of 1:
0 bit 0: 1 1 3)ex. psum: 0 1 1 bit 0: 1 1 3)ex. psum: 0 1 now, step 4 of given algorithm says:
4) combine results of steps 2 & 3 determine final output location each element , move there
what means is, each element, select histogram-bin prefix sum value corresponding bit value (0 or 1) , add together that, offset associated position, determine location move element to:
element # 1 2 3 4 value: 7 14 4 1 binary: 0111 1110 0100 0001 bit 0: 1 0 0 1 hist psum: 2 0 0 2 offset: 0 0 1 1 new index: 2 0 1 3 moving each element "new index" position, have:
element # 2 3 1 4 value: 14 4 7 1 binary: 0111 1110 0111 0001 which result expect completion of our first digit-move, based on previous walk-through. has completed step 1, i.e. first (least-significant) digit; still have remaining digits process, creating new histogram , new prefix sums @ each step.
notes:
radix-sort, in computer, not have done based strictly on binary digits. it's possible build similar algorithm digits of different sizes, perhaps consisting of 2,3, or 4 bits. one of optimizations can perform on radix sort sort based on number of digits meaningful. example, if storing quantities in 32-bit values, know largest quantity nowadays 1023 (2^10-1), need not sort on 32 bits. can stop, expecting proper sort, after proceeding through first 10 bits. what of have gpus? in far above description goes, not much. practical application consider using parallel algorithms things histogram, prefix-sums, , info movement. decomposition of radix-sort allows 1 locate , utilize parallel algorithms developed these more basic operations, in order build fast parallel sort.what follows worked example. may help understanding of radix sort. don't think help assignment, because illustration performs 32-bit radix sort @ warp level, single warp, ie. 32 quantities. possible advantage understanding point of view things histogramming , prefix sums can done @ warp level in few instructions, taking advantage of various cuda intrinsics. assignment, won't able utilize these techniques, , need come full-featured parallel prefix sums, histograms, etc. can operate on arbitrary dataset size.
#include <stdio.h> #include <stdlib.h> #define wsize 32 #define loops 100000 #define upper_bit 31 #define lower_bit 0 __device__ unsigned int ddata[wsize]; // naive warp-level bitwise radix sort __global__ void mykernel(){ __shared__ volatile unsigned int sdata[wsize*2]; // load global shared variable sdata[threadidx.x] = ddata[threadidx.x]; unsigned int bitmask = 1<<lower_bit; unsigned int offset = 0; unsigned int thrmask = 0xffffffffu << threadidx.x; unsigned int mypos; // each lsb msb (int = lower_bit; <= upper_bit; i++){ unsigned int mydata = sdata[((wsize-1)-threadidx.x)+offset]; unsigned int mybit = mydata&bitmask; // population of ones , zeroes (cc 2.0 ballot) unsigned int ones = __ballot(mybit); // cc 2.0 unsigned int zeroes = ~ones; offset ^= wsize; // switch ping-pong buffers // zeroes, ones if (!mybit) // threads 0 bit // position in ping-pong buffer mypos = __popc(zeroes&thrmask); else // threads 1 bit // position in ping-pong buffer mypos = __popc(zeroes)+__popc(ones&thrmask); // move buffer (or utilize shfl cc 3.0) sdata[mypos-1+offset] = mydata; // repeat next bit bitmask <<= 1; } // save results global ddata[threadidx.x] = sdata[threadidx.x+offset]; } int main(){ unsigned int hdata[wsize]; (int lcount = 0; lcount < loops; lcount++){ unsigned int range = 1u<<upper_bit; (int = 0; < wsize; i++) hdata[i] = rand()%range; cudamemcpytosymbol(ddata, hdata, wsize*sizeof(unsigned int)); mykernel<<<1, wsize>>>(); cudamemcpyfromsymbol(hdata, ddata, wsize*sizeof(unsigned int)); (int = 0; < wsize-1; i++) if (hdata[i] > hdata[i+1]) {printf("sort error @ loop %d, hdata[%d] = %d, hdata[%d] = %d\n", lcount,i, hdata[i],i+1, hdata[i+1]); homecoming 1;} // printf("sorted data:\n"); //for (int = 0; < wsize; i++) printf("%u\n", hdata[i]); } printf("success!\n"); homecoming 0; } sorting cuda parallel-processing
Comments
Post a Comment