sorting - Parallel radix sort, how would this implementation actually work? Are there some heuristics? -

i working on udacity quiz parallel programming course. pretty stuck on how should start on assignment because not sure if understand correctly.

for assignment (in code) given 2 arrays , array on values , array of positions. supposed sort array of values parallelized radix sort, along setting positions correctly too.

i understand radix sort , how works. don't understand how want implemented it. here template given start assignment

//udacity hw 4 //radix sorting  #include "reference_calc.cpp" #include "utils.h"  /*  reddish eye removal    ===============     assignment implementing  reddish eye removal.     accomplished first creating score every pixel tells how     reddish eye pixel.  have done -    receiving scores , need sort them in ascending order    know pixels alter remove  reddish eye.     note: ascending order == smallest largest     each score associated position, when sort scores, must    move positions accordingly.     implementing parallel radix sort cuda    ==========================================     basic  thought  build histogram on each pass of how many of each    "digit" there are.   scan histogram know  set    output of each digit.  example, first 1 must come after    0s have know how many 0s there able start moving 1s     right position.     1) histogram of number of occurrences of each digit    2) exclusive prefix sum of histogram    3) determine relative offset of each digit          illustration [0 0 1 1 0 0 1]                 ->  [0 1 0 1 2 3 2]    4) combine results of steps 2 & 3 determine final       output location each element , move there     lsb radix sort out-of-place sort , need ping-pong values    between input , output buffers have provided.   create sure final    sorted results end in output buffer!  hint: may need  re-create    @ end.   */   void your_sort(unsigned int* const d_inputvals,                unsigned int* const d_inputpos,                unsigned int* const d_outputvals,                unsigned int* const d_outputpos,                const size_t numelems) {  }

i don't understand how 4 steps end sorting array.

so first step, supposed create histogram of "digits" (why in quotes..?). given input value n need create count of 0's , 1's histogram. so, should step 1 create array of histograms, 1 each input value?

and well, rest of steps breaks downwards pretty quickly. show me how these steps supposed implement radix sort?

the basic thought behind radix sort consider each element sorted digit digit, to the lowest degree important significant. each digit, move elements those digits in increasing order.

let's take simple example. let's sort 4 quantities, each of have 4 binary digits. let's take 1, 4, 7, , 14. we'll mix them , visualize binary representation:

element # 1 2 3 4 value: 7 14 4 1 binary: 0111 1110 0100 0001

first consider bit 0:

element #    1       2       3       4 value:       7       14      4       1 binary:      0111    1110    0100    0001 bit 0:       1       0       0       1

now radix sort algorithm says must move elements in such way (considering bit 0) zeroes on left, , ones on right. let's while preserving order of elements 0 bit , preserving order of elements 1 bit. this:

element #    2       3       1       4 value:       14      4       7       1 binary:      1110    0100    0111    0001 bit 0:       0       0       1       1

the first step of our radix sort complete. next step consider next (binary) digit:

element #    3       2       1       4 value:       4       14      7       1 binary:      0100    1110    0111    0001 bit 1:       0       1       1       0

once again, must move elements digit in question (bit 1) arranged in ascending order:

element #    3       4       2       1 value:       4       1       14      7 binary:      0100    0001    1110    0111 bit 1:       0       0       1       1

now must move next higher digit:

element #    3       4       2       1 value:       4       1       14      7 binary:      0100    0001    1110    0111 bit 2:       1       0       1       1

and move them again:

element #    4       3       2       1 value:       1       4       14      7 binary:      0001    0100    1110    0111 bit 2:       0       1       1       1

now move lastly (highest order) digit:

element #    4       3       2       1 value:       1       4       14      7 binary:      0001    0100    1110    0111 bit 3:       0       0       1       0

and create our final move:

element #    4       3       1       2 value:       1       4       7       14 binary:      0001    0100    0111    1110 bit 3:       0       0       0       1

and values sorted. seems clear, in description far we've glossed on details of things "how know elements move?" , "how know set them?" let's repeat our example, we'll utilize specific methods , sequence suggested in prompt, in order reply these questions. starting on bit 0:

element #    1       2       3       4 value:       7       14      4       1 binary:      0111    1110    0100    0001 bit 0:       1       0       0       1

first let's build histogram of number of 0 bits in bit 0 position, , number of 1 bits in bit 0 position:

bit 0:       1       0       0       1                0 bits       1 bits               ---------       -------- 1)histogram:         2              2

now let's exclusive prefix-sum on these histogram values:

              0 bits       1 bits               ---------       -------- 1)histogram:         2              2 2)prefix sum:        0              2

an exclusive prefix-sum sum of preceding values. there no preceding values in first position, , in sec position preceding value 2 (the number of elements 0 bit in bit 0 position). now, independent operation, let's determine relative offset of each 0 bit amongst 0 bits, , each 1 bit amongst 1 bits:

bit 0: 1 0 0 1 3)offset: 0 0 1 1

this can done programmatically using exclusive prefix-sums again, considering 0-group , 1-group separately, , treating each position if has value of 1:

0 bit 0:             1       1        3)ex. psum:          0       1      1 bit 0:     1                        1       3)ex. psum:  0                        1

now, step 4 of given algorithm says:

4) combine results of steps 2 & 3 determine final output location each element , move there

what means is, each element, select histogram-bin prefix sum value corresponding bit value (0 or 1) , add together that, offset associated position, determine location move element to:

element #    1       2       3       4 value:       7       14      4       1 binary:      0111    1110    0100    0001 bit 0:       1       0       0       1 hist psum:   2       0       0       2 offset:      0       0       1       1 new index:   2       0       1       3

moving each element "new index" position, have:

element # 2 3 1 4 value: 14 4 7 1 binary: 0111 1110 0111 0001

which result expect completion of our first digit-move, based on previous walk-through. has completed step 1, i.e. first (least-significant) digit; still have remaining digits process, creating new histogram , new prefix sums @ each step.

notes:

radix-sort, in computer, not have done based strictly on binary digits. it's possible build similar algorithm digits of different sizes, perhaps consisting of 2,3, or 4 bits. one of optimizations can perform on radix sort sort based on number of digits meaningful. example, if storing quantities in 32-bit values, know largest quantity nowadays 1023 (2^10-1), need not sort on 32 bits. can stop, expecting proper sort, after proceeding through first 10 bits. what of have gpus? in far above description goes, not much. practical application consider using parallel algorithms things histogram, prefix-sums, , info movement. decomposition of radix-sort allows 1 locate , utilize parallel algorithms developed these more basic operations, in order build fast parallel sort.

what follows worked example. may help understanding of radix sort. don't think help assignment, because illustration performs 32-bit radix sort @ warp level, single warp, ie. 32 quantities. possible advantage understanding point of view things histogramming , prefix sums can done @ warp level in few instructions, taking advantage of various cuda intrinsics. assignment, won't able utilize these techniques, , need come full-featured parallel prefix sums, histograms, etc. can operate on arbitrary dataset size.

#include <stdio.h> #include <stdlib.h> #define wsize 32 #define loops 100000 #define upper_bit 31 #define lower_bit 0  __device__ unsigned int ddata[wsize];  // naive warp-level bitwise radix sort  __global__ void mykernel(){   __shared__ volatile unsigned int sdata[wsize*2];   // load global shared variable   sdata[threadidx.x] = ddata[threadidx.x];   unsigned int bitmask = 1<<lower_bit;   unsigned int offset  = 0;   unsigned int thrmask = 0xffffffffu << threadidx.x;   unsigned int mypos;   //  each lsb msb   (int = lower_bit; <= upper_bit; i++){     unsigned int mydata = sdata[((wsize-1)-threadidx.x)+offset];     unsigned int mybit  = mydata&bitmask;     // population of ones , zeroes (cc 2.0 ballot)     unsigned int ones = __ballot(mybit); // cc 2.0     unsigned int zeroes = ~ones;     offset ^= wsize; // switch ping-pong buffers     // zeroes, ones     if (!mybit) // threads 0 bit       // position in ping-pong buffer       mypos = __popc(zeroes&thrmask);     else        // threads 1 bit       // position in ping-pong buffer       mypos = __popc(zeroes)+__popc(ones&thrmask);     // move buffer  (or  utilize shfl cc 3.0)     sdata[mypos-1+offset] = mydata;     // repeat next bit     bitmask <<= 1;     }   // save results global   ddata[threadidx.x] = sdata[threadidx.x+offset];   }  int main(){    unsigned int hdata[wsize];   (int lcount = 0; lcount < loops; lcount++){     unsigned int range = 1u<<upper_bit;     (int = 0; < wsize; i++) hdata[i] = rand()%range;     cudamemcpytosymbol(ddata, hdata, wsize*sizeof(unsigned int));     mykernel<<<1, wsize>>>();     cudamemcpyfromsymbol(hdata, ddata, wsize*sizeof(unsigned int));     (int = 0; < wsize-1; i++) if (hdata[i] > hdata[i+1]) {printf("sort error @ loop %d, hdata[%d] = %d, hdata[%d] = %d\n", lcount,i, hdata[i],i+1, hdata[i+1]);  homecoming 1;}     // printf("sorted data:\n");     //for (int = 0; < wsize; i++) printf("%u\n", hdata[i]);     }   printf("success!\n");    homecoming 0; }

sorting cuda parallel-processing

Search This Blog

Jaimee

sorting - Parallel radix sort, how would this implementation actually work? Are there some heuristics? -

Comments

Post a Comment

Popular posts from this blog

javascript - THREE.js reposition vertices for RingGeometry -

javascript - I need to update the text of a paragraph by inline edit -

assembly - What is the addressing mode for ld, add, and rjmp instructions? -