r - Most efficient way to loop through each observation in a data frame
I'm trying to find an efficient way to loop through a data frame and cluster the observations into groups of at least 5 people. For example, if I have:
group <- c(1,2,3,4,5,6,7,8,9,10)
people <- c(1,2,3,4,4,3,2,1,2,3)
avg_age <- c(5,10,15,20,25,30,35,40,45,50)
dat <- data.frame(group, people, avg_age)
This generates:
   group people avg_age
1      1      1       5
2      2      2      10
3      3      3      15
4      4      4      20
5      5      4      25
6      6      3      30
7      7      2      35
8      8      1      40
9      9      2      45
10    10      3      50
I'd like to create "clusters" of groups with at least 5 people in each, and compute the weighted average age of each "cluster." I'd like to do this in an efficient way, going through the data set and sequentially adding groups together until each "cluster" has at least 5 people. The data should then look like:
   group people age cluster tot_ppl avg_age
1      1      1   5       1       6   11.67
2      2      2  10       1       6   11.67
3      3      3  15       1       6   11.67
4      4      4  20       2       8   22.50
5      5      4  25       2       8   22.50
6      6      3  30       3       5   32.00
7      7      2  35       3       5   32.00
8      8      1  40       4       6   46.67
9      9      2  45       4       6   46.67
10    10      3  50       4       6   46.67
I'd like to do this on a dataset with 10,000 observations instead of 10. Has anyone thought of an efficient way of doing this?
Here's what I've got so far. However, some of the data samples I'm working with have closer to 2 million observations, so it can take quite a while to run...
data$cluster <- 0
count <- 0
while (min(data$cluster) == 0) { # alternative: while (max(data$cluster) <= 10)
  count <- count + 1
  data$cum <- ave(data$people, by = list(data$zipcode, data$cluster), FUN = cumsum)
  data$a <- floor(data$cum / 10)
  data$b <- data$cum - data$n1
  data$c <- floor(data$b / 10)
  data$cluster[data$c == 0] <- data$cluster[data$c == 0] + 1
}
extravars <- c('cum', 'a', 'b', 'c')
for (v in extravars) {
  eval(parse(text = paste("data$", v, " <- NULL", sep = "")))
}
data$tot_ppl <- ave(data$people, by = list(data$zipcode, data$cluster), FUN = sum)
data$cluster[data$tot_ppl < 10] <- data$cluster[data$tot_ppl < 10] + 1
data$tot_ppl <- ave(data$people, by = list(data$zipcode, data$cluster), FUN = sum)
data2 <- data
for (i in 3:(ncol(data2) - 3)) {
  data2$x <- data2[, i] * data2$tot_ppl
  data2$x <- ave(data2$x, by = list(data2$zipcode, data2$cluster), FUN = sum)
  data2$x <- round(data2$x / data2$tot_ppl, digits = 2)
  data2[, i] <- data2$x
}
data2$x <- NULL
So while this works, it takes a few hours to run, so if anyone knows a way to make it more efficient or improve it, I'd appreciate it. Thanks!
I can't think of a clever way to vectorize this operation, so I would just use a loop in R:
purer <- function(x, lim) {
  cs <- cumsum(x)
  newgroup <- rep(FALSE, length(x))
  prevsum <- 0
  for (i in 1:length(newgroup)) {
    if (cs[i] - prevsum >= lim) {
      newgroup[i] <- TRUE
      prevsum <- cs[i]
    }
  }
  return(1 + c(0, head(cumsum(newgroup), -1)))
}
purer(dat$people, 5)
# [1] 1 1 1 2 2 3 3 4 4 4
You can use the Rcpp package to speed up non-vectorized computations:
library(Rcpp)
rcpp <- cppFunction("
NumericVector rcpp(NumericVector x, const double limit) {
  NumericVector result(x.size());
  result[0] = 1;
  double acc = x[0];
  for (int i = 1; i < x.size(); ++i) {
    if (acc >= limit) {
      result[i] = result[i-1] + 1;
      acc = x[i];
    } else {
      result[i] = result[i-1];
      acc += x[i];
    }
  }
  return result;
}
")
rcpp(dat$people, 5)
# [1] 1 1 1 2 2 3 3 4 4 4
Finally, we can benchmark on a dataset with 10,000 observations:
set.seed(144)
dat2 <- dat[sample(1:nrow(dat), 10000, replace=TRUE),]
library(microbenchmark)
microbenchmark(purer(dat2$people, 5), rcpp(dat2$people, 5))
# Unit: microseconds
#                  expr      min       lq     mean   median       uq       max neval
# purer(dat2$people, 5) 7073.571 7287.733 8665.394 7822.639 8749.232 31313.946   100
#  rcpp(dat2$people, 5)   90.309   98.241  129.120  118.351  136.210   324.866   100
While the Rcpp code is more than 60x faster than the pure R implementation, the pure R implementation still runs in less than 10 milliseconds on a dataset of size 10,000, which may well be fine for you.
r loops vectorization