r - Most efficient way to loop through each observation in a data frame
I'm trying to find an efficient way to loop through a data frame and cluster the observations into groups of at least 5 people. For example, if I have:
group <- c(1,2,3,4,5,6,7,8,9,10)
people <- c(1,2,3,4,4,3,2,1,2,3)
avg_age <- c(5,10,15,20,25,30,35,40,45,50)
dat <- data.frame(group, people, avg_age)
This generates:
   group people avg_age
1      1      1       5
2      2      2      10
3      3      3      15
4      4      4      20
5      5      4      25
6      6      3      30
7      7      2      35
8      8      1      40
9      9      2      45
10    10      3      50
I'd like to create "clusters" of groups with at least 5 people in each, and compute the weighted average age of each "cluster." I'd like to do this in an efficient way, going through the data set and sequentially adding groups together until each "cluster" has at least 5 people. The data should then look like:
   group people age cluster tot_ppl avg_age
1      1      1   5       1       6   11.67
2      2      2  10       1       6   11.67
3      3      3  15       1       6   11.67
4      4      4  20       2       8   22.50
5      5      4  25       2       8   22.50
6      6      3  30       3       5   32.00
7      7      2  35       3       5   32.00
8      8      1  40       4       6   46.67
9      9      2  45       4       6   46.67
10    10      3  50       4       6   46.67
I'd like to do this on a dataset with 10,000 observations instead of 10. Has anyone thought of an efficient way of doing this?
Here's what I've got so far. However, some of the data samples I'm working with have closer to 2 million observations, so it can take quite a while to run...
data$cluster <- 0
count <- 0
while (min(data$cluster) == 0) { # alternative: while (max(data$cluster) <= 10)
  count <- count + 1
  data$cum <- ave(data$people, by = list(data$zipcode, data$cluster), FUN = cumsum)
  data$a <- floor(data$cum / 10)
  data$b <- data$cum - data$n1
  data$c <- floor(data$b / 10)
  data$cluster[data$c == 0] <- data$cluster[data$c == 0] + 1
}
extravars <- c('cum', 'a', 'b', 'c')
for (v in extravars) {
  eval(parse(text = paste("data$", v, " <- NULL", sep = "")))
}
data$tot_ppl <- ave(data$people, by = list(data$zipcode, data$cluster), FUN = sum)
data$cluster[data$tot_ppl < 10] <- data$cluster[data$tot_ppl < 10] + 1
data$tot_ppl <- ave(data$people, by = list(data$zipcode, data$cluster), FUN = sum)
data2 <- data
for (i in 3:(ncol(data2) - 3)) {
  data2$x <- data2[, i] * data2$tot_ppl
  data2$x <- ave(data2$x, by = list(data2$zipcode, data2$cluster), FUN = sum)
  data2$x <- round(data2$x / data2$tot_ppl, digits = 2)
  data2[, i] <- data2$x
}
data2$x <- NULL
So while this works, it takes a few hours to run, so if anyone knows a way to make it more efficient or improve it, I'd appreciate it. Thanks!
I can't think of a clever way to vectorize this operation, so I would just use a loop in R:
purer <- function(x, lim) {
  cs <- cumsum(x)
  newgroup <- rep(FALSE, length(x))
  prevsum <- 0
  for (i in 1:length(newgroup)) {
    if (cs[i] - prevsum >= lim) {
      newgroup[i] <- TRUE
      prevsum <- cs[i]
    }
  }
  return(1 + c(0, head(cumsum(newgroup), -1)))
}
purer(dat$people, 5)
# [1] 1 1 1 2 2 3 3 4 4 4
You can use the Rcpp package to speed up non-vectorized computations:
library(Rcpp)
rcpp <- cppFunction("
NumericVector rcpp(NumericVector x, const double limit) {
  NumericVector result(x.size());
  result[0] = 1;
  double acc = x[0];
  for (int i = 1; i < x.size(); ++i) {
    if (acc >= limit) {
      result[i] = result[i-1] + 1;
      acc = x[i];
    } else {
      result[i] = result[i-1];
      acc += x[i];
    }
  }
  return result;
}
")
rcpp(dat$people, 5)
# [1] 1 1 1 2 2 3 3 4 4 4
Finally, we can benchmark on a dataset with 10,000 observations:
set.seed(144)
dat2 <- dat[sample(1:nrow(dat), 10000, replace=TRUE),]
library(microbenchmark)
microbenchmark(purer(dat2$people, 5), rcpp(dat2$people, 5))
# Unit: microseconds
#                  expr      min       lq     mean   median       uq       max neval
# purer(dat2$people, 5) 7073.571 7287.733 8665.394 7822.639 8749.232 31313.946   100
#  rcpp(dat2$people, 5)   90.309   98.241  129.120  118.351  136.210   324.866   100
While the Rcpp code is more than 60x faster than the pure R implementation, the pure R implementation still runs in less than 10 milliseconds on a dataset of size 10,000, which may well be fine for you.
r loops vectorization