
 
5 hours later…
 
1 hour later…
7:11 AM
Good morning @all
 
Hello
 
@Queen 4cv
 
zx8754 in r scanned 1011 questions between Jun 16 09:05 and Nov 22 08:28 filtered and ordered: 20 in batch 35
 
@Queen Done
 
zx8754 Thank you for your effort, you reviewed 20 questions, I counted 17 (85%) close votes and 13 questions closed
 
hello hello :-)
 
8:01 AM
library(dplyr)  # old dplyr (< 0.7), whose SE verbs rely on lazyeval

licz <- function(dat, trait) {
  dat %>%
    group_by_(~type) %>%
    summarise_(sr = lazyeval::interp(~mean(x, na.rm = TRUE), x = as.name(trait)))
}

licz(df, 'D10')
 
@Axeman phew, that looks easy :)
please post as a comment, OP must be dying too.
Did you have to look up the manuals, or ?
 
I really like the simplicity of dplyr NSE, but the SE variants can be painful (this is relatively ok).
No, I've done this many times
 
good morning everyone
 
Once you realize you should build the call with interp it's quite ok. It becomes more challenging when you want your function to also use NSE
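(A minimal sketch of the pattern Axeman describes, using the old dplyr (< 0.7) SE verbs plus lazyeval; licz2 is a hypothetical name and the type column is carried over from the example above, not code from the thread:)

library(dplyr)
library(lazyeval)

# like licz() above, but also callable with a bare column name (NSE)
licz2 <- function(dat, trait) {
  trait <- as.character(substitute(trait))  # captures D10 and "D10" alike
  dat %>%
    group_by_(~type) %>%
    summarise_(sr = interp(~mean(x, na.rm = TRUE), x = as.name(trait)))
}

# licz2(df, D10) and licz2(df, 'D10') then give the same result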
 
@Axeman if you can improve the question then this post can stay open and deserves an answer?
 
8:09 AM
@zx8754 I've answered a very similar question before, I can look it up for a dupe
found it
 
Heh, we can go through all NSE posts and link them all to each other as dupes
 
morning peeps
 
9:28 AM
Hello
 
10:32 AM
hello
 
hi
 
For crying out loud... close
 
Just killed a meta question :p
 
11:32 AM
Please reopen: stackoverflow.com/q/40735175/1412059 OP has improved their question.
 
12:13 PM
@Roland open
 
 
1 hour later…
1:21 PM
@DavidArenburg so that's what NA^... does
 
NA^ part is awesome.
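(For anyone puzzled by the trick: NA^0 is defined as 1 in R, while NA^1 is NA, so NA^cond maps a logical vector to 1/NA without an ifelse(). A tiny illustration, not code from the thread:)

cond <- c(TRUE, FALSE, TRUE)
NA^cond   # NA  1 NA  -- NA^TRUE is NA, NA^FALSE (i.e. NA^0) is 1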
 
2:58 PM
@Cath Ok, removed that part. Now remove your comments and go away — akrun 1 min ago
flagged as rude
 
@RonakShah lol that's nice of you (so you're the upvote on my first comment :-) ) it is so annoying to know he was waiting to understand what OP wanted, then he just sees my answer and finds in 2 sec another way of doing the same thing, because this is what he's good at
@RonakShah your flag probably returned helpful :-)
 
I kind of want to rollback his edits to leave the obvious copying/pasting part...
 
@Cath yes, he obviously is good at it...Try doing the same with him and you'll see how furious he becomes.
btw I am also the upvote on your answer ;-)
 
@RonakShah that's for sure !
@RonakShah thanks :-) (I'm like 2 points from my own hammer now I think ! :-) )
 
3:06 PM
@Cath that would be one big achievement...Would definitely be helpful to avoid answers on duplicate questions like these
Q: R data frame subsetting based on a column value frequency threshold (score: 0)
danbret: I am a new R user and this is my first question submission (hopefully in compliance with the protocol). I have a data frame with two columns. df <- data.frame(v1 = c("A", "A", "B", "B", "B", "B", "C", "D", "D", "E" )) dfc <- df %>% count(v1) df$n <- with(dfc, n[match(df$v1,v1)]) v1 n 1 ...
 
yes, that's a "generic" Q/A... I think he just cannot help it... (still answering dupes)
he would be madder once I join the hammer team :-)
 
Well, I think we should try to unfocus from him, just let him be and act however he pleases. Just flag inappropriate content (comments) and do the usual close/dupe when needed. And ignore him if he calls after you for a closure.
I feel like stepping into David's shoes
 
@Tensibai hey @Tens :-)
 
Hello @Cath :)
 
yep, problem is, when akrun asks OP for desired output, then I answer, and now that he knows what needs to be computed, he posts a different answer, but with a part copied/pasted, it just annoys me so much... argh... ;-)
 
3:15 PM
I saw that, just ignore him, that won't change the face of the internet
and even less of the world
 
lol indeed and it will make me live less long :-/
 
Just don't give him what he wants...
Attention/Upvotes :-P
 
Wooohoo, I should not have used 1e7 as the source for this microbenchmark...
 
hmm comments were cleaned, a mod must have been passing by
 
Unit: milliseconds
               expr        min         lq       mean     median         uq        max neval    cld
          akrun(df)   487.2707   510.8357   546.1012   521.3872   537.4420   700.1267    10 a
 GGrothendieck4(df)   561.5427   576.1987   604.9359   589.0564   609.5453   696.4785    10 ab
 GGrothendieck3(df)   574.0275   584.4956   636.6241   612.6072   695.6640   729.3624    10 ab
 GGrothendieck2(df)   660.0946   712.2549   731.0311   741.4083   758.0153   759.0544    10  b
 NineHeightNine(df)  1204.6817  1223.2428  1253.4177  1250.8204  1277.2377  1340.0811    10   c
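(Roughly how such a table is produced: each candidate solution is wrapped as a function of df — their bodies aren't shown in the thread — and passed to microbenchmark:)

library(microbenchmark)
microbenchmark(
  akrun(df),
  GGrothendieck4(df),
  GGrothendieck3(df),
  GGrothendieck2(df),
  NineHeightNine(df),
  times = 10
)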
 
3:18 PM
who is nineightnine ?
 
989 is his/her nickname; as that wouldn't work as a function name, I spelled out the numbers
From the NA^ above
 
another new answer ? I don't think I saw this one
hmm actually I did :-/
@Tens nice benchmarks, have you tried with more columns ?
 
Nope
 
Pardon for the language but I feel like I can't work anymore today
so what's up peeps
 
@Tensibai I think that would be interesting, with a varying proportion of NAs (something like the sketch below)
@DavidArenburg hey Dave :-)
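(One way to build such test data — a hedged sketch; make_na_df is a made-up helper name, not something from the thread:)

# nrow x ncol numeric data frame with a given share of NAs
make_na_df <- function(nrow, ncol, na_prop = 0.1) {
  m <- matrix(runif(nrow * ncol), ncol = ncol)
  m[sample(length(m), round(na_prop * length(m)))] <- NA
  as.data.frame(m)
}

df <- make_na_df(1e4, 26, na_prop = 0.25)  # e.g. 25% NAs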
 
3:22 PM
@Cath hey
 
3 answers already
 
@RonakShah already did :-/
 
Wow: alexis_laz's version with complete.cases beat everything
 
@Tensibai but he didn't post...
 
He did comment under G. Grothendieck's post
 
3:25 PM
@Tensibai yep I saw that but the answer wasn't added (or was it ?)
 
yeah, his versions are always the best
 
@Tensibai you didn't put the benchmarks for complete.cases did you ?
 
though many of these answers don't reach the desired output exactly
 
Just added to the benchmark (on 1e6 rows, too lazy to rerun on 1e7)
@DavidArenburg I did think about using as.integer to get a numeric output but, even with it, it won't match the exact desired output indeed
 
@Tensibai as you have all functions etc, you mind checking with like 10^5 columns ?
 
3:28 PM
I actually hadn't expected this answer/question to draw so much attention. I posted it in comments because I thought it was just a dupe
 
@Tensibai nvm I can have it too ;-p
 
@Tensibai If that top_n answer got an upvote I really think it's unfair nurka got two downvotes
 
@DavidArenburg ???
 
@Tensibai I'm just agreeing with you...
 
Oh, I didn't check which message you were referring to
 
3:34 PM
@DavidArenburg it would probably have had a DV too if nurka's downvoter had seen it, but the other answers were posted some time after nurka's
 
@Cath on 26 columns of 1e6 rows:
Unit: milliseconds
               expr        min         lq       mean     median         uq        max neval     cld
     alexis_laz(df)   90.93923   91.91509   93.42776   92.75552   95.31161   97.16838    10 a
 GGrothendieck2(df)  248.88182  260.67500  316.89156  318.07621  365.61113  391.87275    10  b
 GGrothendieck3(df)  470.91760  484.56498  504.83493  491.27278  520.24358  608.75102    10   c
 GGrothendieck4(df)  484.63140  498.03247  519.47379  505.21580  526.98018  630.48404    10   c
 
now compare outputs
 
@Tensibai hmm I bet things would change a bit more with like 10^3 rows and 10^4 columns
c(NA, 1)[(complete.cases(df))+1] works I think. Why did alexis negate the complete.cases ?
 
or NA^!complete.cases(df)
I'm on fire with the NA abuse today
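(The two constructions give the same per-row flag; a tiny check, not code from the thread:)

df <- data.frame(a = c(1, NA, 3), b = c(1, 2, NA))
cc <- complete.cases(df)   # TRUE FALSE FALSE
c(NA, 1)[cc + 1]           # 1 NA NA -- complete rows index element 2
NA^!cc                     # 1 NA NA -- same values, no indexing needed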
 
@DavidArenburg probably more time-greedy, no ?
 
3:40 PM
@Cath time-greedy?
maybe because of the !, don't know
Elections end in 4 hours btw
so stay tuned
 
@DavidArenburg yeah I have no idea what I'm talking about (I just thought your sol would be the fastest and as it's not, I just guessed ^ must take some time...)
 
@erasmortg Hi, mate. Good to see you
 
@DavidArenburg yep, voted for Bhargav, ArtOf and Andy...
 
@Cath my solution has both matrix conversion and unnecessary rowSums
 
@Cath Any hint to produce such a DF in a few characters?
 
3:43 PM
As I said, I thought it was going to be closed as a dupe within seconds so didn't put much thinking into it.
 
@Tensibai I did:
df <- data.frame(matrix(round(runif(1000*1000, 1, 100)), ncol=1000))
df <- as.data.frame(apply(df, 2, function(x) {x[sample(1:1000, 10, replace=FALSE)] <- NA; x}))
but it's quite lame...
and 10 NAs per column is terrible (I put 1000x1000 because I have just 4GB of RAM on my "regular" pc)
 
df <- as.data.frame(matrix(sample(1:5, 1e7, replace = TRUE), ncol = 1e4)) ?
Or am I missing something
ok, going home, cya lads
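(David's one-liner contains no NAs; if the benchmark data should have some, a hedged variant is to sample NA in with the values:)

df <- as.data.frame(matrix(sample(c(1:5, NA), 1e7, replace = TRUE), ncol = 1e4))
# roughly 1 cell in 6 becomes NA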
 
hi all, bye David
what's this thing y'all are benchmarking on?
 
@DavidArenburg nice evening :-)
I get an error for do.call(pmax...) (and pmin... as well)
 
3:48 PM
ah ok thanks
yeesh, that's a lot of benchmarkables in Tens' answer
 
hi @Frank
 
hiya Cath
 
Just killed my rstudio and Chrome at the same time running:
 
@Tens, I get:
 
@Frank you came late...you can join them now and show us some other approach :P
 
3:51 PM
Maybe I should turn this answer community wiki
 
@user2100721 heh, i think complete.cases is fast enough :)
fyi, similar q from yesterday:
Q: fastest way to count the number of rows in a data frame that has at least one NA (score: -1)
ftxx: When you have the data set, usually you want to see what is the fraction of rows that has at least one NA (or missing value) in the data set. In R, what I did is the following: TR = apply(my_data,1,anyNA) sum(TR)/length(TR) But I found that if my data set has 1 million rows, it takes some tim...
 
             expr         min          lq        mean      median          uq         max neval cld
     alexis_laz(df)    5.032864    5.041418    5.823552    5.054080    5.229963   12.327915    10 a
          David(df)    5.777459    5.826049    6.763159    6.317257    6.668168    9.989077    10 a
 GGrothendieck2(df)    5.995431    6.766718    7.262373    7.175116    7.735102    9.140458    10 a
 GGrothendieck5(df)   35.914074   38.091742   48.423107   43.997178   47.047077   99.019229    10 ab
@Tens, to make your answer "perfect", you also need a benchmark for "lots of columns" and probably more than 10 repeats
 
@Cath working on it
 
:-)
@Tensibai btw you misspelled "eight" (you put "height" instead) and now 989 is complaining ;-)
and I'm off
have a nice evening all :-) (or full day for some !!)
 
cya Cath, you too
 
4:22 PM
It takes forever to benchmark on a 1e5 columns by 1e3 rows df
 
@DavidArenburg hi! Yeah, work has been crazy lately, barely enough time for code/SO, mostly about answering emails, how about you? :)
 
@Tensibai if you know that one of the benchmarkables is gonna be a lot slower than the others, could drop it, only comparing the top contenders, eh
 
@Frank I've no idea what the result would be with 1e5 columns
 
ok
 
