What does this mean - Levene's test is not appropriate with quantitative explanatory variables.(r/AskStatistics)

submitted 10 hours ago by tukeysbinges to r/AskStatistics

somkoala • 2 points • submitted 6 hours ago

Levene’s test is about testing equality of variances for a given variable between groups split by a categorical variable (e.g. gender, geography). The variable by which you split is the explanatory variable. It seems that you are using a numeric one. You have two options - change your variable into a categorical one (if it really represents buckets) or use techniques for exploring relationships between quantitative variables.
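To make the "groups split by a categorical variable" point concrete, here is a rough base R sketch. The data are made up, and the test is computed by hand as a one-way ANOVA on absolute deviations from each group's median (the Brown-Forsythe flavour of Levene's test), so treat it as an illustration rather than the exact routine any particular package runs.

```r
# Toy data (made up): a numeric response and a categorical grouping variable.
set.seed(1)
group <- factor(rep(c("a", "b", "c"), each = 20))
y <- rnorm(60, mean = 10, sd = rep(c(1, 1, 3), each = 20))

# Levene-type test by hand: one-way ANOVA on the absolute deviations
# of each observation from its own group's median.
abs_dev <- abs(y - ave(y, group, FUN = median))
levene_fit <- anova(lm(abs_dev ~ group))
levene_p <- levene_fit[["Pr(>F)"]][1]
print(levene_fit)

# A numeric explanatory variable has no groups to compare variances across.
# If it really represents buckets, it can be binned into a factor first.
x_numeric <- runif(60, min = 0, max = 100)
x_binned <- cut(x_numeric, breaks = c(0, 33, 66, 100), include.lowest = TRUE)
table(x_binned)
```

The cut points 0/33/66/100 are arbitrary here; in practice they should come from what the buckets actually mean.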

tukeysbinges • 1 point • submitted 4 hours ago

that makes sense, thanks!

MrLegilimens

Ph.D.* Social Psychology

• 1 point • submitted 9 hours ago

“I got an error in my code but i didn’t think my code or data was relevant”?

I haven’t done t.test stuff in a while, but it sounds like it wants the variable as a factor, or the variable has more than two values (not just 0/1).

tukeysbinges • 1 point • submitted 9 hours ago

yeah it's pretty stupid i admit, I just thought that the problem was contained with what I'd written.

I'm just saying what i thought, not what's right , though :')

No worries though

tukeysbinges commented on a post in r/AskStatistics

Hypothesis Testing(r/AskStatistics)

submitted 16 hours ago by okonomiyachi to r/AskStatistics

tukeysbinges • 2 points • submitted 9 hours ago

I can't help with much I'm afraid, but I can tell you that

paired t test - think about before and after with this. As an example, if you're testing whether a group of dogs is happier in the winter or the summer, it would be paired if it's the same group of dogs tested at the two different times.

1 sample t test - this is when you're testing a sample against a mean you already know, so if you know the mean of something is 9 or whatever, and you have some sample, then you can run a 1 sample test to see if there's a significant difference between the sample mean and the known mean

CI - from what i recall, the main thing with a CI for a difference is whether or not zero is contained in it: if zero is inside the interval, the difference isn't significant at the matching level.

no one else has commented - take what i say with a pinch of salt as i'm only learning like you.

just thought i'd mention those
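A small R sketch of the three ideas above, using made-up numbers (the dog scores and the sample tested against 9 are invented for illustration):

```r
# Made-up happiness scores for the same six dogs in winter and in summer.
winter <- c(5.1, 6.0, 4.8, 7.2, 5.5, 6.1)
summer <- c(6.3, 6.8, 5.0, 7.9, 6.4, 6.0)

# Paired t-test: the same subjects measured twice.
paired_res <- t.test(summer, winter, paired = TRUE)

# A paired test is the same thing as a one-sample test on the differences.
diff_res <- t.test(summer - winter, mu = 0)

# One-sample t-test: a sample compared against a known mean (9 here).
one_res <- t.test(c(8.2, 9.5, 10.1, 8.8, 9.9, 10.4), mu = 9)

# The 95% CI for the mean difference; check whether 0 falls inside it.
ci <- paired_res$conf.int
zero_inside <- ci[1] <= 0 && ci[2] >= 0
print(ci)
print(paired_res$p.value)
```

For a two-sided test at the 5% level, zero being inside the 95% interval goes together with a p-value above 0.05.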

okonomiyachi • 1 point • submitted 8 hours ago

Thank you so much!

tukeysbinges • 1 point • submitted 7 hours ago

no worries :)

might be worth asking another question with a specific example and explain what's confusing you, as this one might be a bit vague for some people on here, idk

Levenes test for variance - if the variances are unequal what do we use?(r/AskStatistics)

submitted 22 hours ago by tukeysbinges to r/AskStatistics

4 comments

multi-mod • 1 point • submitted 16 hours ago

The default t-test in R is Welch's, so they don't really need to check for equality of variance, just the normality assumption.

tukeysbinges • 1 point • submitted 11 hours ago

oh right, t.test has a flag for equal variances in R though?

var.equal = TRUE
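For what it's worth, a quick sketch of the two variants side by side. The data are simulated with deliberately unequal variances; only the `var.equal` argument of `t.test` comes from the thread.

```r
set.seed(42)
a <- rnorm(15, mean = 0, sd = 1)
b <- rnorm(25, mean = 0.5, sd = 3)

# Default: Welch's t-test, which does not assume equal variances.
welch <- t.test(a, b)

# Classic pooled-variance (Student's) t-test; note TRUE must be upper case in R.
pooled <- t.test(a, b, var.equal = TRUE)

print(welch$method)
print(pooled$method)

# Welch uses an adjusted (usually non-integer) df; pooled uses n1 + n2 - 2.
print(welch$parameter)
print(pooled$parameter)
```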

Binary factor with a t.test (this is a two sample or one sample?)(r/AskStatistics)

submitted 22 hours ago by tukeysbinges to r/AskStatistics

2 comments

[standard error] What's going on when we find z_x(r/AskStatistics)

submitted 11 days ago by tukeysbinges to r/AskStatistics

4 comments

Mizzy3030 • 1 point • submitted 11 days ago

There is not enough information in your question to answer...Did you calculate Z_X? If so, how did you calculate it?

tukeysbinges • 1 point • submitted 11 days ago

I'm just wondering in general

I calculated the usual way, z_x = (x - mean_x)/s
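In R that formula looks like the sketch below (the vector `x` is made up). After the transformation the values have mean 0 and standard deviation 1, which is the main thing "going on".

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# z-score: distance from the mean, measured in standard deviations.
z_x <- (x - mean(x)) / sd(x)

print(round(z_x, 3))
print(mean(z_x))   # 0, up to floating point rounding
print(sd(z_x))     # 1

# Base R's scale() does the same centring and scaling.
same_as_scale <- isTRUE(all.equal(z_x, as.numeric(scale(x))))
print(same_as_scale)
```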

Identify SD and regression lines on a graph(r/AskStatistics)

submitted 21 days ago by tukeysbinges to r/AskStatistics

4 comments

efrique • 2 points • submitted 21 days ago

if I was to take the average point in each section of the ellipse (like this) then the regression line would go through those points

This claim is untrue; you can see it in your diagram if you look carefully.

Take the strip to the right of your rightmost vertical line. Where would the mean of the y-values of the points in there be? Way below where the line relating y and x goes.

Your hand-drawn line is basically along the principal axis of the ellipse, but the regression line would join the points where vertical lines touch the left and right extremities of the ellipse

tukeysbinges • 1 point • submitted 13 days ago

Thanks, makes sense

To double check :

we're saying that the regression line will go through the average y-value within each vertical strip.

This makes sense because the average of the y-values in a strip is the best guess for y there, and that's what the regression line is (basically) doing.

efrique • 2 points • submitted 13 days ago

Yes, you're trying to find E(Y|X=x) (which will be a function of x); in the case of a regression line you assume that expectation is linear in x.
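A simulated check of that idea, as a sketch: the slope 0.6, the noise level, and the strip width are all arbitrary choices, and the true relationship is linear by construction, so the strip averages should sit close to the fitted line.

```r
set.seed(7)
n <- 5000
x <- rnorm(n)
y <- 0.6 * x + rnorm(n, sd = 0.8)   # linear E(Y | X = x) by construction

fit <- lm(y ~ x)
b <- unname(coef(fit))

# Cut x into vertical strips and average x and y within each strip.
strip <- cut(x, breaks = seq(-2, 2, by = 0.5))
strip_mean_x <- tapply(x, strip, mean)
strip_mean_y <- tapply(y, strip, mean)

# Height of the regression line at each strip's average x.
fitted_at_strip <- b[1] + b[2] * strip_mean_x

print(round(cbind(strip_mean_x, strip_mean_y, fitted_at_strip), 2))
```

Points outside -2 to 2 are simply left out of the strips here.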

tukeysbinges • 1 point • submitted 12 days ago

cool - thanks

Regression using percentiles(r/AskStatistics)

submitted 21 days ago by tukeysbinges to r/AskStatistics

3 comments

gattia • 2 points • submitted 21 days ago

Ahh... because both of the variables (SAT and GPA) are normalized, i.e. they are both on the same scale. The slope is equal to the correlation coefficient. If they weren't in the same relative units this wouldn't be the case.

So, you already calculated how many standard deviations the 30th percentile is from the mean: -0.52. So, -0.52*0.6 = -0.312. This is the number of standard deviations by which you would expect the GPA to be below the mean. This is essentially your z-score. So, if you go to a z-score lookup table (http://www.z-table.com/) and look for the z-score of -0.312 you will get a proportion of 0.378, i.e. roughly the 38th percentile.

Going back to what I said before: on average you would expect a student to have this score. But since the correlation isn't perfect, you would expect each student to deviate around this score.
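The same arithmetic can be done without a lookup table. This sketch assumes both scores are roughly normal, which is what the z-table step assumes anyway; the correlation 0.6 and the 30th percentile are the figures from the comment.

```r
r <- 0.6                  # correlation between SAT and GPA
sat_percentile <- 0.30    # student's SAT percentile

z_sat <- qnorm(sat_percentile)    # SDs from the mean, about -0.52
z_gpa <- r * z_sat                # predicted GPA z-score, about -0.31
gpa_percentile <- pnorm(z_gpa)    # back to a percentile, about 0.38

print(c(z_sat = z_sat, z_gpa = z_gpa, gpa_percentile = gpa_percentile))
```

The tiny difference from the hand calculation comes from rounding -0.524 to -0.52 before multiplying.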

tukeysbinges • 1 point • submitted 21 days ago

ahh, yeah... that explains what I mixed up

thanks

Create a list of random integers with each value(r/Rlanguage)

submitted 21 days ago by tukeysbinges to r/Rlanguage

17 comments

How to explain the standard deviation(r/AskStatistics)

submitted 22 days ago by tukeysbinges to r/AskStatistics

8 comments

vmsmith • 2 points • submitted 22 days ago

It's been a while since I dabbled in this, but here are some ELI5 thoughts...

That horizontal line is our measure of central tendency. If we had to pick one number to summarize the data, that would be it. That's the signal.

But it's not enough to just state the measure of central tendency. It's like trying to imagine someone's physique by just knowing their weight. In the same way you need a height plus a weight to imagine a physique, you need both a measure of central tendency plus a measure of dispersion to summarize a data set.

Those thetas are the building blocks of our measure of dispersion.

If we are using the mean as our horizontal line (i.e., measure of central tendency) then yes, the sum of the thetas will add to zero. And so by themselves (in this case) they are useless for deriving a measure of dispersion for the entire data set.

To get a first approximation of what the total dispersion is in the data set, we need to square the thetas before adding them up. This is the sum of squares, or the sum of squared residuals. On a certain level it represents all of the noise in the data set.

Usually the sum of squares is pretty big compared to the data set. We divide by n (if dealing with an entire population) or n -1 (if dealing with a sample) to get the mean sum of squares. Now we have some idea of what the average noise level is. This is the variance.

The problem now is that our new unit of measure will not be the same as the original unit of measure...it will be the original unit of measure squared.

Hence, we take the square root to get the standard deviation.

Two notes:

One reason we use the mean as the measure of central tendency is that it is the number that minimizes the sum of squared deviations (and so the variance), which is good.

The reason we divide by n - 1 when working with a sample is because we only have n - 1 degrees of freedom, having already computed the mean.
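The steps above, written out in R on a small made-up data set, in case seeing the numbers helps. The variable names are just for illustration.

```r
x <- c(4, 8, 6, 5, 3, 7)

centre <- mean(x)            # the horizontal line
resid <- x - centre          # the signed distances from the line
sum_resid <- sum(resid)      # adds to zero, so useless as a spread measure

ss <- sum(resid^2)                   # sum of squares
var_pop <- ss / length(x)            # divide by n for a whole population
var_samp <- ss / (length(x) - 1)     # divide by n - 1 for a sample
sd_samp <- sqrt(var_samp)            # back in the original units

print(c(ss = ss, var_pop = var_pop, var_samp = var_samp, sd_samp = sd_samp))

# The mean gives a smaller sum of squares than any other centre.
ss_at <- function(m) sum((x - m)^2)
print(c(at_mean = ss_at(centre), at_median_plus = ss_at(centre + 0.5)))
```

R's built-in `var()` and `sd()` use the n - 1 version.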

tukeysbinges • 2 points • submitted 22 days ago

hrm- so without understanding degrees of freedom i can't really understand this?

Because i can set it up fine and reason it to myself if i use n... but if i use n-1 I know that i'm just going off memory there, the reasoning has gone (for me)

thanks

vmsmith • 2 points • submitted 22 days ago

Suppose I said that five numbers add up to 30, and then asked, "What are five numbers that add up to 30?"

You could choose any four numbers in the known universe as your first four, but having chosen four, the fifth is now no longer an option. It has to be a number that -- when added to the other four -- equals 30.

So you had four (n - 1) degrees of freedom in choosing the five numbers that satisfy the initial condition.

Similarly -- and in very rough terms -- when you compute the variance, the mean is already known.

It often takes a while, and a lot of thinking, to fully get your head around degrees of freedom wrt variance.
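A tiny R illustration of both halves of that argument. The four "free" numbers and the data vector are arbitrary picks.

```r
# Five numbers that must add up to 30: pick any four, the fifth is forced.
first_four <- c(3, 11, -2, 7.5)
fifth <- 30 - sum(first_four)
print(sum(c(first_four, fifth)))   # 30

# Same thing with deviations from the mean: they must add up to 0,
# so once n - 1 of them are known the last one is forced.
x <- c(4, 8, 6, 5, 3, 7)
dev <- x - mean(x)
last_dev_forced <- -sum(dev[-length(dev)])
print(c(actual = dev[length(dev)], forced = last_dev_forced))
```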

tukeysbinges • 2 points • submitted 21 days ago

Thanks.

I get what you're saying with the "choose numbers to sum to 30" thing, sort of...

Basically you're saying :

Choose 4 random numbers,
Choose another number such that the sum is 30

So we just get

    n1 + n2 + n3 + n4 = k

then our fifth number is determined as

                   n5 = 30 - k

As you say, when we're finding this spread about the mean we have the mean ( as otherwise it'd be impossible).

So this means that

(x1 + x2 + x3 + ... + x{n-1} + xn) / n = MEAN

Which we're given.

Therefore

x1 = (n * MEAN) - (x2 + x3 + ... + x{n-1} + xn)

Is determined from the above... So if we know

x2, x3, ... , xn, MEAN

Then x1 has to be some value ( in the same way that it did for the case of summing to 30)

So

What's happened is that when finding the measure of spread about the mean I haven't accounted for the fact that the mean is a function of the values.

Perhaps....?

But then I'm wondering what the problem is with dividing by n still.

It makes sense that I can rearrange and have some fixed value in terms of the others such as

x1 = (n * MEAN) - (x2 + x3 + ... + x{n-1} + xn)

So because I have the mean... I can read off the points

x1, x2, x3, ... , x{n-1}

And then I must have some fixed value for xn in order to have the mean there.

Sorry if that was a bit of a waffle, i think it kinda makes sense though?
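On the "what's the problem with dividing by n" question, one way to see it is a quick simulation. This is only a sketch under assumed settings: samples of size 5 drawn from a normal distribution whose true variance is 1, repeated 10000 times.

```r
set.seed(123)
n <- 5
reps <- 10000

# Each row is one sample of size n from a population with variance 1.
samples <- matrix(rnorm(n * reps, mean = 0, sd = 1), nrow = reps)

# Sum of squared deviations from each sample's OWN mean.
ss <- apply(samples, 1, function(row) sum((row - mean(row))^2))

mean_div_n <- mean(ss / n)           # averages out below the true variance
mean_div_nm1 <- mean(ss / (n - 1))   # averages out close to the true variance

print(c(divide_by_n = mean_div_n, divide_by_n_minus_1 = mean_div_nm1))
```

Because the deviations are taken from the sample's own mean (which is itself fitted to the values), they come out a bit too small on average, and dividing by n - 1 instead of n compensates for that.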

tukeysbinges commented on a post in r/rstats

Googling around didn't get me to any obvious answer... Should be a simple question!(r/rstats)

submitted 22 days ago by VSkwidd to r/rstats

AGINSB • 1 point • submitted 22 days ago

As someone who did it the wrong way, if you are going to use dplyr you should learn it at the beginning and use it as you go. You don't need to learn the entire tidyverse right away but learning base R might be a hindrance to learning the tidyverse.

tukeysbinges • 1 point • submitted 22 days ago

hrm noted... I keep bumping into dplyr syntax and it's annoying me... which probably means I should learn it

tukeysbinges commented on a post in r/rstats

Brief introduction on writing custom C++ functions to run in R(rafaelhwang.com)

submitted 22 days ago by rafaelhwang to r/rstats

tukeysbinges • 2 points • submitted 22 days ago

would be nice if you added a speed comparison as you did here https://rafaelhwang.com/2017/07/25/avoiding-costly-loops-by-approaching-problems-differently-in-r/

thanks for sharing though!

Importance of formal logic for statistics (?)(r/AskStatistics)

submitted 24 days ago by tukeysbinges to r/AskStatistics

3 comments

The_Sodomeister • 2 points • submitted 24 days ago

Formal logic will probably not come up directly, but it will help you reason through some of the thicker theory. I don't see any way that formal logic could be a bad thing, as long as you have the time and energy for it.

tukeysbinges • 1 point • submitted 24 days ago

Yeah, is this something that you took though? And is it something that you see underlying some things ?

I don't think it could do any harm, I was just wondering if it was "surprisingly useful" or something along those lines I guess. Thanks

How to describe the influence of a (extreme) value on a variable?(r/AskStatistics)

submitted 26 days ago by tukeysbinges to r/AskStatistics

16 comments

efrique • 1 point • submitted 26 days ago

You can write ā = (14 + 9999)/7 = 2 + 9999/7, so you can see how it's influencing the average. You could look at the empirical influence function; write your sample as 1,2,3,1,3,4,x and then see how the mean changes with x. (it's 2+x/7)
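That empirical influence idea can be sketched in a few lines of R. The six base values are the ones from the comment; the candidate values for x are arbitrary.

```r
base_values <- c(1, 2, 3, 1, 3, 4)   # sums to 14

# Mean of the sample when a seventh value x is added.
mean_with <- function(x) mean(c(base_values, x))

xs <- c(2, 10, 100, 9999)
observed <- sapply(xs, mean_with)
formula_value <- 2 + xs / 7          # the closed form from the comment

print(cbind(x = xs, observed = observed, formula_value = formula_value))
```

The mean grows in a straight line with x, so a single extreme value can drag it as far as it likes.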

tukeysbinges • 1 point • submitted 26 days ago

Thanks. Another thing if i may - would you describe an 'extreme outlier' as a peculiarity within the data?

I mean... It seems to fit the definition of peculiar (in that it's in some way special)... but I'm not sure if this kind of thing is natural. For some reason my head wants to look for things like variable names being spelled wrong or typos or something like that?

I'm just wondering if the term "peculiarities" makes you think about looking for something particular within a dataset.

thanks

efrique • 1 point • submitted 26 days ago

I typed a reply but it seems to have gotten lost. I'll try to come back later and say something similar

tukeysbinges • 1 point • submitted 26 days ago

ha, damn. Thanks

Calculate conditional probabilities(r/AskStatistics)

submitted 27 days ago by tukeysbinges to r/AskStatistics

1 comment

tukeysbinges commented on a post in r/AskStatistics

Deconstructing Data Science(r/AskStatistics)

submitted 28 days ago by ajva1996 to r/AskStatistics

tukeysbinges • 4 points • submitted 28 days ago

I was just reading about the Dunning-Kruger effect actually, spooky.

Dividing data into categories - how to choose the category size?(r/AskStatistics)

submitted 28 days ago by tukeysbinges to r/AskStatistics

5 comments

der1n1t1ator • 2 points • submitted 28 days ago

I would recommend looking at k-means clustering, where k is 5 in your case. This computes a Voronoi tessellation for your data; put simply, it assigns each data point to the closest of five cluster centres.

tukeysbinges • 1 point • submitted 28 days ago

thanks, i've heard of but never used clustering.

Also - here's some info that I wrote in response to someone else :

the split is going to be carried out on the Total variable, the number of people in the cars. Then there's a Dead variable, which is the number of people that died in that accident.

So I will then find the probability of dying for each accident... which will just be (Dead/Total) I guess (surviving would be 1 - Dead/Total).

Then this will be done for the overall values, so (OverAllDead / OverAllTotal), and these figures won't be in agreement with those from the splits... though how much depends on how I split I guess.

So that I will be splitting the TotalInTheCar variable into 5 categories.... which is distributed as in the histogram of the OP.

Hopefully that's clear
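Roughly what that could look like in R. The data frame is entirely made up, and the five category boundaries are arbitrary; only the `Total` and `Dead` column names follow the description above.

```r
# Hypothetical accident data: people in the car and how many of them died.
accidents <- data.frame(
  Total = c(1, 1, 2, 2, 3, 4, 5, 5, 7, 9),
  Dead  = c(0, 1, 0, 1, 1, 1, 2, 0, 3, 2)
)

# Split Total into five categories (boundaries chosen arbitrarily here).
accidents$SizeGroup <- cut(
  accidents$Total,
  breaks = c(0, 1, 2, 4, 6, Inf),
  labels = c("1", "2", "3-4", "5-6", "7+")
)

# Proportion dead within each category.
by_group <- aggregate(cbind(Dead, Total) ~ SizeGroup, data = accidents, FUN = sum)
by_group$PropDead <- by_group$Dead / by_group$Total

# Proportion dead overall.
overall <- sum(accidents$Dead) / sum(accidents$Total)

print(by_group)
print(overall)
```

The overall figure is the Total-weighted average of the per-category figures, which is why the categories disagree with it by different amounts depending on where the cuts go.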

COOLSerdash • 4 points • submitted 28 days ago

The source of your confusion lies in the fact that you don't seem to know what the ultimate goal of the categorization is. Why do you want to categorize the data in the first place?

There is no best way to categorize. Only more or less appropriate methods depending on the question you want to answer.

tukeysbinges • 1 point • submitted 28 days ago

Good point - the split is going to be carried out on the Total variable, the number of people in the cars. Then there's a Dead variable, which is the number of people that died in that accident.

So I will then find the probability of dying for each accident... which will just be (Dead/Total) I guess (surviving would be 1 - Dead/Total).

Then this will be done for the overall values, so (OverAllDead / OverAllTotal), and these figures won't be in agreement with those from the splits... though how much depends on how I split I guess.

Hopefully this is some more context.

tukeysbinges commented on a post in r/AskStatistics

Im looking for an idiot's guide to statistics(r/AskStatistics)

submitted 28 days ago by wistfulshoegazer to r/AskStatistics

tukeysbinges • 1 point • submitted 28 days ago

Freedman's Statistics (Freedman, Pisani and Purves).

Convert location names to points for map plotting(r/Rlanguage)

submitted 28 days ago by tukeysbinges to r/Rlanguage

2 comments

loess line - this looks off , right?(r/AskStatistics)

submitted 29 days ago by tukeysbinges to r/AskStatistics

11 comments

efrique • 3 points • submitted 28 days ago

If it was "the same point repeated over and over" those points would show up like a single dot. You'd only see the very few points that weren't on that one dot.

[Or if that won't do it, two points each with thousands of repeats and the same y-value but different x-values, both on the line and then set a large loess-span]

tukeysbinges • 1 point • submitted 28 days ago

right... so it's certainly not a line for the dots on that graph then :P

efrique • 1 point • submitted 28 days ago

It would be -- they'd all be points on the graph, just not points you can visually distinguish. Coincident points are a problem with many such displays; if you can't tell that one or two points have a huge weight, the plot can look wrong, but it's simply that we can't see the points are stacked up.

There are some good solutions to that if you know it's a possibility (e.g. see if a jittered plot looks about the same, or use partially transparent colors)

tukeysbinges • 1 point • submitted 28 days ago

OK cheers. It's safe to say that the line in the graph i posted isn't a loess of the scatter points though right?

Adjusting the tick values in ggplot sensibly(r/Rlanguage)

submitted 29 days ago by tukeysbinges to r/Rlanguage

16 comments

infrequentaccismus • 1 point • submitted 29 days ago

Your years are factored so they are discrete. It is treating 1980, 1981, 1982 the same as yellow, purple, blue. There is no value or inherent ordering. As a result, you cannot give this discrete, factored year value to scale_x_continuous(). You must either coerce your years back to numeric, or use scale_x_discrete()
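A minimal illustration of the factor issue and the "coerce back to numeric" route, using a made-up vector of years:

```r
years <- factor(c("1980", "1981", "1982", "1990"))

# Converting a factor straight to numeric gives the internal level codes.
codes <- as.numeric(years)
print(codes)        # 1 2 3 4

# Going via character recovers the actual year values.
years_num <- as.numeric(as.character(years))
print(years_num)    # 1980 1981 1982 1990

# Only the numeric version knows that 1990 is 8 years after 1982.
print(diff(years_num))
```

After that conversion the years can go on a continuous axis; left as a factor they belong on a discrete one.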

tukeysbinges • 1 point • submitted 29 days ago

well I guess I'm asking people here what's best as I'm currently learning, I don't know what I should do in this situation.

I tried to run the code provided by the OP and it didn't work for reasons I outlined.

I don't know what the comment

# Years is not factor, it is numeric

Means in the context - as It seems that ( as you're saying ) years is a factor?

cheers

infrequentaccismus • 2 points • submitted 29 days ago

I don’t see that comment in the op’s code. Which user posted that comment in their code?

In the op’s code, yearLevels was explicitly coded as a factor, which makes it impossible to do math on the year (for example, to see whether there is a different distance between each year level)

tukeysbinges • 2 points • submitted 29 days ago

it is numeric

https://www.reddit.com/r/Rlanguage/comments/7c8f1v/adjusting_the_tick_values_in_ggplot_sensibly/dpnzowu/

tukeysbinges commented on a post in r/AskStatistics

How would you come up with a percentage from normal distrubution?(i.redd.it)

submitted 29 days ago by 87iron to r/AskStatistics

tukeysbinges • 1 point • submitted 29 days ago

Your image is sideways

ggplot - alter the x axis density (?)(r/Rlanguage)

submitted 29 days ago by tukeysbinges to r/Rlanguage

17 comments

riricide • 2 points • submitted 29 days ago

scale_x_continuous(breaks = c([insert desired tick marks on x axis])). Or use the scale_x_discrete version for your data, I think both will work. The breaks argument takes the vector of tick mark values you want to display on the graph.

kameltoe • 1 point • submitted 29 days ago

This is what I would do.

If you don't want to hard code the ticks, I would do something like scale_x_continuous(breaks = df$x %>% pretty(n = num_breaks))
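One base R option for picking tick positions without hard-coding them is `pretty()`, which returns evenly spaced "round" values covering the data. This is a sketch with a made-up x vector; `num_breaks` is just the name used in the comment above, and the ggplot2 line is left as a comment so the snippet runs without any packages.

```r
x <- c(1980, 1983, 1991, 2004, 2015)
num_breaks <- 5

# Roughly num_breaks evenly spaced, round tick values covering range(x).
tick_values <- pretty(x, n = num_breaks)
print(tick_values)

# With ggplot2 loaded, these would be passed to the scale, e.g.
#   scale_x_continuous(breaks = tick_values)
# The pipe form `x %>% pretty(n = num_breaks)` (magrittr/dplyr) means
# exactly the same as `pretty(x, n = num_breaks)`.
```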

tukeysbinges • 1 point • submitted 29 days ago

i'm not familiar with this percentage syntax :S

i created a new post with code here. though, hopefully that's better explained

Creating a years variable from string format(r/Rlanguage)

submitted 1 month ago by tukeysbinges to r/Rlanguage

8 comments

RShinra • 1 point • submitted 1 month ago

it would be: (say your vector is called "dates")

sapply(strsplit(dates,"/"), "[[",3)

str_split is stringr, strsplit is Base R
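Here is that line run on a made-up `dates` vector, assuming the strings look like day/month/year with a four-digit year at the end:

```r
dates <- c("03/11/2015", "25/12/1999", "01/01/2008")

# Split each string on "/" and take the third piece of every result.
years_chr <- sapply(strsplit(dates, "/"), "[[", 3)
years_num <- as.numeric(years_chr)
print(years_num)   # 2015 1999 2008

# Alternative without splitting: grab the last four characters.
last_four <- substr(dates, nchar(dates) - 3, nchar(dates))
print(last_four)
```

Both are vectorised, so no for loop is needed.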

tukeysbinges • 1 point • submitted 1 month ago

oh ok i'll try that, thanks

RShinra • 2 points • submitted 1 month ago

with stringr (package) you can split the string by "/" and pull out the last entry with an apply function, but I will +1 the lubridate solution in case you'd like to do more with the date (which tends to be the case when you look at the data longer)

tukeysbinges • 1 point • submitted 1 month ago

OK thanks I'll try that.

I thought about writing a for loop, casting to string, grabbing the last 4 chars, setting up a new array, but it felt like a bit of a mess.

Group theory...?(r/AskStatistics)

submitted 1 month ago by tukeysbinges to r/AskStatistics

4 comments

AlfredTheFifth • 2 points • submitted 1 month ago

https://en.m.wikipedia.org/wiki/Algebraic_statistics

Read this. I've found a book on it once. But I don't know much about this..

tukeysbinges • 1 point • submitted 1 month ago