somkoala 2 points

Levene’s test checks equality of variances of a given variable between groups defined by a categorical variable (e.g. gender, geography). The variable by which you split is the explanatory variable. It seems that you are using a numeric one. You have two options: change your variable into a categorical one (if it really represents buckets), or use techniques for exploring relationships between quantitative variables.
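
For illustration, here is a minimal sketch of the Levene statistic, assuming the mean-centred form (the Brown-Forsythe variant uses group medians instead). The function name `levene_w` and the two made-up groups are hypothetical; the point is only that the input is a numeric response split into groups by a categorical variable.

```python
from statistics import mean

def levene_w(groups):
    """Levene's W: a one-way ANOVA F statistic on absolute deviations from group means."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every observation from its own group mean
    z = [[abs(y - mean(g)) for y in g] for g in groups]
    z_means = [mean(zi) for zi in z]
    z_grand = mean([v for zi in z for v in zi])
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within

# Hypothetical response values for two categories of the grouping variable
group_a = [1, 2, 3]      # small spread
group_b = [0, 10, 20]    # large spread
w_different = levene_w([group_a, group_b])
w_same = levene_w([[1, 2, 3], [11, 12, 13]])   # same spread, different location
```

A larger W suggests the group variances differ; W is compared against an F distribution with (k - 1, N - k) degrees of freedom, which this sketch does not do.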

tukeysbinges 1 point

that makes sense, thanks!

MrLegilimens 1 point

“I got an error in my code but i didn’t think my code or data was relevant”?

I haven’t done t.test stuff in a while, but it sounds like it wants the grouping variable as a factor, or the variable has more than two levels (something other than 0/1).

tukeysbinges 1 point

yeah it's pretty stupid i admit, I just thought that the problem was contained within what I'd written.

I'm just saying what i thought, not what's right, though :')

No worries though

tukeysbinges commented on a post in r/AskStatistics
tukeysbinges 2 points

I can't help with much I'm afraid, but I can tell you that

paired t test - think about before and after with this. As an example, if you're testing whether a group of dogs were happier in the winter or the summer, it would be paired if it's the same group of dogs, just tested at different times.

1 sample t test - this is when you're testing a sample against a mean you already know. So if you know the mean of something is 9 or whatever, and you have some sample, then you can run a 1 sample test to see if there's a significant difference between the sample mean and the known mean.

CI - the main thing with a CI, from what i recall, is whether or not zero is contained in it: if the interval for a difference contains zero, the difference isn't significant at that confidence level.


no one else has commented - take what i say with a pinch of salt as i'm only learning like you.

just thought i'd mention those
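
A rough sketch of the two test statistics described above, using only the Python standard library. The scores are made up, and the function names `one_sample_t` and `paired_t` are hypothetical. It also shows that a paired test is just a one-sample test on the within-subject differences.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, known_mean):
    """t statistic for testing a sample against an already-known mean."""
    n = len(sample)
    return (mean(sample) - known_mean) / (stdev(sample) / sqrt(n))

def paired_t(first, second):
    """Paired t statistic: a one-sample t on the differences, tested against 0."""
    differences = [b - a for a, b in zip(first, second)]
    return one_sample_t(differences, 0)

# Hypothetical happiness scores for the same five dogs in each season
winter = [5, 6, 7, 8, 9]
summer = [6, 8, 7, 10, 11]
t_paired = paired_t(winter, summer)

# Hypothetical sample tested against the known mean of 9
sample = [8, 10, 11, 9, 12]
t_one = one_sample_t(sample, 9)
```

Each statistic would then be compared against a t distribution with n - 1 degrees of freedom to get a p-value or a confidence interval.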

okonomiyachi 1 point

Thank you so much!

tukeysbinges 1 point

no worries :)

might be worth asking another question with a specific example and explain what's confusing you, as this one might be a bit vague for some people on here, idk

efrique 2 points

if I was to take the average point in each section of the ellipse (like this) then the regression line would go through those points

This claim is untrue; you can see it in your diagram if you look carefully.

Take the strip to the right of your rightmost vertical line. Where would the mean of the y-values of the points in there be? Way below where the line relating y and x goes.

Your hand-drawn line is basically along the principal axis of the ellipse, but the regression line would join the points where vertical lines touch the left and right extremities of the ellipse.

tukeysbinges 1 point

Thanks, makes sense

To double check :

we're saying that the regression line will go through the average of the y-values in each vertical strip.

This makes sense because the average of the points in a strip is the best guess for y there, and that's what the regression line is (basically) doing.

efrique 2 points

Yes, you're trying to find E(Y|X=x) (which will be a function of x); in the case of a regression line you assume that expectation is linear in x.
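
As a small numerical illustration with made-up data, the least-squares slope equals r * sy / sx, so it is flatter than the "SD line" with slope sy / sx, which is close to what one tends to draw by eye along the long axis of the point cloud. The function name below is hypothetical.

```python
from math import sqrt
from statistics import mean

def regression_and_sd_slopes(xs, ys):
    """Return (r, least-squares slope, SD-line slope) for paired data."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    r = sxy / sqrt(sxx * syy)       # correlation coefficient
    slope = sxy / sxx               # least-squares slope, equal to r * sy / sx
    sd_slope = sqrt(syy / sxx)      # slope of the SD line, sy / sx
    return r, slope, sd_slope

# Made-up data with an imperfect positive relationship
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
r, slope, sd_slope = regression_and_sd_slopes(xs, ys)
```

Whenever |r| is below 1, the regression line is less steep than the SD line, which is why the strip means at the edges fall away from a line drawn along the ellipse's long axis.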

tukeysbinges 1 point

cool - thanks

gattia 2 points

Ahh... because both of the variables (SAT and GPA) are standardized, i.e. they are both on the same scale, the slope is equal to the correlation coefficient. If they weren't in the same relative units, this wouldn't be the case.

So, you already calculated how many standard deviations the 30th percentile is from the mean: -0.52. So, -0.52 * 0.6 = -0.312. This is the number of standard deviations by which you would expect the GPA to be below the mean. This is essentially your z-score. So, if you go to a z-score lookup table (http://www.z-table.com/) and look for the z-score of -0.312, you will get a proportion of 0.378 ~ 0.38.

Going back to what I said before: on average you would expect a student to have this score. But since the correlation isn't perfect, you would expect each student to deviate around this score.
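
The same arithmetic can be checked without a lookup table. This is a sketch using Python's `statistics.NormalDist`; the variable names are mine, and the 0.6 correlation and 30th percentile come from the comment above.

```python
from statistics import NormalDist

standard_normal = NormalDist()                 # mean 0, standard deviation 1
r = 0.6                                        # correlation between SAT and GPA
z_sat = standard_normal.inv_cdf(0.30)          # z-score of the 30th percentile, about -0.52
z_gpa = r * z_sat                              # predicted GPA z-score, about -0.31
gpa_percentile = standard_normal.cdf(z_gpa)    # about 0.38
```

The result differs from the table value in the third decimal place only because the table lookup rounds the z-scores first.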

tukeysbinges 1 point

ahh, yeah... that explains what I mixed up

thanks

vmsmith 2 points

It's been a while since I dabbled in this, but here are some ELI5 thoughts...

That horizontal line is our measure of central tendency. If we had to pick one number to summarize the data, that would be it. That's the signal.

But it's not enough to just state the measure of central tendency. It's like trying to imagine someone's physique by just knowing their weight. In the same way you need a height plus a weight to imagine a physique, you need both a measure of central tendency plus a measure of dispersion to summarize a data set.

Those thetas are the building blocks of our measure of dispersion.

If we are using the mean as our horizontal line (i.e., measure of central tendency) then yes, the sum of the thetas will add to zero. And so by themselves (in this case) they are useless for deriving a measure of dispersion for the entire data set.

To get a first approximation of what the total dispersion is in the data set, we need to square the thetas before adding them up. This is the sum of squares, or the sum of squared residuals. On a certain level it represents all of the noise in the data set.

The sum of squares grows with the number of data points, so on its own it is hard to compare across data sets. We divide by n (if dealing with an entire population) or n - 1 (if dealing with a sample) to get the mean sum of squares. Now we have some idea of what the average noise level is. This is the variance.

The problem now is that our new unit of measure will not be the same as the original unit of measure...it will be the original unit of measure squared.

Hence, we take the square root to get the standard deviation.

Two notes:

One reason we use the mean as the measure of central tendency is that it is the number that minimizes the sum of squared deviations, which is good.

The reason we divide by n - 1 when working with a sample is that we only have n - 1 degrees of freedom, having already computed the mean from the same data.
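
The steps above can be followed on a small made-up data set. This is only a sketch, with the deviations from the mean playing the role of the thetas.

```python
from math import sqrt
from statistics import mean, pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]                     # made-up values
centre = mean(data)                                  # the horizontal line: 5
deviations = [x - centre for x in data]              # the "thetas"
sum_of_deviations = sum(deviations)                  # always 0 around the mean
sum_of_squares = sum(d ** 2 for d in deviations)     # 32
population_variance = sum_of_squares / len(data)     # divide by n: 4
sample_variance = sum_of_squares / (len(data) - 1)   # divide by n - 1: 32/7
population_sd = sqrt(population_variance)            # back in original units: 2

# Any other centre gives a larger sum of squares than the mean does
sum_of_squares_off_centre = sum((x - 4.5) ** 2 for x in data)
```

The standard library's `pvariance` and `variance` compute the n and n - 1 versions respectively.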

tukeysbinges 2 points

hrm- so without understanding degrees of freedom i can't really understand this?

Because i can set it up fine and reason it to myself if i use n... but if i use n-1 I know that i'm just going off memory there, the reasoning has gone (for me)

thanks

vmsmith 2 points

Suppose I said that five numbers add up to 30, and then asked, "What are five numbers that add up to 30?"

You could choose any four numbers in the known universe as your first four, but having chosen four, the fifth is now no longer an option. It has to be a number that -- when added to the other four -- equals 30.

So you had four (n - 1) degrees of freedom in choosing the five numbers that satisfy the initial condition.

Similarly -- and in very rough terms -- when you compute the variance, the mean is already known.

It often takes a while, and a lot of thinking, to fully get your head around degrees of freedom wrt variance.
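
One way to see what goes wrong with dividing by n is a quick simulation. This sketch assumes samples of size 5 drawn from a normal distribution whose true variance is 1; the seed, sample size and number of trials are arbitrary choices.

```python
import random

random.seed(0)
n, trials = 5, 20000
divide_by_n = []
divide_by_n_minus_1 = []
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]   # true variance is 1
    m = sum(sample) / n                                # the sample mean, not the true mean
    ss = sum((x - m) ** 2 for x in sample)             # sum of squares about the sample mean
    divide_by_n.append(ss / n)
    divide_by_n_minus_1.append(ss / (n - 1))

avg_n = sum(divide_by_n) / trials                      # comes out near 0.8
avg_n_minus_1 = sum(divide_by_n_minus_1) / trials      # comes out near 1.0
```

Because the sample mean is fitted to the same data, the deviations around it are on average too small by a factor of (n - 1)/n. Dividing by n - 1 compensates for that.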

tukeysbinges 2 points

Thanks.

I get what you're saying with the "choose numbers to sum to 30" thing, sort of...

Basically you're saying :

Choose 4 random numbers,
Choose another number such that the sum is 30

So we just get

    n1 + n2 + n3 + n4 = k

then our fifth number is determined as

                   n5 = 30 - k

As you say, when we're finding this spread about the mean we have the mean ( as otherwise it'd be impossible).

So this means that

(x1 + x2 + x3 + ... + x{n-1} + xn) / n = MEAN

Which we're given.

Therefore

x1 = (n * MEAN) - (x2 + x3 + ... + x{n-1} + xn)

Is determined from the above... So if we know

x2, x3, ... , xn, MEAN

Then x1 has to be some value ( in the same way that it did for the case of summing to 30)

So

What's happened is that when finding the measure of spread about the mean I haven't accounted for the fact that the mean is a function of the values.

Perhaps....?

But then I'm wondering what the problem is with dividing by n still.

It makes sense that I can rearrange and have some fixed value in terms of the others such as

x1 = (n * MEAN) - (x2 + x3 + ... + x{n-1} + xn)

So because I have the mean... I can read off the points

x1, x2, x3, ... , x{n-1}

And then I must have some fixed value for xn in order to have the mean there.


Sorry if that was a bit of a waffle, i think it kinda makes sense though?

tukeysbinges commented on a post in r/rstats
AGINSB 1 point

As someone who did it the wrong way, if you are going to use dplyr you should learn it at the beginning and use it as you go. You don't need to learn the entire tidyverse right away, but learning base R first might be a hindrance to learning the tidyverse.

tukeysbinges 1 point

hrm noted... I keep bumping into dplyr syntax and it's annoying me... which probably means I should learn it

The_Sodomeister 2 points

Formal logic will probably not come up directly, but it will help you reason through some of the thicker theory. I don't see any way that formal logic could be a bad thing, as long as you have the time and energy for it.

tukeysbinges 1 point

Yeah, is this something that you took though? And is it something that you see underlying some things ?

I don't think it could do any harm, I was just wondering if it was "surprisingly useful" or something along those lines I guess. Thanks

efrique 1 point

You can write ā = (14 + 9999)/7 = 2 + 9999/7, so you can see how it's influencing the average. You could look at the empirical influence function; write your sample as 1,2,3,1,3,4,x and then see how the mean changes with x. (it's 2+x/7)
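
A sketch of that empirical influence function, using the six fixed values from the comment. The function name `mean_with` is hypothetical.

```python
from statistics import mean

base = [1, 2, 3, 1, 3, 4]          # the six ordinary observations, summing to 14

def mean_with(x):
    """Mean of the sample when the seventh observation is x."""
    return mean(base + [x])

mean_with_outlier = mean_with(9999)   # equals 2 + 9999/7
mean_with_typical = mean_with(2)      # equals 2 + 2/7
```

The mean moves by x/7 however large x is, so a single extreme value can drag it arbitrarily far.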

tukeysbinges 1 point

Thanks. Another thing if i may - would you describe an 'extreme outlier' as a peculiarity within the data?

I mean... It seems to fit the definition of peculiar (in that it's in some way special)... but I'm not sure if this kind of thing is natural. For some reason my head wants to look for things like variable names being spelled wrong or typos or something like that?

I'm just wondering if the term "peculiarities" makes you think about looking for something particular within a dataset.

thanks

efrique 1 point

I typed a reply but it seems to have gotten lost. I'll try to come back later and say something similar.

tukeysbinges 1 point

ha, damn. Thanks

der1n1t1ator 2 points

I would recommend looking at k-means clustering, where k for you is 5. This computes a Voronoi tessellation of your data; simply put, it clusters your data into five regions, with each point assigned to its nearest cluster centre.
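
A minimal one-dimensional sketch of the idea (Lloyd's algorithm), assuming made-up counts and a simple evenly spaced initialisation. The function name `kmeans_1d` is hypothetical, and a real analysis would more likely use an existing library implementation.

```python
def kmeans_1d(values, k, iterations=100):
    """Cluster numbers into k groups by repeatedly assigning each to its nearest centre."""
    data = sorted(values)
    # Start the centres at evenly spaced positions in the sorted data
    centres = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: abs(v - centres[j]))
            groups[nearest].append(v)
        # Move each centre to the mean of its group (keep it if the group is empty)
        new_centres = [sum(g) / len(g) if g else centres[j] for j, g in enumerate(groups)]
        if new_centres == centres:
            break                      # assignments are stable, so stop
        centres = new_centres
    return centres, groups

# Made-up values of the variable to be split into five categories
totals = [1, 1, 2, 4, 4, 5, 8, 9, 9, 14, 15, 15, 22, 23, 24]
centres, groups = kmeans_1d(totals, k=5)
```

In one dimension the resulting regions are just intervals, with the boundaries halfway between neighbouring centres.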

tukeysbinges 1 point

thanks, i've heard of but never used clustering.

Also - here's some info that I wrote in response to someone else :


the split is going to be carried out on the Total variable, the number of people in the car. Then there's a Dead variable, which is the number of people that died in that accident.

So I will then find the probability of dying for each accident... which will just be (Dead/Total) I guess, with the probability of surviving being 1 - (Dead/Total).

Then this will be done for the overall values, so (OverAllDead / OverAllTotal), and these figures won't be in agreement with those from the splits... though how much depends on how I split I guess.


So I will be splitting the TotalInTheCar variable into 5 categories... it is distributed as in the histogram in the OP.

Hopefully that's clear

COOLSerdash 4 points

The source of your confusion lies in the fact that you don't seem to know what the ultimate goal of the categorization is. Why do you want to categorize the data in the first place?

There is no best way to categorize. Only more or less appropriate methods depending on the question you want to answer.

tukeysbinges 1 point

Good point - the split is going to be carried out on the Total variable, the number of people in the car. Then there's a Dead variable, which is the number of people that died in that accident.

So I will then find the probability of dying for each accident... which will just be (Dead/Total) I guess, with the probability of surviving being 1 - (Dead/Total).

Then this will be done for the overall values, so (OverAllDead / OverAllTotal), and these figures won't be in agreement with those from the splits... though how much depends on how I split I guess.

Hopefully this is some more context.
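
A tiny made-up example of why the per-accident figures and the overall figure disagree. Dead/Total here is the proportion who died in each accident; the tuples and variable names are hypothetical.

```python
# Each tuple is (total people in the car, number who died): made-up values
accidents = [(1, 0), (2, 1), (2, 0), (4, 1), (5, 4)]

# Dead/Total for each accident, then the plain average of those ratios
per_accident_rates = [dead / total for total, dead in accidents]
mean_of_rates = sum(per_accident_rates) / len(per_accident_rates)

# OverAllDead / OverAllTotal: pools everyone, so fuller cars count for more
overall_rate = sum(dead for _, dead in accidents) / sum(total for total, _ in accidents)
```

The two numbers differ because the pooled ratio weights each accident by how many people were in the car, while the average of ratios weights every accident equally.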

efrique 3 points

If it was "the same point repeated over and over" those points would show up like a single dot. You'd only see the very few points that weren't on that one dot.

[Or if that won't do it, two points each with thousands of repeats and the same y-value but different x-values, both on the line and then set a large loess-span]

tukeysbinges 1 point

right... so it's certainly not a line for the dots on that graph then :P

efrique 1 point

It would be: they'd all be points on the graph, just not points you can visually distinguish. Coincident points are a problem with many such displays; if you can't tell that one or two points have a huge weight, the plot can look wrong, but it's simply that we can't see the points are stacked up.

There are some good solutions to that if you know it's a possibility (e.g. see if a jittered plot looks about the same, or use partially transparent colors)
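
A sketch of how one might check for stacked points and jitter them before plotting, using made-up coordinates that mirror the two-heavy-points scenario described earlier. The jitter width of 0.1 is an arbitrary assumption.

```python
import random
from collections import Counter

# Two locations repeated a thousand times each, plus one lone point (made-up)
points = [(1.0, 2.0)] * 1000 + [(3.0, 2.0)] * 1000 + [(2.0, 5.0)]

# Count how many observations sit on each distinct location
counts = Counter(points)
most_stacked = counts.most_common(2)

# Add a little uniform noise so stacked points spread into a visible cloud
random.seed(1)
jittered = [
    (x + random.uniform(-0.1, 0.1), y + random.uniform(-0.1, 0.1))
    for x, y in points
]
```

Plotting `jittered` instead of `points` makes the two heavy locations show up as dense blobs rather than single dots.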

tukeysbinges 1 point

OK cheers. It's safe to say that the line in the graph i posted isn't a loess of the scatter points though right?

infrequentaccismus 1 point

Your years are factored so they are discrete. It is treating 1980, 1981, 1982 the same as yellow, purple, blue. There is no value or inherent ordering. As a result, you cannot give this discrete, factored year value to scale_x_continuous(). You must either coerce your years back to numeric, or use scale_x_discrete()

tukeysbinges 1 point

well I guess I'm asking people here what's best as I'm currently learning, I don't know what I should do in this situation.

I tried to run the code provided by the OP and it didn't work for reasons I outlined.

I don't know what the comment

# Years is not factor, it is numeric

Means in the context - as It seems that ( as you're saying ) years is a factor?

cheers

infrequentaccismus 2 points

I don’t see that comment in the op’s code. Which user posted that comment in their code?

In the op’s code, yearLevels was explicitly coded as a factor, which makes it impossible to do math on the year (for example, to see whether there is a different distance between each year level).

tukeysbinges 2 points
riricide 2 points

scale_x_continuous(breaks=c([insert desired tick marks on x axis])). Or use the scale_x_discrete version for your data; I think both will work. breaks takes the vector of tick mark values you want to display on the graph.

kameltoe 1 point

This is what I would do.

If you don't want to hard code the ticks, I would do something like scale_x_continuous(breaks = pretty(df$x, n = num_breaks))

tukeysbinges 1 point

i'm not familiar with this percentage syntax :S

i created a new post with code here, though; hopefully that's better explained

RShinra 1 point

it would be: (say your vector is called "dates")

sapply(strsplit(dates,"/"), "[[",3)

str_split is stringr, strsplit is Base R

tukeysbinges 1 point

oh ok i'll try that, thanks

RShinra 2 points

with stringr (package) you can split the string by "/" and pull out the last entry with an apply function, but I will +1 the lubridate solution in case you'd like to do more with the date (which tends to be the case when you look at the data longer)
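
For comparison, the same two approaches (split on "/" and take the last piece, or parse the string as a real date) look like this in Python. This is only a sketch: the sample strings and the day/month/year ordering are assumptions, since the thread does not show the actual data.

```python
from datetime import datetime

# Made-up date strings; the day/month/year order is an assumption
dates = ["23/10/2017", "01/02/2016", "15/07/2018"]

# Approach 1: split on "/" and keep the last piece (the year stays a string)
years_by_split = [d.split("/")[-1] for d in dates]

# Approach 2: parse as a date, which keeps the month and day available for later
years_by_parsing = [datetime.strptime(d, "%d/%m/%Y").year for d in dates]
```

Parsing is the analogue of the lubridate suggestion: slightly more setup, but the full date is then usable for anything else.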

tukeysbinges 1 point

OK thanks I'll try that.

I thought about writing a for loop, casting to string, grabbing the last 4 chars, setting up a new array, but it felt like a bit of a mess.

AlfredTheFifth 2 points

https://en.m.wikipedia.org/wiki/Algebraic_statistics

Read this. I found a book on it once, but I don't know much about this.

tukeysbinges 1 point

ah thanks - apparently decision theory might use it or something as well

DCI_John_Luther 2 points

It's used a bit in equivariance; you can look at Theory of Point Estimation for details (pdf on google).

tukeysbinges 1 point

thanks!
