
Levenes test for variance - if the variances are unequal what do we use? by tukeysbinges in AskStatistics

[–]multi-mod 0 points1 point  (0 children)

The default t-test in R is Welch's, so they don't really need to check for equality of variance, just the normality assumption.
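For example, in R (the vectors below are made up just for illustration):

    # R's t.test() uses var.equal = FALSE by default, i.e. Welch's t-test
    set.seed(1)
    x <- rnorm(30, mean = 0, sd = 1)
    y <- rnorm(30, mean = 0.5, sd = 2)
    t.test(x, y)                    # Welch two-sample t-test (the default)
    t.test(x, y, var.equal = TRUE)  # classic Student's t-test, if you really wanted it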

Binary factor with a t.test (this is a two sample or one sample?) by tukeysbinges in AskStatistics

[–]multi-mod 0 points1 point  (0 children)

Your data structure and the question you are asking are a little confusing. First, by mean of person A and B, are you referring to the mean of the continuous variable "y" you mentioned? Second, you have a dichotomous variable "x", but you don't explain what it is. How does it relate to person A and B?

Help with stats problem. by dryrubs in AskStatistics

[–]multi-mod 1 point2 points  (0 children)

It doesn't specify that you need to do all possible comparisons with each test, or include all factors in one model. They probably just want you to do, for example, a t-test comparing two of the continuous variables.

Question for a research paper by Esem111 in AskStatistics

[–]multi-mod 1 point2 points  (0 children)

You need to provide us with information about the experiment. We can't tell you anything unless we know what you were doing.

Null hypothesis by ProBKEmployee in AskStatistics

[–]multi-mod 4 points5 points  (0 children)

"p value (probability that the null hypothesis is correct)."

The gods have abandoned us.

How to account for bias in a test and control experiment? by grovebost1 in AskStatistics

[–]multi-mod 0 points1 point  (0 children)

If you use a mixed effect model you don't need to worry about adjusting the effect size before running the model. That gets taken care of by the random effect.

How to account for bias in a test and control experiment? by grovebost1 in AskStatistics

[–]multi-mod 1 point2 points  (0 children)

In this case you have paired data, because you are checking the same households before and after an intervention (an ad). As your colleagues have stated, it's important to consider the effect size before the intervention.

For your particular data a good starting point would be a linear mixed-effects model (a linear regression with random effects in this case), with household as the random effect.
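A rough sketch of that model in R with lme4 (the data frame and column names are hypothetical):

    # assumes a long-format data frame `dat` with columns `response`,
    # `period` ("before"/"after"), and `household`
    library(lme4)

    fit <- lmer(response ~ period + (1 | household), data = dat)
    summary(fit)  # the `period` coefficient is the before/after change, with
                  # household-to-household differences absorbed by the random intercept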

Understanding strength of a particular value of a variable when compared against representative samples - Machine Learning? by [deleted] in AskStatistics

[–]multi-mod 0 points1 point  (0 children)

Feature importance will depend on the method used for classification. Some classification methods, such as logistic regression, reduce the relationship between the IV and the DV to a formula, so the model has the magnitude and direction of the relationship built in. Other methods, such as decision trees, don't build a formula for the relationship, which makes feature importance more of a nebulous process. In general, a combination of plotting the predicted probabilities over the range of the IV and looking at an importance index such as Gini importance is used.
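A rough sketch of both approaches in R (the data frame and column names are hypothetical):

    # assumes a data frame `dat` with a 0/1 outcome `y` and predictors `x1`, `x2`
    library(randomForest)

    # tree-based importance: mean decrease in Gini impurity per feature
    rf <- randomForest(factor(y) ~ x1 + x2, data = dat, importance = TRUE)
    importance(rf)

    # predicted probability over the range of one IV from a logistic model
    logit <- glm(y ~ x1, data = dat, family = binomial)
    grid  <- data.frame(x1 = seq(min(dat$x1), max(dat$x1), length.out = 100))
    plot(grid$x1, predict(logit, newdata = grid, type = "response"),
         type = "l", xlab = "x1", ylab = "P(y = 1)")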

Is cross validation error on a training set a good predictor of error if the model was run on the test dataset? by ayeandone in AskStatistics

[–]multi-mod 0 points1 point  (0 children)

Cross-validation on your training set is generally used to evaluate your model parameters and the training process itself, not to measure the performance of your model on new data.

If you want to obtain a confidence interval on a performance metric for a model, such as accuracy or precision, your best bet would be bootstrapping your test set. This will be closer to what you wanted to know - how your model handles new data.

With the bootstrap method you take your test set and then make 1000+ new test sets by generating new samples through random sampling with replacement from your original test set. For all of these new test sets you then find your model performance metrics, such as accuracy. You will then end up with, as an example, 1000+ accuracy measurements. To get the simple 95% CI you can then take the 0.025 and 0.975 quantiles from these values to get the lower and upper limit of the CI.
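A minimal sketch of that in R, assuming you already have vectors `truth` (the true test labels) and `pred` (your model's predictions on the test set):

    set.seed(1)
    n_boot <- 1000
    acc <- replicate(n_boot, {
      idx <- sample(seq_along(truth), replace = TRUE)  # resample test rows with replacement
      mean(pred[idx] == truth[idx])                    # accuracy on the resampled test set
    })
    quantile(acc, c(0.025, 0.975))                     # simple percentile 95% CI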

Help with a basic problem by CatQueenOfTrash in AskStatistics

[–]multi-mod 0 points1 point  (0 children)

For each test you need to have an observed and an expected value.

Let's say that you wanted to compare r versus b. r counties have a total population of 1000 and combined 10 fireworks were set off. b has a population of 10,000 and 40 fireworks were set off.

We want to know whether the proportion of fireworks set off in r is different than b. The observed for r is 10, and the expected is 10 since we are using that as the reference. The observed for b is 40 and the expected would be the value you would see in b if the proportion was the same as r. So for r it is 1% of the population, meaning the expected in b is .01 * 10,000 = 100.
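With those counts, one way to run the comparison in R is a test of equal proportions (prop.test runs the underlying chi-squared test for you):

    prop.test(x = c(10, 40), n = c(1000, 10000))

    # equivalent 2x2 table form: (fireworks, no fireworks) for r and b
    chisq.test(matrix(c(10, 990, 40, 9960), nrow = 2, byrow = TRUE))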

Help with a basic problem by CatQueenOfTrash in AskStatistics

[–]multi-mod 0 points1 point  (0 children)

Your response variable in this case is a count (meaning the values can only be integers). ANOVAs are generally for continuous data, where fractional values are possible.

The simple solution for you would be to perform 3 pairwise chi-squared tests: r v. b, r v. n, and b v. n. You can then make conclusions such as whether counties in group r tended to shoot off more fireworks than counties in group b. You are doing so few comparisons that a multiple-comparison correction is not needed.

A more elegant solution would be to do this all in one regression model. Since you have count data you could perform a Poisson regression, or the more robust negative binomial regression.

In R you can perform a negative binomial regression with the MASS package. The formula would be glm.nb(fireworks ~ group + population, data=data). The p-value and coefficient you get for group are for the group effect while holding population constant. Please note that you would need to add contrasts to get all pairwise comparisons in one model.
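A slightly fuller sketch (the column names match the hypothetical data frame above):

    # assumes `data` has columns `fireworks` (count), `group` (factor: r/b/n),
    # and `population`
    library(MASS)

    fit <- glm.nb(fireworks ~ group + population, data = data)
    summary(fit)    # group coefficients are relative to the reference level,
                    # holding population constant
    exp(coef(fit))  # rate ratios, easier to interpret than log-scale coefficients

    # one option for all pairwise group comparisons in a single model:
    # library(multcomp); summary(glht(fit, linfct = mcp(group = "Tukey")))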

Help with a basic problem by CatQueenOfTrash in AskStatistics

[–]multi-mod 0 points1 point  (0 children)

What do you mean exactly when you say that the population is influencing the response variable? How did you conclude this?

Are drunk walking patterns truly random or are there patterns? by etevian in AskStatistics

[–]multi-mod 0 points1 point  (0 children)

If it was random, you would have close to no chance of ever reaching your intended destination.

Comparing two supposedly random datasets by Felewin in AskStatistics

[–]multi-mod 1 point2 points  (0 children)

You really shouldn't use a t-test for a dichotomous dependent variable.

A better solution would be a binomial logistic regression. In R this would be under the generalized linear model (glm) command.
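A minimal sketch, assuming a data frame `dat` with a 0/1 outcome `y` and a predictor `x` (names hypothetical):

    fit <- glm(y ~ x, data = dat, family = binomial)
    summary(fit)  # the coefficient for x is on the log-odds scale; its p-value
                  # tests whether x is associated with the outcome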

Finding new SD by arnolda2 in AskStatistics

[–]multi-mod 2 points3 points  (0 children)

Your teacher assigns you homework so you can learn the material.

Should be pretty easy for you guys to figure out but I'm dumb apparently by theIinhappiness in AskStatistics

[–]multi-mod 1 point2 points  (0 children)

If you want to avoid zeroing out in a martingale betting system, you want to minimize your starting bet and maximize your starting pool. This will increase the probability you don't zero out over a fixed length of rolls, but will also decrease the rate of net winnings over that fixed number of rolls. Essentially you are less likely to zero out, but less likely to make a lot of money.
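If you want to see this for yourself, here is a rough simulation sketch in R with made-up parameters (a fair 50/50 game, doubling the bet after each loss):

    simulate_martingale <- function(bankroll, base_bet, n_rounds, p_win = 0.5) {
      bet <- base_bet
      for (i in seq_len(n_rounds)) {
        if (bet > bankroll) return(c(busted = 1, bankroll = bankroll))  # can't cover the next bet
        if (runif(1) < p_win) {
          bankroll <- bankroll + bet
          bet <- base_bet            # reset after a win
        } else {
          bankroll <- bankroll - bet
          bet <- bet * 2             # double after a loss
        }
      }
      c(busted = 0, bankroll = bankroll)
    }

    set.seed(1)
    runs <- t(replicate(10000, simulate_martingale(bankroll = 1000, base_bet = 1, n_rounds = 200)))
    mean(runs[, "busted"])    # estimated probability of going bust over the fixed run
    mean(runs[, "bankroll"])  # average ending bankroll

Varying bankroll and base_bet in the call above shows the trade-off described: a bigger starting pool relative to the base bet lowers the bust probability over the fixed run.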

Help with p-value, alpha and two tailed test by self_love23 in AskStatistics

[–]multi-mod 1 point2 points  (0 children)

The alpha stays the same; it's the resulting p-value that changes. You first get the one-tailed p-value, and then double it.

If your p-value of 0.03 was determined with a one-tailed test, then it would be 0.06 for a two-tailed test, so you would fail to reject the null hypothesis with an alpha of 0.05.

Is a logistic regression right for my study by green251 in AskStatistics

[–]multi-mod 1 point2 points  (0 children)

Yep. If longer fish tended to spawn more often then you would end up with a positive coefficient and a p-value below your threshold. If longer fish tended to spawn less, you would get a negative coefficient and a p-value below your threshold. Finally if fish length had no effect on spawning you would get a coefficient close to 0 and a p-value above your threshold.

Another consideration is that you can add other potentially interesting predictors, such as fish color, weight, etc. Finally, you can add potentially confounding collection biases as random effects (like person collecting, month being collected, etc.) to control for them.
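A hedged sketch of that kind of model with lme4 (all names are hypothetical):

    # assumes a data frame `fish` with columns `spawned` (0/1), `length`, `weight`,
    # `collector`, and `month`
    library(lme4)

    fit <- glmer(spawned ~ length + weight + (1 | collector) + (1 | month),
                 data = fish, family = binomial)
    summary(fit)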

Is a logistic regression right for my study by green251 in AskStatistics

[–]multi-mod 3 points4 points  (0 children)

A logistic regression will provide you with a log odds ratio for your predictor, i.e. how the odds of your outcome category (fish spawning) change when you increase your continuous predictor (length) by one unit.

You will also usually be provided with a p-value, which will help you decide whether to reject the null hypothesis that the predictor's coefficient is 0 (meaning that there is likely no association).

Multiple Possible Outcomes on Multiple Dice for Multiple Trials by just_for_you_32 in AskStatistics

[–]multi-mod 0 points1 point  (0 children)

You can still use the binomial distribution, but the probability of success will be P(p). You can do this because you can consider getting "p" as your success, and getting anything else as your failure.
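For example, in R (the 1/6 below is just a placeholder; plug in whatever P(p) actually is for your dice):

    n <- 10   # number of dice rolled (hypothetical)
    p <- 1/6  # probability a single die shows "p" (hypothetical)
    dbinom(3, size = n, prob = p)      # P(exactly 3 dice show "p")
    1 - pbinom(2, size = n, prob = p)  # P(at least 3 dice show "p")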

How to find the probability of a single unit in a large population dipping below a certain value based on a sample? by emperorvinayak in AskStatistics

[–]multi-mod 1 point2 points  (0 children)

The first thing you should do with data is to visualize it. In this case, a histogram will help you determine the shape of your distribution. For ease of explanation, let's assume that you make a histogram of your data and it looks relatively normally distributed. Because of the assumption of normality, you can generate a normal distribution with your mean and sd. Afterwards, you can find the probability of obtaining a pellet of 1.2 grams or less by integrating from 0 to 1.2 on your distribution, or by converting 1.2 to a z-score (how many standard deviations away from the mean the value is) and looking up the p-value in a table. Under this assumption you would expect to get a pellet of 1.2 grams or less ~0.4% of the time. That means if you had 330 pellets, you would expect, on average, to get at least one that is below your threshold.
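In R that calculation looks like the following (the mean and sd here are made up; substitute your sample values):

    mu    <- 1.5   # hypothetical sample mean (grams)
    sigma <- 0.1   # hypothetical sample sd (grams)
    pnorm(1.2, mean = mu, sd = sigma)  # P(pellet <= 1.2 g) under normality

    # equivalent z-score approach
    z <- (1.2 - mu) / sigma
    pnorm(z)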

If you want your data to be representative of the population, you need to ensure that your sample is accurately capturing the population. The easiest way for you to do this for your "experiment" would be to measure more pellets. The more pellets you measure the more confidence you will have in your sample and population parameters (mean and sd in your case). The confidence in your parameters can be captured multiple ways, but an easy example is the classic confidence interval which helps show the confidence in your mean value.

If your data is not normally distributed, that is an entirely different story. Where you go from there will depend a lot on the shape of the distribution and what information you want to get from the data.

R vs Python by AnonBiomed2 in labrats

[–]multi-mod 5 points6 points  (0 children)

Bioconductor on R has a rich set of genomics tools, so if you want the most bang for your buck I would start with R. Both R and Python use similar syntax for making scripts, so you will be teaching yourself the fundamentals of the other language no matter which one you pick.

Binomial vs Geometric Distribution by [deleted] in AskStatistics

[–]multi-mod 0 points1 point  (0 children)

Binomial distributions describe, for example, how many heads you would expect to get if you flipped a fair coin 100 times. The geometric distribution asks how many tails you would expect to flip before getting your first heads. The probability of success for these distributions doesn't need to be 0.5; it can be whatever is appropriate.

The normal distribution describes continuous data, while the binomial and geometric distributions describe discrete data (you can't have a fractional number of coin flips, for example).
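A quick illustration in R with a fair coin (p = 0.5):

    dbinom(60, size = 100, prob = 0.5)  # binomial: P(exactly 60 heads in 100 flips)
    dgeom(3, prob = 0.5)                # geometric: P(3 tails before the first heads)

    100 * 0.5        # binomial mean: expected heads in 100 flips
    (1 - 0.5) / 0.5  # geometric mean: expected tails before the first heads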