This criticism only applies to two-tailed tests, where the null hypothesis is "Things are exactly the same" and the alternative is "Things are different." Presumably these critics think it would be okay to do a one-tailed test with a null hypothesis like "Foot length of male chickens is the same as, or less than, that of females," because the null hypothesis that male chickens have smaller feet than females could be true. So if you're worried about this issue, you could think of a two-tailed test, where the null hypothesis is that things are the same, as shorthand for doing two one-tailed tests. A significant rejection of the null hypothesis in a two-tailed test would then be the equivalent of rejecting one of the two one-tailed null hypotheses.
In the olden days, when people looked up P values in printed tables, they would report the results of a statistical test as "PPP>0.10", etc. Nowadays, almost all computer statistics programs give the exact P value resulting from a statistical test, such as P=0.029, and that's what you should report in your publications. You will conclude that the results are either significant or they're not significant; they either reject the null hypothesis (if P is below your pre-determined significance level) or don't reject the null hypothesis (if P is above your significance level). But other people will want to know if your results are "strongly" significant (P much less than 0.05), which will give them more confidence in your results than if they were "barely" significant (P=0.043, for example). In addition, other researchers will need the exact P value if they want to combine your results with others into a .
The basis for many nonparametric tests involves discarding the actual numbers in the dataset and replacing them with numerical rankings from lowest to highest. Thus, the dataset 7, 12, 54, 103 would be replaced with 1, 2, 3, and 4, respectively. This may sound odd, but the general method, referred to as a , is well grounded. In the case of the Mann-Whitney test, which is used to compare two unpaired groups, data from both groups are combined and ranked numerically (1, 2, 3, … ). Then the rank numbers are sorted back into their respective starting groups, and a is tallied for each group. If both groups were sampled from populations with identical means (the null hypothesis), then there should be relatively little difference in their mean ranks, although chance sampling will lead to some differences. Put another way, high- and low-ranking values should be more or less evenly distributed between the two groups. Thus for the Mann-Whitney test, the -value will answer the following question: Based on the mean ranks of the two groups, what is the probability that they are derived from populations with identical means? As for parametric tests, a -value ≤ 0.05 is traditionally accepted as statistically significant.
It is also worth pointing out that there is another way in which the -test could be used for this analysis. Namely, we could take the ratios from the first three blots (3.33, 3.41, and 2.48), which average to 3.07, and carry out a one-sample two-tailed -test. Because the null hypothesis is that there is no difference in the expression of protein X between wild-type and backgrounds, we would use an expected ratio of 1 for the test. Thus, the -value will tell us the probability of obtaining a ratio of 3.07 if the expected ratio is really one. Using the above data points, we do in fact obtain = 0.02, which would pass our significance cutoff. In fact, this is a perfectly reasonable use of the -test, even though the test is now being carried out on ratios rather than the unprocessed data. Note, however, that changing the numbers only slightly to 3.33, 4.51, and 2.48, we would get a mean of 3.44 but with a corresponding -value of 0.054. This again points out the problem with -tests when one has very small sample sizes and moderate variation within samples.
We specify an alternative hypothesis as a candidate should the evidence dictate that the null hypothesis be rejected. Non-directional (two-sided) alternative hypotheses state that the population parameter is different from the value stated in the null hypothesis; directional (one-sided) alternative hypotheses state that the population parameter is greater (or lesser) than the null value.
You should decide whether to use the one-tailed or two-tailed probability before you collect your data, of course. A one-tailed probability is more powerful, in the sense of having a lower chance of false negatives, but you should only use a one-tailed probability if you really, truly have a firm prediction about which direction of deviation you would consider interesting. In the chicken example, you might be tempted to use a one-tailed probability, because you're only looking for treatments that decrease the proportion of worthless male chickens. But if you accidentally found a treatment that produced 87% male chickens, would you really publish the result as "The treatment did not cause a significant decrease in the proportion of male chickens"? I hope not. You'd realize that this unexpected result, even though it wasn't what you and your farmer friends wanted, would be very interesting to other people; by leading to discoveries about the fundamental biology of sex-determination in chickens, in might even help you produce more female chickens someday. Any time a deviation in either direction would be interesting, you should use the two-tailed probability. In addition, people are skeptical of one-tailed probabilities, especially if a one-tailed probability is significant and a two-tailed probability would not be significant (as in our chocolate-eating chicken example). Unless you provide a very convincing explanation, people may think you decided to use the one-tailed probability after you saw that the two-tailed probability wasn't quite significant, which would be cheating. It may be easier to always use two-tailed probabilities. For this handbook, I will always use two-tailed probabilities, unless I make it very clear that only one direction of deviation from the null hypothesis would be interesting.
How do you know which hypothesis to put in H0 and which one to put in Ha? Typically, the null hypothesis says that nothing new is happening; the previous result is the same now as it was before, or the groups have the same average (their difference is equal to zero). In general, you assume that people’s claims are true until proven otherwise. So the question becomes: Can you prove otherwise? In other words, can you show sufficient evidence to reject H0?
Most statistical tests culminate in a statement regarding the -value, without which reviewers or readers may feel shortchanged. The -value is commonly defined as the probability of obtaining a result (more formally a ) that is at least as extreme as the one observed, assuming that the is true. Here, the specific null hypothesis will depend on the nature of the experiment. In general, the null hypothesis is the statistical equivalent of the “innocent until proven guilty” convention of the judicial system. For example, we may be testing a mutant that we suspect changes the ratio of male-to-hermaphrodite cross-progeny following mating. In this case, the null hypothesis is that the mutant does not differ from wild type, where the sex ratio is established to be 1:1. More directly, the null hypothesis is that the sex ratio in mutants is 1:1. Furthermore, the complement of the null hypothesis, known as the or , would be that the sex ratio in mutants is different than that in wild type or is something other than 1:1. For this experiment, showing that the ratio in mutants is different than 1:1 would constitute a finding of interest. Here, use of the term “significantly” is short-hand for a particular technical meaning, namely that the result is , which in turn implies only that the observed difference appears to be real and is not due only to random chance in the sample(s). . Moreover, the term significant is not an ideal one, but because of long-standing convention, we are stuck with it. Statistically or statistically may in fact be better terms.
Getting back to -values, let's imagine that in an experiment with mutants, 40% of cross-progeny are observed to be males, whereas 60% are hermaphrodites. A statistical significance test then informs us that for this experiment, = 0.25. We interpret this to mean that even if there was no actual difference between the mutant and wild type with respect to their sex ratios, we would still expect to see deviations as great, or greater than, a 6:4 ratio in 25% of our experiments. Put another way, if we were to replicate this experiment 100 times, random chance would lead to ratios at least as extreme as 6:4 in 25 of those experiments. Of course, you may well wonder how it is possible to extrapolate from one experiment to make conclusions about what (approximately) the next 99 experiments will look like. (Short answer: There is well-established statistical theory behind this extrapolation that is similar in nature to our discussion on the SEM.) In any case, a large -value, such as 0.25, is a red flag and leaves us unconvinced of a difference. It is, however, possible that a true difference exists but that our experiment failed to detect it (because of a small sample size, for instance). In contrast, suppose we found a sex ratio of 6:4, but with a corresponding -value of 0.001 (this experiment likely had a much larger sample size than did the first). In this case, the likelihood that pure chance has conspired to produce a deviation from the 1:1 ratio as great or greater than 6:4 is very small, 1 in 1,000 to be exact. Because this is very unlikely, we would conclude that the null hypothesis is not supported and that mutants really do differ in their sex ratio from wild type. Such a finding would therefore be described as statistically significant on the basis of the associated low -value.
The probability that was calculated above, 0.030, is the probability of getting 17 or fewer males out of 48. It would be significant, using the conventional PP=0.03 value found by adding the probabilities of getting 17 or fewer males. This is called a one-tailed probability, because you are adding the probabilities in only one tail of the distribution shown in the figure. However, if your null hypothesis is "The proportion of males is 0.5", then your alternative hypothesis is "The proportion of males is different from 0.5." In that case, you should add the probability of getting 17 or fewer females to the probability of getting 17 or fewer males. This is called a two-tailed probability. If you do that with the chicken result, you get P=0.06, which is not quite significant.