Pearson distribution (chi-square distribution). Classical methods of statistics: the chi-square test

The chi-square distribution is one of the most widely used distributions in statistics for testing statistical hypotheses. One of the most powerful goodness-of-fit tests, the Pearson chi-square test, is built on it.

A goodness-of-fit test is a test of the hypothesis that an unknown distribution follows an assumed law.

The χ² (chi-square) test can be used to test hypotheses about a variety of distribution laws; this is its advantage.

The test statistic is computed as

χ² = Σ (m − m′)² / m′,

where m and m′ are the empirical and theoretical frequencies, respectively, of the distribution in question, and n is the number of degrees of freedom.

To perform the test, we need to compare the empirical (observed) frequencies with the theoretical ones (calculated under the assumption of a normal distribution).

If the empirical frequencies coincide exactly with the calculated or expected frequencies, Σ(E − T) = 0 and the χ² statistic is also zero. If Σ(E − T) is not zero, this indicates a discrepancy between the calculated and empirical frequencies of the series. In such cases it is necessary to assess the significance of the χ² statistic, which in theory can vary from zero to infinity. This is done by comparing the actually obtained value χ²_obs with its critical value χ²_crit. The null hypothesis, i.e. the assumption that the discrepancy between the empirical and theoretical (expected) frequencies is random, is rejected if χ²_obs is greater than or equal to χ²_crit for the accepted significance level (α) and number of degrees of freedom (n).

The distribution of probable values of the random variable χ² is continuous and asymmetric. It depends on the number of degrees of freedom (n) and approaches the normal distribution as the number of observations increases. Therefore applying the χ² test to discrete distributions involves some error, which is especially noticeable in small samples. To obtain more accurate estimates, the sample arranged in a variation series should contain at least 50 observations. Correct application of the χ² test also requires that the frequencies in the extreme classes be no less than 5; if a class contains fewer than 5, its frequency is combined with that of a neighboring class so that the total is at least 5. After frequencies are combined, the number of classes (N) decreases, and the number of degrees of freedom is determined from this reduced number of classes, taking into account the number of restrictions on the freedom of variation.



Since the accuracy of the χ² test largely depends on how precisely the theoretical frequencies (T) are computed, unrounded theoretical frequencies should be used when forming the differences between empirical and calculated frequencies.

As an example, let's take a study published on a website dedicated to the use of statistical methods in the humanities.

The Chi-square test allows you to compare frequency distributions regardless of whether they are normally distributed or not.

Frequency refers to the number of occurrences of an event. One usually deals with frequencies when variables are measured on a nominal scale and their other characteristics are impossible or problematic to use, in other words, when a variable is qualitative. Many researchers also convert test scores into levels (high, medium, low) and build tables of score distributions to find the number of people at each level. To prove that the number of people really is greater (or smaller) in one of the levels (categories), the chi-square test is also used.

Let's look at the simplest example.

A test measuring self-esteem was conducted among younger adolescents. The test scores were converted into three levels: high, medium, low. The frequencies were distributed as follows:

High (H): 27 people

Medium (M): 12 people

Low (L): 11 people

It is obvious that the majority of children have high self-esteem, but this needs to be proven statistically. To do this, we use the Chi-square test.

Our task is to check whether the obtained empirical data differ from a theoretically equiprobable distribution. To do this we need to find the theoretical frequencies; in our case these are the equiprobable frequencies, found by adding all the frequencies and dividing by the number of categories.

In our case:

(H + M + L)/3 = (27 + 12 + 11)/3 = 16.67

Formula for calculating the chi-square test:

χ² = Σ (E − T)² / T

We build the table:

Level | E (empirical) | T (theoretical) | (E − T)²/T
High | 27 | 16.67 | 6.41
Medium | 12 | 16.67 | 1.31
Low | 11 | 16.67 | 1.93

The sum of the last column is χ² = 9.64.

Now we need to find the critical value of the test using the table of critical values (Table 1 in the Appendix). For this we need the number of degrees of freedom (n).

n = (R - 1) * (C - 1)

where R is the number of rows in the table, C is the number of columns.

In our case there is only one column (the original empirical frequencies) and three rows (the categories), so the formula simplifies: the column factor drops out.

n = R − 1 = 3 − 1 = 2

For the error probability p≤0.05 and n = 2, the critical value is χ2 = 5.99.

The obtained empirical value is greater than the critical value: the differences between the frequencies are significant (χ² = 9.64; p ≤ 0.05).

As you can see, calculating the criterion is very simple and does not take much time. The practical value of the chi-square test is enormous. This method is most valuable when analyzing responses to questionnaires.
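As a cross-check, the same computation is easy to do programmatically. Below is a minimal sketch in Python; scipy is an assumption of this illustration (the example itself needs only a table), and scipy.stats.chisquare uses equal expected frequencies by default, exactly as here.

from scipy.stats import chisquare

# Observed frequencies: high, medium, low self-esteem
observed = [27, 12, 11]

# With no f_exp argument, chisquare assumes equal expected frequencies,
# i.e. sum(observed) / 3 = 16.67 per category, as computed above
stat, p_value = chisquare(observed)
print(f"chi-square = {stat:.2f}, p-value = {p_value:.4f}")
# chi-square = 9.64, p-value = 0.0081 -> significant at p <= 0.05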


Let's look at a more complex example.

Suppose a psychologist wants to know whether it is true that teachers are biased against boys compared with girls, that is, more likely to praise the girls. To do this, the psychologist analyzed the character references teachers wrote for students, counting the frequency of occurrence of three words: "active," "diligent," "disciplined" (synonyms of these words were counted as well). The word-frequency data were entered into a table:

To process the obtained data we use the chi-square test.

To do this, we will build a table of the distribution of empirical frequencies, i.e. those frequencies that we observe:

Theoretically, we expect the frequencies to be distributed equally, i.e. each word's frequency should be split proportionally between boys and girls. Let us build the table of theoretical frequencies. To do this, multiply each row sum by each column sum and divide the result by the grand total (s).

The final table for calculations will look like this:

χ² = Σ (E − T)² / T

n = (R − 1)(C − 1), where R is the number of rows and C the number of columns of the table; in our case n = (3 − 1)(2 − 1) = 2.

In our case, chi-square = 4.21; n = 2.

Using the table of critical values of the test, we find: for n = 2 and an error level of 0.05 the critical value is χ² = 5.99.

The resulting value is less than the critical value, which means the null hypothesis is not rejected.

Conclusion: teachers do not attach importance to the gender of the child when writing characteristics for him.
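The word-frequency tables themselves are not reproduced in the text above, so the counts in the Python sketch below are purely illustrative (they do not reproduce χ² = 4.21). The sketch shows how the expected frequencies, row sum × column sum / grand total, and the statistic could be computed with scipy:

import numpy as np
from scipy.stats import chi2_contingency

# Illustrative counts only (the original table is not given in the text):
# rows = "active", "diligent", "disciplined"; columns = boys, girls
observed = np.array([[10,  5],
                     [ 7, 14],
                     [12, 11]])

# Expected frequencies are computed as row sum * column sum / grand total,
# exactly as described above; df = (3 - 1) * (2 - 1) = 2
stat, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {stat:.2f}, df = {dof}, p-value = {p_value:.3f}")
print(expected.round(2))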


Conclusion.

K. Pearson made significant contributions to the development of mathematical statistics, introducing a large number of fundamental concepts. Pearson's main philosophical position can be formulated as follows: the concepts of science are artificial constructions, means of describing and ordering sensory experience; the rules for combining them into scientific propositions are isolated by the grammar of science, which is the philosophy of science. The universal discipline of applied statistics allows us to connect disparate concepts and phenomena, although according to Pearson it is subjective.

Many of K. Pearson's constructions were developed directly on, or with the use of, anthropological material. He developed numerous methods of numerical classification and statistical tests that are used in all areas of science.



Until the late 19th century, the normal distribution was considered the universal law of data variation. However, K. Pearson noted that empirical frequencies can differ greatly from the normal distribution. The question arose of how to prove this: what was required was not a graphical comparison, which is subjective, but a rigorous quantitative justification.

This is how the χ² (chi-square) test was invented: it tests the significance of the discrepancy between empirical (observed) and theoretical (expected) frequencies. That happened back in 1900, but the test is still in use today. Moreover, it has been adapted to a wide range of problems, first of all the analysis of categorical data, i.e. data expressed not as quantities but as membership in some category: the class of a car, the gender of an experiment participant, the species of a plant, etc. Mathematical operations such as addition and multiplication cannot be applied to such data; only frequencies can be counted for them.

We denote the observed frequencies by O (Observed) and the expected ones by E (Expected). As an example, take the result of rolling a die 60 times. If the die is symmetric and uniform, the probability of any side is 1/6, and so the expected count for each side is 10 (1/6 · 60). We write the observed and expected frequencies in a table and draw a histogram.

The null hypothesis is that the frequencies are consistent, that is, the actual data do not contradict the expected. The alternative hypothesis is that the deviations in frequencies go beyond random fluctuations and the discrepancies are statistically significant. To draw a rigorous conclusion, we need:

  1. A summary measure of the discrepancy between observed and expected frequencies.
  2. The distribution of this measure when the hypothesis of no differences is true.

Let's start with a measure of the distance between frequencies. If we simply take the difference O − E, such a measure depends on the scale of the data (the frequencies). For example, 20 − 5 = 15 and 1020 − 1005 = 15: in both cases the difference is 15, but in the first case the deviation is three times the expected frequency, while in the second it is only 1.5% of it. We need a relative measure that does not depend on scale.

Let us note the following facts. In general, the number of categories over which frequencies are counted can be much larger, so the probability that a single observation falls into any particular category is quite small. If so, the distribution of the count in a category obeys the law of rare events, known as Poisson's law. In Poisson's law, as is known, the mathematical expectation and the variance coincide (the parameter λ). This means that the expected frequency E_i of a category of the nominal variable is simultaneously its variance. Moreover, Poisson's law tends to the normal as the number of observations grows. Combining these two facts, we find that if the hypothesis of agreement between observed and expected frequencies is correct, then, for a large number of observations, the expression

(O_i − E_i) / √E_i

has approximately the standard normal distribution.

It is important to remember that normality appears only at sufficiently high frequencies. In statistics it is generally accepted that the total number of observations (the sum of the frequencies) should be at least 50 and the expected frequency in each category at least 5. Only then does the quantity shown above have a standard normal distribution. Let us assume this condition is met.

The standard normal distribution has almost all of its values within ±3 (the three-sigma rule). We have thus obtained the relative deviation of the frequency in one category; now we need a summary measure. We cannot simply add up all the deviations O − E: we would get 0 (guess why). Pearson suggested adding up the squares of these deviations:

χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ

This is the Pearson chi-square statistic. If the frequencies really correspond to the expected ones, its value will be relatively small (most deviations lie near zero); a large value, on the other hand, indicates significant differences between the frequencies.

The Pearson statistic is considered "large" when obtaining such a value, or an even greater one, is unlikely. To calculate that probability, one must know how the statistic is distributed when the experiment is repeated many times and the hypothesis of frequency agreement is true.

As is easy to see, the chi-square value also depends on the number of terms: the more there are, the larger the statistic should be, because each term contributes to the total. Therefore, for each number of independent terms there is a distribution of its own; it turns out that χ² is a whole family of distributions.

And here we come to a delicate point. What is the number of independent terms? It would seem that every term (i.e. every deviation) is independent. K. Pearson thought so too, but he turned out to be wrong. In fact, the number of independent terms is one less than the number of categories of the nominal variable, n. Why? Because in a sample for which the sum of frequencies has been calculated, one of the frequencies can always be determined as the difference between the total and the sum of all the others, so the variation is somewhat smaller. Ronald Fisher noticed this fact 20 years after Pearson developed his test. Even the tables had to be redone.

On this occasion Fisher introduced a new concept into statistics: degrees of freedom, the number of independent terms in the sum. The concept of degrees of freedom has a mathematical explanation and appears only in distributions associated with the normal (Student's, Fisher-Snedecor, and chi-square itself).

To better grasp the meaning of degrees of freedom, consider a physical analogue. Imagine a point moving freely in space: it has 3 degrees of freedom, because it can move in any direction in three-dimensional space. A point moving along a surface has two degrees of freedom (back and forth, left and right), although it is still in three-dimensional space. A point moving along a spring is again in three-dimensional space but has only one degree of freedom: it can move only forward or backward. As you can see, the space in which an object is located does not always correspond to its real freedom of movement.

In roughly the same way, the distribution of a test statistic may depend on fewer elements than the number of terms needed to calculate it. In general, the number of degrees of freedom is less than the number of observations by the number of existing dependencies.

Thus, the chi-square distribution (χ²) is a family of distributions, each depending on the degrees-of-freedom parameter. Formally: the χ² (chi-square) distribution with k degrees of freedom is the distribution of the sum of squares of k independent standard normal random variables.
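This definition is easy to check by simulation. A quick sketch in Python (numpy and scipy assumed):

import numpy as np
from scipy.stats import chi2

k = 5                         # degrees of freedom
rng = np.random.default_rng(0)

# Sum of squares of k independent standard normal variables
samples = (rng.standard_normal((100_000, k)) ** 2).sum(axis=1)

# Empirical mean and variance should be close to k and 2k
print(samples.mean(), samples.var())   # ~5 and ~10
print(chi2.mean(k), chi2.var(k))       # exactly 5 and 10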

Next, we could move on to the formula itself by which the chi-square distribution function is calculated, but, fortunately, everything has long been calculated for us. To obtain the probability of interest, you can use either the appropriate statistical table or a ready-made function in Excel.

It is interesting to see how the shape of the chi-square distribution changes depending on the number of degrees of freedom.

With increasing degrees of freedom the chi-square distribution tends to the normal. This is explained by the central limit theorem: the sum of a large number of independent identically distributed random variables (here, squared normals) has an approximately normal distribution.

Testing the hypothesis using the Pearson chi-square test

Now we come to testing hypotheses using the chi-square method. In general the procedure is as before. The null hypothesis is that the observed frequencies correspond to the expected ones (i.e., there is no difference between them, because they come from the same population). If this is so, the scatter will be relatively small, within random fluctuations. The measure of scatter is the chi-square statistic. It is then either compared with the critical value (for the corresponding significance level and degrees of freedom) or, more correctly, the observed p-value is calculated: the probability of obtaining the same or an even greater value of the statistic if the null hypothesis is true.

Because we are interested in agreement of frequencies, the hypothesis is rejected when the statistic exceeds the critical level; i.e., the test is one-sided. However, occasionally it is necessary to test a left-sided hypothesis, for example when empirical data are suspiciously similar to the theoretical data. Then the statistic may fall into the unlikely region on the left. Under natural conditions it is unlikely to obtain frequencies that practically coincide with theoretical ones: there is always some randomness producing an error. If there is no such error, the data may have been falsified. Still, it is usually the right-sided hypothesis that is tested.

Let's return to the dice problem. Let us calculate the value of the chi-square test using the available data.

Now let's find the critical value for k = 5 degrees of freedom and significance level α = 0.05 from the table of critical values of the chi-square distribution.

That is, the right-tail 0.05 quantile of the chi-square distribution with 5 degrees of freedom is χ²(0.05; 5) = 11.1.

Let's compare the actual and tabulated values: 3.4 (χ²) < 11.1 (χ²(0.05; 5)). The calculated statistic is smaller, which means the hypothesis of equality (agreement) of frequencies is not rejected. In the figure, the situation looks like this.

If the calculated value fell within the critical region, the null hypothesis would be rejected.
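The table of observed counts is not preserved in this text, so the Python sketch below uses an assumed set of 60 rolls chosen to reproduce the stated χ² = 3.4; the comparison with the critical value is the same as in the figure:

from scipy.stats import chi2, chisquare

# Assumed counts for faces 1..6 (illustrative; total = 60),
# chosen so that the statistic equals the stated 3.4
observed = [7, 12, 8, 12, 13, 8]
expected = [10] * 6                 # 60 * (1/6) per face

stat, p_value = chisquare(observed, f_exp=expected)
critical = chi2.isf(0.05, 5)        # right-tail 0.05 quantile, 11.07
print(f"chi-square = {stat:.1f}, critical = {critical:.1f}")
# 3.4 < 11.1 -> the hypothesis of agreement is not rejected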

It would be more correct to also calculate the p-value. To do this, one finds the closest value in the table for the given number of degrees of freedom and reads off the corresponding significance level. But that is last century; we will use a computer, in particular MS Excel. Excel has several functions related to chi-square.

Below is a brief description of them.

CH2.OBR (English Excel: CHISQ.INV) – critical value of the statistic for a given left-tail probability (as in statistical tables).

CH2.OBR.PH (CHISQ.INV.RT) – critical value of the statistic for a given right-tail probability. The function essentially duplicates the previous one, but here you can specify the level α directly rather than subtracting it from 1. This is more convenient, because in most cases it is the right tail of the distribution that is needed.

CH2.DIST (CHISQ.DIST) – left-tail probability (the density can also be calculated).

CH2.DIST.PH (CHISQ.DIST.RT) – p-value on the right.

CHI2.TEST (CHISQ.TEST) – performs the chi-square test on two frequency ranges at once, taking the number of degrees of freedom to be one less than the number of frequencies in the column (as it should be) and returning a p-value.

Let's calculate for our experiment the critical (tabular) value for 5 degrees of freedom and alpha 0.05. The Excel formula will look like this:

CH2.OBR(0.95;5)

CH2.OBR.PH(0.05;5)

The result will be the same - 11.0705. This is the value we see in the table (rounded to 1 decimal place).

Let us finally calculate the p-value for the statistic χ² = 3.4 with 5 degrees of freedom. We need the right-tail probability, so we take the function with the .PH suffix (right tail):

CH2.DIST.PH(3.4;5) = 0.63857

This means that with 5 degrees of freedom the probability of obtaining a value of χ² = 3.4 or greater is almost 64%. Naturally, the hypothesis is not rejected (the p-value is greater than 5%); the frequencies agree very well.
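Outside Excel, the same numbers can be obtained, for example, with scipy (the names below are scipy's, not Excel's):

from scipy.stats import chi2

print(chi2.ppf(0.95, 5))    # left-tail inverse, like CH2.OBR(0.95;5)     -> 11.0705
print(chi2.isf(0.05, 5))    # right-tail inverse, like CH2.OBR.PH(0.05;5) -> 11.0705
print(chi2.cdf(3.4, 5))     # left-tail probability, like CH2.DIST
print(chi2.sf(3.4, 5))      # right-tail p-value, like CH2.DIST.PH        -> 0.63857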

Now let's check the hypothesis about the agreement of frequencies using the chi-square test and the Excel function CHI2.TEST.

No tables, no cumbersome calculations. By specifying columns with observed and expected frequencies as function arguments, we immediately obtain the p-value. Beauty.

Now imagine that you are playing dice with a suspicious guy. The distribution of points from 1 to 5 remains the same, but he rolls 26 sixes (the total number of throws becomes 78).

The p-value in this case turns out to be 0.003, which is much less than 0.05. There are good reasons to doubt the validity of the dice. Here's what that probability looks like on a chi-square distribution chart.

The chi-square statistic itself here is 17.8, which is naturally greater than the tabulated value (11.1).
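With the same assumed counts for faces 1-5 as in the sketch above, plus the 26 sixes, the scenario looks like this in code:

from scipy.stats import chisquare

# Faces 1..5 as before (assumed), plus 26 sixes; 78 rolls in total
observed = [7, 12, 8, 12, 13, 26]
expected = [78 / 6] * 6             # 13 expected per face

stat, p_value = chisquare(observed, f_exp=expected)
print(f"chi-square = {stat:.1f}, p-value = {p_value:.4f}")
# chi-square = 17.8, p-value = 0.003 -> the die is very likely unfair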

I hope I have managed to explain what the χ² (Pearson chi-square) goodness-of-fit test is and how it can be used to test statistical hypotheses.

Finally, once again about an important condition! The chi-square test works properly only when the sum of all frequencies exceeds 50 and the expected frequency in each category is at least 5. If in some category the expected frequency is less than 5 but the sum of all frequencies exceeds 50, that category is combined with the nearest one so that their combined frequency is at least 5. If this is impossible, or the sum of frequencies is less than 50, more accurate methods of hypothesis testing should be used; we will talk about them another time.


Let U₁, U₂, …, U_k be independent standard normal variables. The distribution of the random variable K = U₁² + U₂² + … + U_k² is called the chi-square distribution with k degrees of freedom (written K ~ χ²(k)). It is a unimodal distribution with positive skewness and the following characteristics: mode M = k − 2 (for k ≥ 2), expected value m = k, variance D = 2k (Fig.). For sufficiently large k, the distribution χ²(k) is approximately normal with parameters (k, 2k).

When solving problems of mathematical statistics, critical points of χ²(k) are used, depending on a given probability α and the number of degrees of freedom k (Appendix 2). The critical point χ²_cr = χ²(k; α) is the boundary of the region to the right of which lies 100α% of the area under the distribution density curve: the probability that the random variable K ~ χ²(k) falls to the right of the point χ²_cr does not exceed α, i.e. P(K ≥ χ²_cr) ≤ α. For example, for the random variable K ~ χ²(20) take α = 0.05. Using the table of critical points of the chi-square distribution, we find χ²_cr = χ²(20; 0.05) = 31.4. This means that the probability that this random variable K takes a value greater than 31.4 is less than 0.05 (Fig.).

Fig. Density of the χ²(k) distribution for different values of the number of degrees of freedom k
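The same critical point is easy to check in code (scipy assumed):

from scipy.stats import chi2

alpha, k = 0.05, 20
x_cr = chi2.isf(alpha, k)    # critical point with right-tail probability alpha
print(round(x_cr, 1))        # 31.4
print(chi2.sf(x_cr, k))      # 0.05, i.e. P(K >= x_cr) = alpha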

Critical points of χ²(k) are used, for example, when checking for multicollinearity.

Testing a hypothesis with the chi-square test answers only the question "is there a relationship?"; further research is needed to establish the direction of the relationship. Moreover, the chi-square test has a certain error when working with low-frequency data.

Therefore, to check the direction of the relationship, correlation analysis is used, in particular testing a hypothesis with the Pearson correlation coefficient, with a subsequent significance test using the t-test.

For any significance level α, the value Χ² can be found using the MS Excel function =HI2OBR(α; degrees of freedom) (the legacy CHIINV, a right-tail inverse; in current Excel, CHISQ.INV.RT).

Table of critical values of the χ² distribution. Rows: degrees of freedom (n − 1); columns: upper-tail probability.

n−1 .995 .990 .975 .950 .900 .750 .500 .250 .100 .050 .025 .010 .005
1 0.00004 0.00016 0.00098 0.00393 0.01579 0.10153 0.45494 1.32330 2.70554 3.84146 5.02389 6.63490 7.87944
2 0.01003 0.02010 0.05064 0.10259 0.21072 0.57536 1.38629 2.77259 4.60517 5.99146 7.37776 9.21034 10.59663
3 0.07172 0.11483 0.21580 0.35185 0.58437 1.21253 2.36597 4.10834 6.25139 7.81473 9.34840 11.34487 12.83816
4 0.20699 0.29711 0.48442 0.71072 1.06362 1.92256 3.35669 5.38527 7.77944 9.48773 11.14329 13.27670 14.86026
5 0.41174 0.55430 0.83121 1.14548 1.61031 2.67460 4.35146 6.62568 9.23636 11.07050 12.83250 15.08627 16.74960
6 0.67573 0.87209 1.23734 1.63538 2.20413 3.45460 5.34812 7.84080 10.64464 12.59159 14.44938 16.81189 18.54758
7 0.98926 1.23904 1.68987 2.16735 2.83311 4.25485 6.34581 9.03715 12.01704 14.06714 16.01276 18.47531 20.27774
8 1.34441 1.64650 2.17973 2.73264 3.48954 5.07064 7.34412 10.21885 13.36157 15.50731 17.53455 20.09024 21.95495
9 1.73493 2.08790 2.70039 3.32511 4.16816 5.89883 8.34283 11.38875 14.68366 16.91898 19.02277 21.66599 23.58935
10 2.15586 2.55821 3.24697 3.94030 4.86518 6.73720 9.34182 12.54886 15.98718 18.30704 20.48318 23.20925 25.18818
11 2.60322 3.05348 3.81575 4.57481 5.57778 7.58414 10.34100 13.70069 17.27501 19.67514 21.92005 24.72497 26.75685
12 3.07382 3.57057 4.40379 5.22603 6.30380 8.43842 11.34032 14.84540 18.54935 21.02607 23.33666 26.21697 28.29952
13 3.56503 4.10692 5.00875 5.89186 7.04150 9.29907 12.33976 15.98391 19.81193 22.36203 24.73560 27.68825 29.81947
14 4.07467 4.66043 5.62873 6.57063 7.78953 10.16531 13.33927 17.11693 21.06414 23.68479 26.11895 29.14124 31.31935
15 4.60092 5.22935 6.26214 7.26094 8.54676 11.03654 14.33886 18.24509 22.30713 24.99579 27.48839 30.57791 32.80132
16 5.14221 5.81221 6.90766 7.96165 9.31224 11.91222 15.33850 19.36886 23.54183 26.29623 28.84535 31.99993 34.26719
17 5.69722 6.40776 7.56419 8.67176 10.08519 12.79193 16.33818 20.48868 24.76904 27.58711 30.19101 33.40866 35.71847
18 6.26480 7.01491 8.23075 9.39046 10.86494 13.67529 17.33790 21.60489 25.98942 28.86930 31.52638 34.80531 37.15645
19 6.84397 7.63273 8.90652 10.11701 11.65091 14.56200 18.33765 22.71781 27.20357 30.14353 32.85233 36.19087 38.58226
20 7.43384 8.26040 9.59078 10.85081 12.44261 15.45177 19.33743 23.82769 28.41198 31.41043 34.16961 37.56623 39.99685
21 8.03365 8.89720 10.28290 11.59131 13.23960 16.34438 20.33723 24.93478 29.61509 32.67057 35.47888 38.93217 41.40106
22 8.64272 9.54249 10.98232 12.33801 14.04149 17.23962 21.33704 26.03927 30.81328 33.92444 36.78071 40.28936 42.79565
23 9.26042 10.19572 11.68855 13.09051 14.84796 18.13730 22.33688 27.14134 32.00690 35.17246 38.07563 41.63840 44.18128
24 9.88623 10.85636 12.40115 13.84843 15.65868 19.03725 23.33673 28.24115 33.19624 36.41503 39.36408 42.97982 45.55851
25 10.51965 11.52398 13.11972 14.61141 16.47341 19.93934 24.33659 29.33885 34.38159 37.65248 40.64647 44.31410 46.92789
26 11.16024 12.19815 13.84390 15.37916 17.29188 20.84343 25.33646 30.43457 35.56317 38.88514 41.92317 45.64168 48.28988
27 11.80759 12.87850 14.57338 16.15140 18.11390 21.74940 26.33634 31.52841 36.74122 40.11327 43.19451 46.96294 49.64492
28 12.46134 13.56471 15.30786 16.92788 18.93924 22.65716 27.33623 32.62049 37.91592 41.33714 44.46079 48.27824 50.99338
29 13.12115 14.25645 16.04707 17.70837 19.76774 23.56659 28.33613 33.71091 39.08747 42.55697 45.72229 49.58788 52.33562
30 13.78672 14.95346 16.79077 18.49266 20.59923 24.47761 29.33603 34.79974 40.25602 43.77297 46.97924 50.89218 53.67196
Number of degrees of freedom k | Significance level α
0.01 0.025 0.05 0.95 0.975 0.99
1 6.6 5.0 3.8 0.0039 0.00098 0.00016
2 9.2 7.4 6.0 0.103 0.051 0.020
3 11.3 9.4 7.8 0.352 0.216 0.115
4 13.3 11.1 9.5 0.711 0.484 0.297
5 15.1 12.8 11.1 1.15 0.831 0.554
6 16.8 14.4 12.6 1.64 1.24 0.872
7 18.5 16.0 14.1 2.17 1.69 1.24
8 20.1 17.5 15.5 2.73 2.18 1.65
9 21.7 19.0 16.9 3.33 2.70 2.09
10 23.2 20.5 18.3 3.94 3.25 2.56
11 24.7 21.9 19.7 4.57 3.82 3.05
12 26.2 23.3 21.0 5.23 4.40 3.57
13 27.7 24.7 22.4 5.89 5.01 4.11
14 29.1 26.1 23.7 6.57 5.63 4.66
15 30.6 27.5 25.0 7.26 6.26 5.23
16 32.0 28.8 26.3 7.96 6.91 5.81
17 33.4 30.2 27.6 8.67 7.56 6.41
18 34.8 31.5 28.9 9.39 8.23 7.01
19 36.2 32.9 30.1 10.1 8.91 7.63
20 37.6 34.2 31.4 10.9 9.59 8.26
21 38.9 35.5 32.7 11.6 10.3 8.90
22 40.3 36.8 33.9 12.3 11.0 9.54
23 41.6 38.1 35.2 13.1 11.7 10.2
24 43.0 39.4 36.4 13.8 12.4 10.9
25 44.3 40.6 37.7 14.6 13.1 11.5
26 45.6 41.9 38.9 15.4 13.8 12.2
27 47.0 43.2 40.1 16.2 14.6 12.9
28 48.3 44.5 41.3 16.9 15.3 13.6
29 49.6 45.7 42.6 17.7 16.0 14.3
30 50.9 47.0 43.8 18.5 16.8 15.0

Pearson (chi-squared), Student and Fisher distributions

Using the normal distribution, three distributions are defined that are now often used in statistical data processing. These distributions appear many times in later sections of the book.

The Pearson (chi-square) distribution is the distribution of the random variable

χ² = X₁² + X₂² + … + X_n²,

where the random variables X₁, X₂, …, X_n are independent and each has the distribution N(0, 1). The number of terms, n, is called the "number of degrees of freedom" of the chi-square distribution.

The chi-square distribution is used when estimating the variance (via a confidence interval), when testing hypotheses of agreement, homogeneity, and independence, primarily for qualitative (categorical) variables taking a finite number of values, and in many other problems of statistical data analysis.
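As a sketch of the first application, a confidence interval for the variance of a normal sample, here is the standard (n − 1)s²/χ² interval on synthetic data (an illustration, not a prescription):

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(42)
x = rng.normal(loc=0, scale=2, size=30)   # synthetic sample, true variance 4
n, alpha = len(x), 0.05
s2 = x.var(ddof=1)                        # sample variance

# (n - 1) * s2 / sigma^2 ~ chi2(n - 1), which yields the interval below
lower = (n - 1) * s2 / chi2.ppf(1 - alpha / 2, n - 1)
upper = (n - 1) * s2 / chi2.ppf(alpha / 2, n - 1)
print(f"95% CI for the variance: ({lower:.2f}, {upper:.2f})")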

Student's t distribution is the distribution of the random variable

t = U / √(X/n),

where the random variables U and X are independent, U has the standard normal distribution N(0, 1), and X has the chi-square distribution with n degrees of freedom. Here n is called the "number of degrees of freedom" of the Student distribution.

The Student distribution was introduced in 1908 by the English statistician W. Gosset, who worked at a brewery. Probabilistic-statistical methods were used there for making economic and technical decisions, so its management forbade Gosset to publish scientific articles under his own name; in this way trade secrets and the "know-how" embodied in the methods he developed were protected. He was, however, able to publish under the pseudonym "Student." The Gosset-Student story shows that even a hundred years ago British managers were aware of the great economic efficiency of probabilistic-statistical methods.

Currently the Student distribution is one of the best-known distributions used in analyzing real data. It is applied when estimating the mathematical expectation, a forecast value, and other characteristics via confidence intervals, and when testing hypotheses about the values of mathematical expectations, regression coefficients, sample homogeneity, etc.
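For instance, a confidence interval for the mathematical expectation can be sketched as follows (synthetic data; scipy assumed):

import numpy as np
from scipy.stats import t

rng = np.random.default_rng(7)
x = rng.normal(loc=10, scale=3, size=25)  # synthetic sample
n = len(x)
mean, se = x.mean(), x.std(ddof=1) / np.sqrt(n)

# (mean - mu) / se has the t(n - 1) distribution under normality
t_crit = t.ppf(0.975, n - 1)              # two-sided 95% interval
print(f"95% CI for the mean: ({mean - t_crit * se:.2f}, {mean + t_crit * se:.2f})")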

The Fisher distribution is the distribution of the random variable

F = (X₁/k₁) / (X₂/k₂),

where the random variables X₁ and X₂ are independent and have chi-square distributions with k₁ and k₂ degrees of freedom, respectively. The pair (k₁, k₂) is the pair of "degrees of freedom" of the Fisher distribution: k₁ is the number of degrees of freedom of the numerator and k₂ that of the denominator. The distribution of the random variable F is named after the great English statistician R. Fisher (1890-1962), who actively used it in his work.

The Fisher distribution is used when testing hypotheses about model adequacy in regression analysis, about equality of variances, and in other problems of applied statistics.
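A sketch of one such problem, testing the equality of two variances by their ratio, on synthetic data:

import numpy as np
from scipy.stats import f

rng = np.random.default_rng(1)
x = rng.normal(scale=2.0, size=20)        # synthetic samples with equal
y = rng.normal(scale=2.0, size=25)        # true variances

# Under equal true variances the ratio of sample variances is F(k1, k2)
F = x.var(ddof=1) / y.var(ddof=1)
k1, k2 = len(x) - 1, len(y) - 1
p_value = 2 * min(f.sf(F, k1, k2), f.cdf(F, k1, k2))   # two-sided
print(f"F = {F:.2f}, p-value = {p_value:.3f}")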

Expressions for the chi-square, Student and Fisher distribution functions, their densities and characteristics, as well as the tables necessary for their practical use, can be found in the specialized literature (see, for example,).

23. The chi-square and Student distributions: concept and graphical form

1) A distribution (chi-square) with n degrees of freedom is the distribution of the sum of squares of n independent standard normal random variables.

The χ² (chi-square) distribution is the distribution of the random variable

χ² = X₁² + X₂² + … + X_n²,

where the random variables X₁, …, X_n are independent and have the same standard normal distribution (the mathematical expectation of each is 0 and the standard deviation is 1). The number of terms, n, is called the "number of degrees of freedom" of the chi-square distribution; the distribution is determined by this single parameter. As the number of degrees of freedom increases, the distribution slowly approaches the normal.

Then the sum of their squares

χ² = X₁² + X₂² + … + X_n²

is a random variable distributed according to the so-called chi-square law with k = n degrees of freedom; if the terms are connected by some relation (for example, their sum is fixed), then the number of degrees of freedom is k = n − 1.

The density of this distribution is

f(x) = x^(k/2 − 1) e^(−x/2) / (2^(k/2) Γ(k/2)) for x > 0, and f(x) = 0 for x ≤ 0.

Here Γ is the gamma function; in particular, Γ(n + 1) = n!

Therefore, the chi-square distribution is determined by one parameter - the number of degrees of freedom k.

Remark 1. As the number of degrees of freedom increases, the chi-square distribution gradually approaches normal.

Remark 2. The chi-square distribution is used to define many other distributions encountered in practice, for example the distribution of the length of a random vector (X₁, X₂, …, X_n) whose coordinates are independent and normally distributed.

The χ² distribution was first considered by F. R. Helmert (1876) and K. Pearson (1900).

Mathematical expectation = n; variance D = 2n.

2) Student distribution

Consider two independent random variables: Z, which has the normalized normal distribution (that is, M(Z) = 0, σ(Z) = 1), and V, which is distributed according to the chi-square law with k degrees of freedom. Then the quantity

t = Z / √(V/k)

has a distribution called the t-distribution, or Student distribution, with k degrees of freedom. Here k is called the "number of degrees of freedom" of the Student distribution.

As the number of degrees of freedom increases, the Student distribution quickly approaches normal.
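A quick simulation check of this definition (numpy and scipy assumed):

import numpy as np
from scipy.stats import chi2, t

k = 6
rng = np.random.default_rng(3)

z = rng.standard_normal(100_000)                    # Z ~ N(0, 1)
v = chi2.rvs(df=k, size=100_000, random_state=rng)  # V ~ chi2(k)
samples = z / np.sqrt(v / k)                        # t = Z / sqrt(V / k)

# A few empirical quantiles against the exact t(k) quantiles
for q in (0.90, 0.95, 0.99):
    print(q, round(np.quantile(samples, q), 2), round(t.ppf(q, k), 2))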

