Testing the hypothesis that the mean of the general population is equal to some given value. Testing the hypothesis about the equality of the means of two normal distributions with known variances

Among the most important summary characteristics about which hypotheses are put forward is the mean. To test the hypothesis that the means of two general populations are equal, a null hypothesis must be formulated. As a rule, it is assumed that both samples are drawn from normally distributed general populations with the same mathematical expectation μ and the same variance σ². If this assumption is correct, then x̄1 ≈ x̄2. In reality, the sample means x̄1 and x̄2 will not coincide exactly because of the randomness of sampling. It is therefore necessary to assess the significance of the difference between x̄1 and x̄2: does it stay within the limits of possible random variation, or does it go beyond them? The task of testing the hypothesis thus reduces to testing the significance of the difference between the sample means.

Each sample mean has its own error of the mean: m1 = s1/√n1 and m2 = s2/√n2.

Having determined the variances and the errors of the sample means, one can calculate the actual value of the test statistic and compare it with the critical (tabular) value at the chosen significance level and the corresponding number of degrees of freedom (for samples with n > 30 the z-test of the normal distribution is used, and for samples with n < 30, Student's t-test).

The actual value of the t-test is determined by the formula

t_fact = (x̄1 − x̄2) / √(m1² + m2²).

If the sample value of the criterion falls into the critical region (t_fact > t_α), the null hypothesis about the equality of the means is rejected; if the sample value of the criterion falls into the region of admissible values (t_fact < t_α), the null hypothesis is accepted.

The null hypothesis that the means of the two populations are equal can also be tested by comparing the actual difference of the sample means (ε_fact = x̄1 − x̄2) with the limiting (marginal) random error at a given significance level (ε_α). If the actual difference between the sample means lies within the random error (ε_fact < ε_α), the null hypothesis is accepted. If the actual difference between the means goes beyond the limits of the random error (ε_fact > ε_α), the null hypothesis is rejected.

When solving specific problems of testing statistical hypotheses about means, the following points should be taken into account: 1) the sampling scheme (independent or dependent samples); 2) equality or inequality of the sample sizes; 3) equality or inequality of the variances of the general populations.

The algorithm for testing the hypothesis about two means changes somewhat if the variances of the samples (S1² and S2²) differ significantly. In this case, a correction is introduced when determining the number of degrees of freedom:

When, in addition to the variances, the sample sizes (n1 and n2) are also unequal, the tabular value of Student's t-test should be calculated using the formula

where t1 and t2 are the tabular values of Student's t-test taken with n1 − 1 and n2 − 1 degrees of freedom, respectively.

Consider an example of testing a statistical hypothesis about the equality of two means for independent samples of equal size (n1 = n2) and equal variances (σ1² = σ2²).

Thus, suppose there are data on the live weight of calves at birth for two groups of Black-and-White cows of the same age. The first group of cows had a normal lactation length (305 days), while the second group was milked for 320 days. Each group included 5 cows. The observations are given in Table 7.2.

Table 7.2. Live weight of calves at birth by groups of cows with different lactation lengths

A comparison of the live weights of calves in the two groups of cows shows that the higher live weight is observed in cows of group I, which had a normal lactation length. However, because the sample size is small (n = 5), the possibility cannot be ruled out that the differences in live weight arose from random causes.

It is necessary to statistically evaluate the difference between the averages for the two groups of cows.

Based on the results of testing the hypothesis, we must conclude either that the difference between the means lies within the limits of random fluctuation, or that this difference is so large that it is inconsistent with the null hypothesis about the random nature of the differences between the means.

If the second position is proved and the first is rejected, it can be argued that the duration of lactation affects the live weight of calves.

The condition of the problem assumes that both samples are taken from a normally distributed general population. The formation of groups is random (independent), so the difference between the means should be evaluated.

Let's determine the average live weight of calves for two groups of cows:

The actual difference between the means is:

The significance of this difference must be assessed. To do this, it is necessary to test the hypothesis that the two means are equal.

Let us consider in detail all the stages of the hypothesis testing scheme. 1. Let us formulate the null hypothesis H0 and the alternative hypothesis Ha:

2. Let's take the significance level α = 0.05, which guarantees that the hypothesis is accepted or rejected with a probability of error in only 5 cases out of 100.

3. The most powerful criterion for testing a hypothesis H0 of this kind is Student's t-test.

4. Let us formulate the decision rule based on the results of testing H0. Since, according to the alternative hypothesis, x̄1 may be either less than or greater than x̄2, the critical region must be set on both sides: t ≤ −t_α and t ≥ t_α, or, in short, |t| ≥ t_α.

This form of specifying the criterion is called a two-sided critical region. At α = 0.05 the critical region comprises all values above the upper 2.5% point and below the lower 2.5% point of Student's t-distribution.

In view of the above, the conclusion from testing H0 can be formulated as follows: the hypothesis H0 is rejected if the actual value of the t-test exceeds the tabular value, that is, if t_fact > t_α. Otherwise H0 must be accepted.

5. To test H0, one needs to determine the actual value of Student's t-test and compare it with the tabular value.

To determine the actual value of Student's t-test, we perform the following calculations.

6. Calculate for each sample the variance corrected for the loss of degrees of freedom (the unbiased sample variance). To do this, we first square the values x_1i and x_2i:

7. Calculate the squared errors of the means for each sample and the generalized error of the difference of the means:

8. Calculate the actual value of Student's t-test:

9. Determine the tabular value of Student's t-test, based on the significance level α = 0.05 and the number of degrees of freedom for the two samples:

From the table "Critical points of Student's distribution" (Appendix 3) we find, for α = 0.05 and k = 8: t_0.05 = 2.31.

10. Let's compare the actual and tabular values of Student's t-test:

Since t_fact < t_0.05 (the sample value of the criterion lies in the region of admissible values), the null hypothesis about the equality of the means of the general populations is accepted.

Thus, the effect of lactation length on the live weight of calves at birth is not confirmed.

However, attention should be paid to an essential point: in all observations of the experiment, the live weight of calves at birth is higher in the first group of cows, which have a normal lactation length. Therefore, instead of the alternative hypothesis Ha: x̄1 ≠ x̄2, another one can be taken. Since there is no reason to believe that with a normal lactation length the live weight of calves will be lower, a more appropriate form of the alternative hypothesis is obviously Ha: x̄1 > x̄2.

Then the critical region, which makes up 0.05 of the entire area under the distribution curve, will be located only on one (right) side, since negative deviations of the live weight are considered incompatible with the conditions of the problem. In this connection, the tabular value of the t-test should be determined at a doubled significance level (i.e. at 2α = 2 × 0.05 = 0.10). The rule for testing the hypothesis is formulated as follows: the null hypothesis is rejected if t_fact > t_2α.

This form of the critical region is called one-sided. The one-sided test is more sensitive (it reduces the risk of a Type II error), but its use is permissible only if the validity of such an alternative hypothesis is justified.

From the tables (Appendix 3) we find the tabular value of the t-test at α = 0.10 and k = 8: t_0.10 = 1.86.

Thus, when the one-sided test is used, the null hypothesis is rejected, i.e. the criterion falls into the critical region (t_fact > t_0.10; 2.14 > 1.86). Consequently, the live weight of calves at birth in the group of cows with normal lactation length is significantly higher. This conclusion is more precise than the one obtained with the two-sided test, since here we use additional information to justify the use of a one-sided criterion.

The same conclusion can be obtained by comparing the possible marginal error of the two samples ε_α with the actual difference between the means.

Let us calculate the possible marginal error of the difference between the averages for two samples:

Comparing the marginal possible error with the actual difference of the means, we reach a similar conclusion: the hypothesis of the equality of the means does not agree with the results obtained.
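To reproduce this kind of calculation programmatically, a minimal Python sketch of the same independent two-sample t-test is given below. The calf-weight arrays are hypothetical placeholders (Table 7.2 is not reproduced in the text), and scipy is assumed to be available.

```python
import numpy as np
from scipy import stats

# Hypothetical live weights of calves (kg) for the two groups of cows;
# substitute the actual values from Table 7.2.
group1 = np.array([32.0, 30.5, 33.0, 31.5, 32.5])   # normal lactation (305 days)
group2 = np.array([29.5, 30.0, 28.5, 31.0, 29.0])   # extended lactation (320 days)

# Two-sided test of H0: mu1 == mu2 (equal variances assumed, as in the example)
t_fact, p_two_sided = stats.ttest_ind(group1, group2, equal_var=True)

# Tabular (critical) values for k = n1 + n2 - 2 degrees of freedom
k = len(group1) + len(group2) - 2
t_crit_two_sided = stats.t.ppf(1 - 0.05 / 2, k)   # two-sided, alpha = 0.05
t_crit_one_sided = stats.t.ppf(1 - 0.05, k)       # one-sided, alpha = 0.05 (= two-sided value at 2*alpha)

print(f"t_fact = {t_fact:.2f}, two-sided p = {p_two_sided:.3f}")
print(f"critical values: two-sided {t_crit_two_sided:.2f}, one-sided {t_crit_one_sided:.2f}")
# One-sided p-value for Ha: mu1 > mu2 is p_two_sided / 2 when t_fact > 0.
```

For k = 8 degrees of freedom the critical values come out to about 2.31 (two-sided) and 1.86 (one-sided), matching the Appendix 3 values used above.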

We will consider testing the hypothesis for the case of dependent samples with equal numbers and equal variances using the following example.

Thus, suppose there are sample data on the productivity of mother cows and their daughters (Table 7.3).

Table 7.3. Productivity of mother and daughter cows

It is necessary to test a statistical hypothesis about the mean difference between pairs of related observations in the population.

Since the observations in the two samples are pairwise related (dependent samples), it is necessary to compare not the difference between the means but the mean of the differences between the pairs of observations (d̄). Let us go through all the stages of the hypothesis testing procedure. 1. Let's formulate the null and alternative hypotheses:

With this alternative, a two-tailed test must be applied.

2. We take the significance level equal to a = 0.05.

3. The most powerful test for H0 is Student's t-test.

4. Calculate the average difference

5. Calculate the adjusted variance of the mean difference:

6. Determine the mean error of the mean difference:

7. Calculate the actual value of Student's t-test:

8. Set the number of degrees of freedom based on the number of pairs of interrelated differences:

9. Let's find the tabular value of Student's t-test for k = 4 and α = 0.05: t_0.05 = 2.78 (Appendix 3).

10. Let's compare the actual and tabular value of the criterion:

The actual value of the criterion exceeds the tabular one. Therefore, the mean difference between the milk yields in the two samples is significant, and the null hypothesis is rejected.

We get the same conclusions by comparing the possible marginal error with the actual average difference:

The marginal error shows that as a result of random variation, the average difference can reach 2.4 c. The actual average difference is higher:

Thus, according to the results of the study, it can be asserted with a high degree of probability that the differences between the mean milk yields of mother cows and daughter cows are real (significant).
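A minimal Python sketch of the same paired (dependent-samples) test is given below; the milk-yield pairs are hypothetical placeholders, since Table 7.3 is not reproduced in the text, and scipy is assumed.

```python
import numpy as np
from scipy import stats

# Hypothetical milk yields (centners) for mother-daughter pairs; replace with Table 7.3 data.
mothers   = np.array([30.2, 28.5, 33.1, 29.8, 31.4])
daughters = np.array([34.0, 31.2, 36.5, 33.9, 35.1])

d = daughters - mothers                    # pairwise differences D_i
d_mean = d.mean()                          # mean difference
s_d = d.std(ddof=1)                        # corrected standard deviation of the differences
t_fact = d_mean / (s_d / np.sqrt(len(d)))  # Student's t for dependent samples

# The same result in one call:
t_check, p_two_sided = stats.ttest_rel(daughters, mothers)

k = len(d) - 1                             # degrees of freedom = number of pairs - 1
t_crit = stats.t.ppf(1 - 0.05 / 2, k)      # tabular value, alpha = 0.05, two-sided
print(f"t_fact = {t_fact:.2f} (check {t_check:.2f}), t_crit = {t_crit:.2f}, p = {p_two_sided:.3f}")
```

For k = 4 the critical value is about 2.78, as in step 9 above.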

3. TESTING THE HYPOTHESIS OF EQUALITY OF MEANS

This test is used to check whether the means of two indicators represented by samples differ significantly. There are three variants of the test: one for paired (dependent) samples and two for unpaired (independent) samples (with equal and with unequal variances). If the samples are independent, the hypothesis of equality of variances must be tested first in order to decide which of the criteria to use. As in the case of comparing variances, there are two ways to solve the problem, which we will consider using an example.

EXAMPLE 3. There are data on the number of sales of a product in two cities. At a significance level of 0.01, test the statistical hypothesis that the average number of sales of the product differs between the cities.

23 25 23 22 23 24 28 16 18 23 29 26 31 19
22 28 26 26 35 20 27 28 28 26 22 29

We use the Data Analysis package (Analysis ToolPak). Depending on the type of test, one of three tools is selected: "Paired two-sample t-test for means" for dependent samples, and "Two-sample t-test with equal variances" or "Two-sample t-test with unequal variances" for independent samples. Call the test with equal variances; in the window that opens, enter references to the data in the "Variable 1 range" and "Variable 2 range" fields (A1-N1 and A2-L2, respectively). If the data have labels, check the "Labels" box (we have none, so the box is left unchecked). Next, enter the significance level 0.01 in the "Alpha" field. Leave the "Hypothesized mean difference" field blank. In the "Output options" section, select "Output range", place the cursor in the field next to it and left-click cell B7; the result will be output starting from this cell. Clicking "OK" produces a table of results. Move the borders between columns B and C, C and D, D and E, widening columns B, C and D so that all the labels fit. The procedure displays the main characteristics of the samples, the t-statistic, its critical values and the critical significance levels "P(T<=t) one-tailed" and "P(T<=t) two-tailed". If the absolute value of the t-statistic is less than the critical value, the means are equal with the given probability. In our case |−1.784242592| < 2.492159469, so the average number of sales does not differ significantly. Note that if the significance level α = 0.05 were taken, the results of the study would be quite different.



Two-sample t-test with equal variances

Mean	23,57142857	26,41666667
Variance	17,34065934	15,35606061
Observations	14	12
Pooled variance	16,43105159
Hypothesized mean difference	0
df	24
t-statistic	-1,784242592
P(T<=t) one-tailed	0,043516846
t critical one-tailed	2,492159469
P(T<=t) two-tailed	0,087033692
t critical two-tailed	2,796939498
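For comparison, here is a minimal Python sketch that reproduces this analysis outside Excel (scipy assumed); the sales data are the two rows given above.

```python
import numpy as np
from scipy import stats

city1 = np.array([23, 25, 23, 22, 23, 24, 28, 16, 18, 23, 29, 26, 31, 19])
city2 = np.array([22, 28, 26, 26, 35, 20, 27, 28, 28, 26, 22, 29])

# Two-sample t-test assuming equal variances (the pooled test used above)
t_stat, p_two_tailed = stats.ttest_ind(city1, city2, equal_var=True)

df = len(city1) + len(city2) - 2
t_crit_one = stats.t.ppf(1 - 0.01, df)       # one-tailed critical value, alpha = 0.01
t_crit_two = stats.t.ppf(1 - 0.01 / 2, df)   # two-tailed critical value, alpha = 0.01

print(f"t = {t_stat:.6f}, two-tailed p = {p_two_tailed:.6f}")
print(f"critical values (df={df}): one-tailed {t_crit_one:.6f}, two-tailed {t_crit_two:.6f}")
# Expected to agree with the worksheet: t ≈ -1.784, p(two-tailed) ≈ 0.087,
# critical values ≈ 2.492 and ≈ 2.797.
```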

Lab #3

PAIR LINEAR REGRESSION

Purpose: To master the methods of constructing a linear pair regression equation using a computer, to learn how to obtain and analyze the main characteristics of the regression equation.

Consider the technique for constructing a regression equation using an example.

EXAMPLE. Samples of the factors x_i and y_i are given. Based on these samples, find the linear regression equation ỹ = ax + b and the pair correlation coefficient. At the significance level α = 0.05, check the regression model for adequacy.

X 0 1 2 3 4 5 6 7 8 9
Y 6,7 6,3 4,4 9,5 5,2 4,3 7,7 7,1 7,1 7,9

To find the coefficients a and b of the regression equation, use the SLOPE and INTERCEPT functions from the "Statistical" category. Enter the label "a =" in A5, and in the adjacent cell B5 enter the SLOPE function; in the "Known_y's" field give a reference to cells B2-K2 (circling them with the mouse), and in the "Known_x's" field a reference to B1-K1. The result is 0.14303. Now find the coefficient b: enter the label "b =" in A6, and in B6 the INTERCEPT function with the same arguments as SLOPE. The result is 5.976364. Hence the linear regression equation is y = 0.14303x + 5.976364.

Let's plot the regression line. To do this, in the third row of the table we enter the values of the fitted function at the given points X (first row). To obtain these values, use the TREND function of the "Statistical" category. Enter the label "Y(X)" in A3 and, placing the cursor in B3, call the TREND function. In the "Known_y's" and "Known_x's" fields give references to B2-K2 and B1-K1; in the "New_x's" field also enter a reference to B1-K1. In the "Const" field enter 1 if the regression equation has the form y = ax + b, and 0 if y = ax; in our case we enter 1. TREND is an array function, so to display all its values select the range B3-K3 and press F2 and then Ctrl+Shift+Enter. The result is the values of the regression equation at the given points. Now build a chart: place the cursor in any free cell, call the Chart Wizard, select the "Scatter" category with the type "line without markers" (lower right corner), click "Next", and in the "Range" field enter a reference to B3-K3. Go to the "Series" tab, enter a reference to B1-K1 in the "X values" field and click "Finish". The result is the regression line. Let's see how the plots of the experimental data and of the regression equation differ. Place the cursor in any free cell, call the Chart Wizard, select the "Graph" category with the type "line with markers" (second from the top left), click "Next", and in the "Range" field enter a reference to the second and third rows, B2-K3. Go to the "Series" tab, enter a reference to B1-K1 in the "X-axis labels" field and click "Finish". The result is two lines (blue for the original data, red for the regression equation). It can be seen that the lines differ little from each other.

a= 0,14303
b= 5,976364

The PEARSON function is used to calculate the correlation coefficient r_xy. Arrange the charts so that they lie above row 25; in A25 enter the label "Correlation", and in B25 call the PEARSON function, entering references to the original data B1-K1 and B2-K2 in its "Array 1" and "Array 2" fields. The result is 0.265207. The coefficient of determination R_xy is the square of the correlation coefficient r_xy. In A26 enter the label "Determination", and in B26 the formula "=B25*B25". The result is 0.070335.
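The same coefficients can be cross-checked outside Excel; below is a minimal Python sketch (scipy assumed) using the data from rows 1 and 2.

```python
import numpy as np
from scipy import stats

x = np.arange(10)                                   # 0, 1, ..., 9
y = np.array([6.7, 6.3, 4.4, 9.5, 5.2, 4.3, 7.7, 7.1, 7.1, 7.9])

res = stats.linregress(x, y)
print(f"a (slope)     = {res.slope:.5f}")      # ≈ 0.14303
print(f"b (intercept) = {res.intercept:.6f}")  # ≈ 5.976364
print(f"r             = {res.rvalue:.6f}")     # ≈ 0.265207
print(f"R^2           = {res.rvalue**2:.6f}")  # ≈ 0.070335
```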

However, there is a single function in Excel that calculates all the basic characteristics of a linear regression: the LINEST function. Put the cursor in B28 and call the LINEST function, category "Statistical". In the "Known_y's" and "Known_x's" fields give references to B2-K2 and B1-K1. The "Const" field has the same meaning as for the TREND function; here it is 1. The "Stats" field must contain 1 if full regression statistics are to be displayed; in our case we put 1 there. The function returns an array of 2 columns and 5 rows. After entering it, select cells B28-C32 with the mouse and press F2 and then Ctrl+Shift+Enter. The result is a table of values whose entries have the following meaning:



Coefficient a	Coefficient b
Standard error m_a	Standard error m_b
Coefficient of determination R²	Standard error of y
F-statistic	Degrees of freedom n − 2
Regression sum of squares	Residual sum of squares

0,14303 5,976364
0,183849 0,981484
0,070335 1,669889
0,60525 8
1,687758 22,30824

Analysis of the result: the first row contains the coefficients of the regression equation; compare them with those calculated by the SLOPE and INTERCEPT functions. The second row contains the standard errors of the coefficients; if one of them is greater in absolute value than the coefficient itself, that coefficient is considered zero (insignificant). The coefficient of determination characterizes the quality of the relationship between the factors; the obtained value of 0.070335 indicates a very weak relationship. The F-statistic tests the hypothesis that the regression model is adequate. This number must be compared with the critical value; to obtain it, enter the label "F-critical" in E33 and in F33 the FINV function, whose arguments are, respectively, 0.05 (significance level), 1 (the number of factors x) and 8 (degrees of freedom).

F-critical 5,317655

It can be seen that the F-statistic is less than the F-critical value, which means that the regression model is not adequate. The last row contains the regression sum of squares and the residual sum of squares. It is important that the regression sum (explained by the regression) be much larger than the residual sum (not explained by the regression, caused by random factors). In our case this condition is not met, which indicates a poor regression.
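A short Python cross-check of this adequacy test (scipy assumed); the F-statistic is taken from the LINEST table above.

```python
from scipy import stats

f_stat = 0.60525                      # F-statistic from the LINEST output above
f_crit = stats.f.ppf(1 - 0.05, 1, 8)  # critical value for alpha = 0.05, df1 = 1, df2 = 8

print(f"F = {f_stat}, F-critical = {f_crit:.6f}")   # F-critical ≈ 5.317655
print("adequate" if f_stat > f_crit else "not adequate")
```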

Conclusion: In the course of work, I mastered the methods of constructing a linear pair regression equation using a computer, learned to obtain and analyze the main characteristics of the regression equation.


Lab #4

NONLINEAR REGRESSION

Purpose: to master the methods for constructing the main types of nonlinear pair regression equations with the help of a computer (internally linear models), to learn how to obtain and analyze the quality indicators of regression equations.

Let us consider the case when nonlinear models can be reduced to linear ones using data transformation (internally linear models).

EXAMPLE. Construct a regression equation y = f(x) for the sample (x_i, y_i), i = 1, 2, …, 10. As f(x), consider four types of functions: linear, power, exponential and hyperbolic:

y = Ax + B;  y = Ax^B;  y = Ae^(Bx);  y = A/x + B.

It is necessary to find their coefficients A and B, and comparing the quality indicators, choose the function that best describes the dependence.

Profit Y 0,3 1,2 2,8 5,2 8,1 11,0 16,8 16,9 24,7 29,4
Profit X 0,25 0,50 0,75 1,00 1,25 1,50 1,75 2,00 2,25 2,50

Enter the data into the table together with the labels (cells A1-K2). Leave three free rows below the table for the transformed data; select the first five rows by dragging along the left gray border over the numbers 1 to 5 and choose a light colour (yellow or pink) for the cell background. Then, starting from A6, output the parameters of the linear regression. To do this, enter the label "Linear" in cell A6 and the LINEST function in the adjacent cell B6. In the "Known_y's" and "Known_x's" fields give references to B1-K1 and B2-K2, and set the next two fields to 1. Then select the area 5 rows down and 2 columns wide starting from B6 and press F2 and Ctrl+Shift+Enter. The result is a table of regression parameters, in which the coefficient of determination is the third value from the top in the first column. In our case it equals R1 = 0.951262. The value of the F-test, which allows the adequacy of the model to be checked, is F1 = 156.1439

(fourth row, first column). The regression equation is

y = 12.96x − 6.18 (the coefficients a and b are in cells B6 and C6).

Linear 12,96 -6,18
1,037152 1,60884
0,951262 2,355101
156,1439 8
866,052 44,372

Let us determine the same characteristics for the other regressions and, by comparing the coefficients of determination, find the best regression model. Consider the hyperbolic regression. To obtain it, transform the data: in the third row, enter the label "1/x" in cell A3 and the formula "=1/B2" in cell B3, then stretch this cell by autofill over the range B3-K3. Now obtain the characteristics of the regression model. In cell A12 enter the label "Hyperbola", and in the adjacent cell the LINEST function. In the "Known_y's" and "Known_x's" fields give references to B1-K1 and to the transformed argument data B3-K3; set the next two fields to 1. Then select the area 5 rows down and 2 columns wide and press F2 and Ctrl+Shift+Enter. We obtain the table of regression parameters. The coefficient of determination in this case is R2 = 0.475661, which is much worse than for the linear regression. The F-statistic is F2 = 7.257293. The regression equation is y = −6.25453/x + 18.96772.

Hyperbola -6,25453 18,96772
2,321705 3,655951
0,475661 7,724727
7,257293 8
433,0528 477,3712

Consider the exponential regression y = ae^(bx). Taking logarithms linearizes it to ỹ = ãx + b̃, where ỹ = ln y, ã = b, b̃ = ln a. Thus the data transformation required is to replace y with ln y. Put the cursor in cell A4 and enter the heading "ln y". Put the cursor in B4 and enter the LN formula (category "Mathematical") with a reference to B1 as its argument. Autofill extends the formula along the fourth row to cells B4-K4. Next, in cell F6 enter the label "Exponential" and in the adjacent G6 the LINEST function, whose arguments are the transformed data B4-K4 (in the "Known_y's" field), while the remaining fields are the same as for the linear regression (B2-K2, 1, 1). Then select cells G6-H10 and press F2 and Ctrl+Shift+Enter. The result is R3 = 0.89079, F3 = 65.25304, which indicates a fairly good regression. To recover the coefficients of the regression equation (a = e^b̃, b = ã), put the cursor in J6 and enter the heading "a=", and in the adjacent K6 the formula "=EXP(H6)"; in J7 enter the heading "b=", and in K7 the formula "=G6". The regression equation is y = 0.511707·e^(1.824212x).

Exponential 1,824212 -0,67 a= 0,511707
0,225827 0,350304 b= 6,197909
0,89079 0,512793
65,25304 8
17,15871 2,103652

Consider the power regression y = ax^b. Taking logarithms linearizes it to ỹ = ãx̃ + b̃, where ỹ = ln y, x̃ = ln x, ã = b, b̃ = ln a. Thus the data transformation required is to replace y with ln y and x with ln x. We already have a row with ln y, so transform the variable x: in cell A5 enter the label "ln x", and in B5 the LN formula (category "Mathematical") with a reference to B2 as its argument. Autofill extends the formula along the fifth row to cells B5-K5. Next, in cell F12 enter the label "Power" and in the adjacent G12 the LINEST function, whose arguments are the transformed data B4-K4 (in the "Known_y's" field) and B5-K5 (in the "Known_x's" field); the remaining fields are ones. Then select cells G12-H16 and press F2 and Ctrl+Shift+Enter. The result is R4 = 0.997716, F4 = 3494.117, which indicates a very good regression. To recover the coefficients of the regression equation (a = e^b̃, b = ã), put the cursor in J12 and enter the heading "a=", and in the adjacent K12 the formula "=EXP(H12)"; in J13 enter the heading "b=", and in K13 the formula "=G12". The regression equation is y = 4.90767·x^1.993512.

Power 1,993512 1,590799 a= 4,90767
0,033725 0,023823 b= 7,341268
0,997716 0,074163
3494,117 8
19,21836 0,044002

Let's check whether all the equations describe the data adequately. To do this, the F-statistic of each model must be compared with the critical value. To obtain it, enter the label "F-critical" in A21 and the FINV function in B21, with the arguments 0.05 (significance level), 1 (degrees of freedom 1, the number of factors x) and 8 (degrees of freedom 2 = n − 2). The result is 5.317655. Since the F-statistic of the linear model exceeds the F-critical value, the model is adequate; the remaining regressions are adequate as well. To determine which model describes the data best, compare the determination indices R1, R2, R3, R4 of the models. The largest is R4 = 0.997716. This means that the experimental data are best described by the power function y = 4.90767·x^1.993512.
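Below is a minimal Python sketch of the same comparison: each internally linear model is fitted by a linear regression on transformed data, and the models are ranked by the determination coefficient of the linearized fit (numpy and scipy assumed).

```python
import numpy as np
from scipy import stats

y = np.array([0.3, 1.2, 2.8, 5.2, 8.1, 11.0, 16.8, 16.9, 24.7, 29.4])
x = np.array([0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50])

fits = {
    "linear":      stats.linregress(x, y),                  # y = a*x + b
    "hyperbola":   stats.linregress(1 / x, y),               # y = a/x + b
    "exponential": stats.linregress(x, np.log(y)),           # ln y = b*x + ln a
    "power":       stats.linregress(np.log(x), np.log(y)),   # ln y = b*ln x + ln a
}

for name, res in fits.items():
    # R^2 of the linearized model, as in the LINEST tables above
    print(f"{name:12s} slope = {res.slope:9.5f}  intercept = {res.intercept:9.5f}  R^2 = {res.rvalue**2:.6f}")

best = max(fits, key=lambda k: fits[k].rvalue ** 2)
print("best model:", best)   # expected: power (R^2 ≈ 0.997716)
```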

Conclusion: In the course of my work, I mastered the methods for constructing the main types of nonlinear pair regression equations with the help of a computer (internally linear models), learned how to obtain and analyze the quality indicators of regression equations.

Y 0,3 1,2 2,8 5,2 8,1 11 16,8 16,9 24,7 29,4
X 0,25 0,5 0,75 1 1,25 1,5 1,75 2 2,25 2,5
1/x 4 2 1,333333 1 0,8 0,666667 0,571429 0,5 0,444444 0,4
ln y -1,20397 0,182322 1,029619 1,648659 2,0918641 2,397895 2,821379 2,827314 3,206803 3,380995
ln x -1,38629 -0,69315 -0,28768 0 0,2231436 0,405465 0,559616 0,693147 0,81093 0,916291
Linear 12,96 -6,18 Exponential 1,824212 -0,67 a= 0,511707
1,037152 1,60884 0,225827 0,350304 b= 6,197909
0,951262 2,355101 0,89079 0,512793
156,1439 8 65,25304 8
866,052 44,372 17,15871 2,103652
Hyperbola -6,25453 18,96772 Power 1,993512 1,590799 a= 4,90767
2,321705 3,655951 0,033725 0,023823 b= 7,341268
0,475661 7,724727 0,997716 0,074163
7,257293 8 3494,117 8
433,0528 477,3712 19,21836 0,044002
F - critical 5,317655

Lab #5

POLYNOMIAL REGRESSION

Purpose: Based on experimental data, construct a regression equation of the form y = ax² + bx + c.

WORKING PROCESS:

The dependence of the yield of a certain crop y_i on the amount of mineral fertilizer x_i applied to the soil is considered. It is assumed that this dependence is quadratic. It is necessary to find a regression equation of the form ỹ = ax² + bx + c.

x 0 1 2 3 4 5 6 7 8 9
y 29,8 58,8 72,2 101,5 141 135,1 156,6 181,7 216,6 208,2

Enter these data into a spreadsheet together with the labels in cells A1-K2. Build a chart: select the Y data (cells B2-K2), call the Chart Wizard, select the "Graph" chart type with markers (second from the top left), click "Next", go to the "Series" tab, enter a reference to B1-K1 in the "X-axis labels" field and click "Finish". The graph can be approximated by a second-degree polynomial y = ax² + bx + c. To find the coefficients a, b, c, the following system of normal equations has to be solved:

a·Σx⁴ + b·Σx³ + c·Σx² = Σx²y,
a·Σx³ + b·Σx² + c·Σx = Σxy,
a·Σx² + b·Σx + c·n = Σy.

Let's calculate the sums. In cell A3 enter the label "X^2" and in B3 the formula "=B1*B1", and autofill it along the row B3-K3. In cell A4 enter the label "X^3" and in B4 the formula "=B1*B3", and autofill the row B4-K4. In cell A5 enter "X^4" and in B5 the formula "=B4*B1", and autofill the row. In cell A6 enter "X*Y" and in B6 the formula "=B2*B1", and autofill the row. In cell A7 enter "X^2*Y" and in B7 the formula "=B3*B2", and autofill the row. Now compute the sums. Highlight column L with a different colour by clicking on its heading and choosing a colour. Place the cursor in cell L1 and click the AutoSum button (the ∑ icon) to compute the sum of the first row. Autofill transfers the formula to cells L1-L7.

We now solve the system of equations. To do this, we introduce the main matrix of the system. In cell A13 we enter the signature "A =", and in the cells of the matrix B13-D15 we enter the links reflected in the table

	B	C	D
13	=L5	=L4	=L3
14	=L4	=L3	=L1
15	=L3	=L1	=9

We also introduce the right parts of the system of equations. In G13 we enter the signature "B =", and in H13-H15 we enter, respectively, links to the cells "=L7", "=L6", "=L2". We solve the system by the matrix method. From higher mathematics it is known that the solution is equal to A -1 B. We find the inverse matrix. To do this, in cell J13, enter the signature "A arr." and, by placing the cursor in K13, we set the MIND formula (category "Mathematical"). As an argument "Array" we give a reference to cells B13: D15. The result should also be a 4x4 matrix. To get it, circle cells K13-M15 with the mouse, select them and press F2 and Ctrl + Shift + Enter. The result is matrix A -1 . Let us now find the product of this matrix and column B (cells H13-H15). We enter the caption “Coefficients” in cell A18 and in B18 we set the function MULTIPLE (category “Mathematical”). The arguments of the "Array 1" function is a reference to the matrix A -1 (cells K13-M15), and in the "Array 2" field we give a link to column B (cells H13-H16). Next, select B18-B20 and press F2 and Ctrl+Shift+Enter. The resulting array is the coefficients of the regression equation a, b, c. As a result, we obtain a regression equation of the form: y \u003d 1.201082x 2 - 5.619177x + 78.48095.

Let's plot the graphs of the original data and of the values obtained from the regression equation. To do this, enter the label "Regression" in cell A8 and the formula "=$B$18*B3+$B$19*B1+$B$20" in B8; autofill transfers the formula to cells B8-K8. To build the chart, select cells B8-K8 and, holding down the Ctrl key, also select cells B2-K2. Call the Chart Wizard, select the "Graph" chart type with markers (second from the top left), click "Next", go to the "Series" tab, enter a reference to B1-K1 in the "X-axis labels" field and click "Finish". It can be seen that the curves almost coincide.
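As a cross-check, the same normal-equations approach can be sketched in Python (numpy assumed): the matrix is built from the same sums as in column L, and numpy.polyfit gives an independent least-squares fit for comparison. This is an illustrative sketch, not part of the lab procedure.

```python
import numpy as np

x = np.arange(10, dtype=float)
y = np.array([29.8, 58.8, 72.2, 101.5, 141, 135.1, 156.6, 181.7, 216.6, 208.2])

# Normal equations for y = a*x^2 + b*x + c, built from the same sums as in column L
A = np.array([
    [np.sum(x**4), np.sum(x**3), np.sum(x**2)],
    [np.sum(x**3), np.sum(x**2), np.sum(x)],
    [np.sum(x**2), np.sum(x),    len(x)],      # last cell is the number of observations (n = 10)
])
B = np.array([np.sum(x**2 * y), np.sum(x * y), np.sum(y)])

a, b, c = np.linalg.solve(A, B)        # equivalent to the MINVERSE + MMULT steps
print(f"a = {a:.6f}, b = {b:.6f}, c = {c:.6f}")

# Independent check with the built-in polynomial least-squares fit
print(np.polyfit(x, y, 2))             # returns [a, b, c] for a*x^2 + b*x + c
```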

CONCLUSION: in the course of the work, I learned to construct, from experimental data, a regression equation of the form y = ax² + bx + c.






x 0 1 2 3 4 5 6 7 8 9
y 29,8 58,8 72,2 101,5 141 135,1 156,6 181,7 216,6 208,2
X^2 0 1 4 9 16 25 36 49 64 81
X^3 0 1 8 27 64 125 216 343 512 729
X^4 0 1 16 81 256 625 1296 2401 4096 6561
X*Y 0 58,8 144,4 304,5 564 675,5 939,6 1271,9 1732,8 1873,8
X^2*Y 0 58,8 288,8 913,5 2256 3377,5 5637,6 8903,3 13862,4 16864,2
Regression. 78,48095 85,30121 94,52364 106,1482 120,175 136,6039 155,435 176,6682 200,3036 226,3412
A= 15333 2025 285 B= 52162,1 A Rev. 0,003247 -0,03247 0,059524
2025 285 45 7565,3 -0,03247 0,341342 -0,67857
285 45 9 1301,5 0,059524 -0,67857 1,619048
Coefficients	a = 1,201082
	b = 5,619177
	c = 78,48095

Comparison of the averages of two populations is of great practical importance. In practice, there is often a case when the average result of one series of experiments differs from the average result of another series. In this case, the question arises whether the observed discrepancy between the averages can be explained by the inevitable random errors of the experiment, or whether it is caused by certain regularities. In industry, the task of comparing averages often arises when sampling the quality of products manufactured on different installations or under different technological regimes, in financial analysis - when comparing the level of profitability of various assets, etc.

Let us formulate the problem. Suppose there are two populations characterized by general means μ1 and μ2 and known variances σ1² and σ2². It is necessary to test the hypothesis of equality of the general means, i.e. H0: μ1 = μ2. To test the hypothesis, two independent samples of sizes n1 and n2 were taken from these populations, for which the arithmetic means x̄1 and x̄2 and the sample variances s1² and s2² were found. For sufficiently large sample sizes, the sample means x̄1 and x̄2 have approximately normal distributions N(μ1; σ1²/n1) and N(μ2; σ2²/n2), respectively. If the hypothesis H0 is true, the difference x̄1 − x̄2 has a normal distribution with mathematical expectation 0 and variance σ1²/n1 + σ2²/n2.

Therefore, when the hypothesis H0 holds, the statistic

Z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)

has the standard normal distribution N(0; 1).
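A minimal Python sketch of this z-test with known variances is given below; the sample summaries are illustrative placeholders, and scipy is assumed.

```python
import math
from scipy.stats import norm

def z_test_two_means(mean1, mean2, var1, var2, n1, n2, alpha=0.05):
    """Two-sided z-test of H0: mu1 == mu2 with known population variances."""
    z = (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)
    z_crit = norm.ppf(1 - alpha / 2)          # critical value of N(0, 1)
    p_value = 2 * norm.sf(abs(z))             # two-sided p-value
    return z, z_crit, p_value

# Illustrative numbers (not from the text)
z, z_crit, p = z_test_two_means(mean1=20.1, mean2=19.4, var1=4.0, var2=5.0, n1=50, n2=60)
print(f"z = {z:.3f}, critical = {z_crit:.3f}, p = {p:.4f}, reject H0: {abs(z) > z_crit}")
```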

Testing hypotheses about numerical values ​​of parameters

Hypotheses about numerical values arise in various problems. Let X be the value of some parameter of items produced by an automatic machine line, and let a0 be the given nominal value of this parameter. Each individual value of the parameter may, of course, deviate somewhat from the nominal value. Obviously, to check that the machine is set up correctly, one must make sure that the mean value of the parameter for the items produced on it corresponds to the nominal value, i.e. test the hypothesis H0: a = a0 against one of the alternatives H1: a ≠ a0, H1: a > a0 or H1: a < a0.

With an arbitrary setting of the machine, it may be necessary to test the hypothesis that the accuracy of manufacturing with respect to the given parameter, characterized by the variance, equals a specified value σ0², i.e. H0: σ² = σ0²; or, for example, that the proportion of defective items produced by the machine equals a given value p0, i.e. H0: p = p0; and so on.

Similar problems may arise, for example, in financial analysis, when it must be established from sample data whether the return of an asset of a certain type or of a portfolio of securities, or its risk, can be considered equal to a given number; or when, based on the results of a sample audit of similar documents, one must check whether the percentage of errors made can be considered equal to a nominal value, and so on.

In the general case, hypotheses of this type have the form H0: θ ∈ Θ0, where θ is a certain parameter of the distribution under study and Θ0 is the set of its hypothesized values, which in the particular case consists of a single value.
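As an illustration, a hypothesis about the numerical value of a mean can be checked with a one-sample t-test; the sketch below uses illustrative measurements and a nominal value a0 (scipy assumed).

```python
import numpy as np
from scipy import stats

# Illustrative measurements of a product parameter and its nominal value a0
measurements = np.array([10.2, 9.9, 10.1, 10.4, 9.8, 10.0, 10.3, 9.7])
a0 = 10.0

# H0: the population mean equals a0 (two-sided alternative)
t_stat, p_value = stats.ttest_1samp(measurements, popmean=a0)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
print("reject H0" if p_value < 0.05 else "no reason to reject H0")
```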

8.1. The concept of dependent and independent samples.

The choice of a criterion for testing a hypothesis is primarily determined by whether the samples under consideration are dependent or independent. Let us introduce the corresponding definitions.

Def. The samples are called independent, if the procedure for selecting units in the first sample is in no way connected with the procedure for selecting units in the second sample.

An example of two independent samples is the above-discussed samples of men and women working in the same enterprise (in the same industry, etc.).

Note that the independence of two samples does not mean that there is no requirement of a certain similarity between them (their homogeneity). Thus, when studying the income level of men and women, we would hardly allow a situation in which the men are selected from among Moscow businessmen and the women from among the Aborigines of Australia; the women should also be Muscovites and, moreover, "businesswomen". But here we are talking not about the dependence of the samples, but about the requirement of homogeneity of the studied set of objects, which must be satisfied both in the collection and in the analysis of sociological data.

Def. The samples are called dependent, or paired, if each unit of one sample is "tied" to a specific unit of the second sample.

The last definition will probably become clearer if we give an example of dependent samples.

Suppose we want to find out whether the social status of the father is, on average, lower than the social status of the son (we believe that we can measure this complex and ambiguous social characteristic of a person). It seems obvious that in such a situation it is expedient to select pairs of respondents (father, son) and assume that each element of the first sample (one of the fathers) is “tied” to a certain element of the second sample (his son). These two samples will be called dependent.

8.2. Hypothesis testing for independent samples

For independent samples, the choice of the criterion depends on whether we know the general variances σ1² and σ2² of the feature under consideration for the studied populations. We will consider this problem solved, assuming that the sample variances coincide with the general ones. In this case, the criterion is the statistic:

z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2).   (8.1)

Before proceeding to a discussion of the situation when the general variances (or at least one of them) are unknown to us, we note the following.

The logic of using criterion (8.1) is similar to the one described when we considered the chi-square criterion (7.2). There is only one fundamental difference. Speaking about the meaning of criterion (7.2), we considered an infinite number of samples of size n "drawn" from our general population. Here, analyzing the meaning of criterion (8.1), we pass to the consideration of an infinite number of pairs of samples of sizes n1 and n2. For each pair, a statistic of the form (8.1) is calculated. The set of values of such statistics obtained corresponds, in accordance with our notation, to the normal distribution (as agreed, the letter z is used to designate a criterion that corresponds to the normal distribution).

So, if the general variances are unknown to us, we are forced to use their sample estimates s1² and s2² instead. However, in this case the normal distribution should be replaced by Student's distribution: z should be replaced by t (as in the analogous situation of constructing a confidence interval for the mathematical expectation). Nevertheless, for sufficiently large sample sizes (n1, n2 ≥ 30), as we already know, Student's distribution practically coincides with the normal one. In other words, with large samples we may continue to use the criterion:

z = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2).   (8.2)

The situation is more complicated when both the variances are unknown and the size of at least one sample is small. Then another factor comes into play. The type of criterion depends on whether we can consider the unknown variances of the considered feature in the two analyzed samples to be equal. To find out, we need to test the hypothesis:

H0: σ1² = σ2².   (8.3)

To test this hypothesis, the following criterion is used:

F = s1² / s2²,   (8.4)

where the larger of the two sample variances is placed in the numerator.
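A minimal Python sketch of this variance-ratio test, with illustrative data (scipy assumed):

```python
import numpy as np
from scipy import stats

# Illustrative samples
sample1 = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2])
sample2 = np.array([11.2, 12.9, 10.8, 13.1, 11.5, 12.7, 10.9, 13.0])

s1, s2 = np.var(sample1, ddof=1), np.var(sample2, ddof=1)
f_stat = max(s1, s2) / min(s1, s2)                   # larger variance in the numerator
df_num = (len(sample1) if s1 >= s2 else len(sample2)) - 1
df_den = (len(sample2) if s1 >= s2 else len(sample1)) - 1

f_crit = stats.f.ppf(1 - 0.05 / 2, df_num, df_den)   # two-sided test at alpha = 0.05
print(f"F = {f_stat:.3f}, F-critical = {f_crit:.3f}")
print("variances differ" if f_stat > f_crit else "no reason to reject equality of variances")
```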

The specifics of using this criterion will be discussed below; for now let us continue discussing the algorithm for choosing a criterion for testing hypotheses about the equality of mathematical expectations.

If hypothesis (8.3) is rejected, then the criterion of interest to us takes the form:

t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)   (8.5)

(i.e., it differs from criterion (8.2), used for large samples, in that the corresponding statistic has not a normal but a Student's distribution). If hypothesis (8.3) is accepted, then the form of the criterion changes:

t = (x̄1 − x̄2) / (s·√(1/n1 + 1/n2)),   where   s² = ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2).   (8.6)

Let us sum up how the criterion for testing the hypothesis of the equality of general mathematical expectations is chosen on the basis of two independent samples: 1) if the general variances are known, criterion (8.1) is used; 2) if the variances are unknown and the sample sizes are large, criterion (8.2); 3) if the variances are unknown, the samples are small and the hypothesis H0: σ1² = σ2² is rejected, criterion (8.5); 4) if this hypothesis is accepted, criterion (8.6).
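The selection logic just summarized can be sketched as a small Python helper (scipy assumed). This is an illustrative sketch under the summary's assumptions, not a library routine; criterion (8.5) is implemented here through Welch's form of the t-test.

```python
import numpy as np
from scipy import stats

def compare_means(x, y, var1=None, var2=None, alpha=0.05):
    """Choose and apply a criterion for H0: mu1 == mu2 for two independent samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)

    if var1 is not None and var2 is not None:          # (8.1): general variances known
        z = (x.mean() - y.mean()) / np.sqrt(var1 / n1 + var2 / n2)
        return "z (8.1)", z, 2 * stats.norm.sf(abs(z))

    if n1 >= 30 and n2 >= 30:                          # (8.2): large samples
        z = (x.mean() - y.mean()) / np.sqrt(x.var(ddof=1) / n1 + y.var(ddof=1) / n2)
        return "z (8.2)", z, 2 * stats.norm.sf(abs(z))

    # Small samples: first test H0: sigma1^2 == sigma2^2 with the F-ratio (8.4)
    s1, s2 = x.var(ddof=1), y.var(ddof=1)
    f = max(s1, s2) / min(s1, s2)
    df1, df2 = (n1 - 1, n2 - 1) if s1 >= s2 else (n2 - 1, n1 - 1)
    equal_var = f <= stats.f.ppf(1 - alpha / 2, df1, df2)

    # (8.6) if the variances may be considered equal, otherwise (8.5) (Welch's form)
    t, p = stats.ttest_ind(x, y, equal_var=equal_var)
    return ("t (8.6)" if equal_var else "t (8.5)"), t, p

# Example with illustrative data
a = [23.1, 24.0, 22.5, 25.2, 23.8]
b = [26.0, 25.4, 27.1, 26.8, 25.9]
print(compare_means(a, b))
```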

8.3. Hypothesis testing for dependent samples

Let's move on to considering dependent samples. Let sequences of numbers

X 1 , X 2 , … , X n ;

Y 1 , Y 2 , … , Y n –

these are the values of the random variable under consideration for the elements of the two dependent samples. Let us introduce the notation:

D i = X i - Y i , i = 1, ... , n.

For dependent samples, the criterion that allows one to test the hypothesis of the equality of the means (i.e., that the mathematical expectation of the differences D_i equals zero) looks as follows:

t_{n−1} = D̄ / (s_D / √n),   where   D̄ = ΣD_i / n   and   s_D = √( (ΣD_i² − (ΣD_i)²/n) / (n − 1) ).

Note that the expression just given for s_D is nothing but another form of the well-known formula for the standard deviation, in this case the standard deviation of the values D_i. Such a formula is often used in practice as a simpler way of computing it than the "head-on" calculation of the sum of squared deviations of the values from their arithmetic mean.
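A compact Python sketch of this dependent-samples criterion, computing s_D by the same computational formula (the paired data are illustrative placeholders, scipy assumed):

```python
import numpy as np
from scipy import stats

# Illustrative paired observations (e.g., fathers and sons)
x = np.array([52.0, 48.5, 55.0, 50.5, 47.0, 53.5])
y = np.array([55.5, 50.0, 57.5, 54.0, 49.0, 56.5])

d = x - y
n = len(d)
d_mean = d.sum() / n
s_d = np.sqrt((np.sum(d**2) - d.sum()**2 / n) / (n - 1))  # computational form of the std of D_i
t = d_mean / (s_d / np.sqrt(n))                           # t with n - 1 degrees of freedom

t_check, p = stats.ttest_rel(x, y)                        # scipy equivalent
print(f"t = {t:.3f} (scipy: {t_check:.3f}), two-sided p = {p:.4f}")
```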

If we compare the above formulas with those used when discussing the principles of constructing a confidence interval, it is easy to see that testing the hypothesis of the equality of means for dependent samples is essentially testing that the mathematical expectation of the values D_i equals zero. The value s_D/√n is the standard deviation (standard error) of D̄. Therefore, the value of the criterion t_{n−1} just described is essentially the value of D̄ expressed in fractions of its standard deviation. As said above (when discussing methods for constructing confidence intervals), this indicator can be used to judge how probable the observed value of D̄ is. The difference is that above we were talking about a simple arithmetic mean, which is normally distributed, whereas here we are talking about a mean of differences, and such means have Student's distribution. But the argument relating the probability of the deviation of a sample mean from zero (when the mathematical expectation equals zero) to the number of units of s that this deviation amounts to remains valid.