Mathematics and informatics. Study guide throughout the course

First, let's recall the following definition:

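Consider the following situation. Let the variants of the general population have a normal distribution with mathematical expectation $a$ and standard deviation $\sigma $. The sample mean $\overline{x}$ is then treated as a random variable. When $X$ is normally distributed, the sample mean is also normally distributed, with parameters

\[M\left(\overline{x}\right)=a,\qquad \sigma \left(\overline{x}\right)=\frac{\sigma }{\sqrt{n}}\]

Let's find a confidence interval that covers $a$ with reliability $\gamma $.

To do this, we need the equality

\[P\left(\left|\overline{x}-a\right|<\delta \right)=\gamma \]

From it we get

\[2\Phi \left(\frac{\delta \sqrt{n}}{\sigma }\right)=\gamma ,\quad \text{that is,}\quad \Phi \left(t\right)=\frac{\gamma }{2},\quad \text{where}\quad t=\frac{\delta \sqrt{n}}{\sigma }\]

From here we can find $t$ from the table of values of the function $\Phi \left(t\right)$ and, as a result, find $\delta =\frac{\sigma t}{\sqrt{n}}$.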

Recall the table of values of the function $\Phi \left(t\right)$:

Figure 1. Table of values of the function $\Phi \left(t\right).$

Confidence interval for estimating the expectation when $\sigma $ is unknown

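In this case, we use the corrected variance $S^2$. Replacing $\sigma $ with $S$ in the formula $\delta =\frac{\sigma t}{\sqrt{n}}$, we get:

\[\overline{x}-\frac{St}{\sqrt{n}}<a<\overline{x}+\frac{St}{\sqrt{n}}\]

where $t$ is now found from the table of Student's distribution rather than from the table of $\Phi \left(t\right)$.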

An example of tasks for finding a confidence interval

Example 1

Let the quantity $X$ have a normal distribution with standard deviation $\sigma =4$. Let the sample size be $n=64$ and the reliability be $\gamma =0.95$. Find the confidence interval for estimating the mathematical expectation of this distribution.

We need to find the interval $(\overline{x}-\delta ,\overline{x}+\delta )$.

As we saw above

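\[\delta =\frac{\sigma t}{\sqrt{n}}=\frac{4t}{\sqrt{64}}=\frac{t}{2}\]

We find the parameter $t$ from the formula

\[\Phi \left(t\right)=\frac{\gamma }{2}=\frac{0.95}{2}=0.475\]

From the table in Figure 1 we get $t=1.96$, so $\delta =\frac{1.96}{2}=0.98$ and the required confidence interval is $(\overline{x}-0.98,\overline{x}+0.98)$.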
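As a quick numerical check of Example 1, here is a minimal Python sketch. It assumes that the tabulated function is the Laplace function (the integral of the standard normal density from 0 to $t$), so that the condition "function value equals $\gamma /2$" is the same as the standard normal CDF being equal to $0.5+\gamma /2$. The name `half_width` is ours and is only illustrative.

```python
from math import sqrt
from statistics import NormalDist


def half_width(sigma, n, gamma):
    """Return (t, delta) for a mean with known standard deviation sigma."""
    # Laplace function value gamma / 2  <=>  standard normal CDF(t) = 0.5 + gamma / 2
    t = NormalDist().inv_cdf(0.5 + gamma / 2)
    delta = sigma * t / sqrt(n)
    return t, delta


t, delta = half_width(sigma=4, n=64, gamma=0.95)
print(round(t, 2), round(delta, 2))  # 1.96 0.98
```

The sketch replaces the table lookup with an inverse CDF call; the arithmetic is otherwise the same as in the example.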

Sample statistics (the mean, the variance and others) are all estimates of their theoretical counterparts, which could be obtained if we had not a sample but the whole population. Alas, surveying the whole population is very expensive and often impossible.

The concept of interval estimation

Any sample estimate has some scatter, because it is a random variable that depends on the values in a particular sample. Therefore, for more reliable statistical inference, one should know not only the point estimate but also an interval that covers the estimated parameter θ (theta) with a high probability γ (gamma).

Formally, these are two values (statistics) T1(X) and T2(X), with T1 < T2, for which the following condition holds at a given probability level γ:

P(T1(X) ≤ θ ≤ T2(X)) ≥ γ

In short, with probability γ or more the true value lies between the points T1(X) and T2(X), which are called the lower and upper bounds of the confidence interval.

One of the requirements when constructing a confidence interval is maximum narrowness, i.e. it should be as short as possible. This is quite natural, because the researcher wants to localize the desired parameter as precisely as possible.

It follows that the confidence interval should cover the region of highest probability of the distribution, with the estimate itself at the center.

That is, the probability that the true value deviates upward from the estimate is equal to the probability that it deviates downward. Note also that for skewed distributions the part of the interval to the right of the estimate is not equal to the part to the left.

The figure above clearly shows that the greater the confidence level, the wider the interval - a direct relationship.

This was a small introduction to the theory of interval estimation of unknown parameters. Let's move on to finding confidence limits for the mathematical expectation.

Confidence interval for mathematical expectation

If the original data are normally distributed, then the mean is also normally distributed. This follows from the rule that a linear combination of normal variables is also normally distributed. Therefore, to calculate probabilities, we could use the mathematical apparatus of the normal distribution law.

However, this requires knowing two parameters, the expected value and the variance, which are usually unknown. You can, of course, use estimates instead of the parameters (the arithmetic mean and the sample variance), but then the distribution of the mean will not be quite normal: it will be slightly flattened. William Gosset, working in Ireland, astutely noted this fact when he published his discovery in the March 1908 issue of Biometrika. For secrecy, Gosset signed the paper as Student. This is how Student's t-distribution appeared.

However, normally distributed data, which C. F. Gauss used in analysing errors in astronomical observations, are extremely rare in everyday life, and normality is quite difficult to establish (high accuracy requires about 2 thousand observations). Therefore, it is best to drop the normality assumption and use methods that do not depend on the distribution of the original data.

The question arises: what is the distribution of the arithmetic mean if it is calculated from data with an unknown distribution? The answer is given by the Central Limit Theorem (CLT), well known in probability theory. In mathematics there are several versions of it (the formulations have been refined over the years), but, roughly speaking, they all boil down to the statement that the sum of a large number of independent random variables obeys the normal distribution law.

The arithmetic mean is computed from a sum of random variables. It follows that the arithmetic mean is approximately normally distributed, with expected value equal to the expected value of the original data and variance equal to σ²/n, where σ² is the variance of the original data and n is the sample size.

Smart people know how to prove the CLT, but we will verify it with an experiment in Excel. Let's simulate a sample of 50 uniformly distributed random values (using the Excel function RANDBETWEEN). Then we will generate 1000 such samples and calculate the arithmetic mean of each. Let's look at the distribution of these means.

It can be seen that the distribution of the average is close to the normal law. If the volume of samples and their number are made even larger, then the similarity will be even better.
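The same experiment can be repeated outside Excel. Below is a minimal Python sketch of it: 1000 samples of 50 uniformly distributed integers, with the mean of each sample compared against what the CLT predicts. The bounds 1 and 100 and the fixed seed are our assumptions; the text does not say which bounds were used in the spreadsheet.

```python
import random
from statistics import mean, pstdev

random.seed(1)  # fixed seed so that the run is repeatable

SAMPLE_SIZE = 50      # observations per sample, as in the text
N_SAMPLES = 1000      # number of samples, as in the text
LOW, HIGH = 1, 100    # assumed bounds of the uniform integers (not given in the text)

# Arithmetic mean of each simulated sample
means = [
    mean([random.randint(LOW, HIGH) for _ in range(SAMPLE_SIZE)])
    for _ in range(N_SAMPLES)
]

# Theoretical mean and standard deviation of one discrete uniform value
mu = (LOW + HIGH) / 2
sigma = (((HIGH - LOW + 1) ** 2 - 1) / 12) ** 0.5
se = sigma / SAMPLE_SIZE ** 0.5  # predicted standard deviation of the sample mean

centre = mean(means)
spread = pstdev(means)
# Share of sample means within 1.96 standard errors of mu (about 0.95 if the CLT holds)
share = sum(abs(m - mu) <= 1.96 * se for m in means) / N_SAMPLES

print(f"mean of means: {centre:.2f} (theory {mu:.2f})")
print(f"std of means:  {spread:.2f} (theory {se:.2f})")
print(f"share within 1.96 standard errors: {share:.3f}")
```

If the normal approximation is good, the last figure should be close to 0.95, which is a numerical counterpart of the histogram check described above.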

Now that we have seen for ourselves that the CLT holds, we can use the normal distribution to calculate confidence intervals for the arithmetic mean that cover the true mean (expected value) with a given probability.

To establish the upper and lower bounds, we need to know the parameters of the normal distribution. As a rule, they are unknown, so estimates are used instead: the arithmetic mean and the sample variance. Again, this method gives a good approximation only for large samples. When samples are small, it is often recommended to use Student's distribution. Don't believe it! The mean follows Student's distribution only when the original data are normally distributed, that is, almost never. Therefore, it is better to set a minimum bar for the amount of data required right away and use asymptotically correct methods. They say 30 observations are enough. Take 50 and you can't go wrong.

The bounds of the confidence interval are calculated as

T1,2 = x̄ ∓ cγ · s0/√n

where

T1,2 are the lower and upper bounds of the confidence interval

x̄ – sample arithmetic mean

s0 – sample standard deviation (unbiased)

n – sample size

γ – confidence level (usually equal to 0.9, 0.95 or 0.99)

cγ = Φ⁻¹((1+γ)/2) is the inverse of the standard normal distribution function. In simple terms, this is the number of standard errors from the arithmetic mean to the lower or upper bound (the three probabilities listed correspond to the values 1.64, 1.96 and 2.58).

The essence of the formula is that the arithmetic mean is taken and a certain number (cγ) of standard errors (s0/√n) is set aside from it in each direction. Everything is known, so just take it and compute.
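The formula described above (the mean plus or minus a number of standard errors) can be sketched in Python as follows. This is an illustration under our own assumptions, not the author's spreadsheet: the function name `mean_confidence_interval` is hypothetical, the data are made up, and the sample standard deviation is taken with n - 1 in the denominator.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(data, gamma=0.95):
    """Asymptotic confidence interval for the mean: x_bar -/+ c_gamma * s0 / sqrt(n)."""
    n = len(data)
    x_bar = mean(data)                               # sample arithmetic mean
    s0 = stdev(data)                                 # sample standard deviation (n - 1 in the denominator)
    c_gamma = NormalDist().inv_cdf((1 + gamma) / 2)  # inverse standard normal CDF
    half_width = c_gamma * s0 / sqrt(n)              # c_gamma standard errors
    return x_bar - half_width, x_bar + half_width


# Made-up data: 50 uniformly distributed integers from 1 to 100 (true mean 50.5)
random.seed(42)
sample = [random.randint(1, 100) for _ in range(50)]
low, high = mean_confidence_interval(sample, gamma=0.95)
print(f"{low:.2f} .. {high:.2f}")
```

Because the method is asymptotic, the sketch is meant for samples of the size recommended in the text (about 50 observations or more).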

Before the mass use of PCs, the values of the normal distribution function and its inverse were looked up in special tables. They are still in use, but it is more efficient to turn to ready-made Excel formulas. All the elements of the formula above (the mean, the standard deviation and the quantile) can easily be calculated in Excel. But there is also a ready-made function for calculating the confidence interval, CONFIDENCE.NORM. Its syntax is the following.

CONFIDENCE.NORM(alpha, standard_dev, size)

alpha – significance level, which in the above notation is equal to 1-γ, i.e. the probability that the mathematical expectation will be outside the confidence interval. With a confidence level of 0.95, alpha is 0.05, and so on.

standard_dev – the standard deviation of the sample data. You don't need to calculate the standard error; Excel divides by the root of n itself.

size – sample size (n).

The result of the CONFIDENCE.NORM function is the second term of the formula for the confidence interval, i.e. the half-width of the interval. Accordingly, the lower and upper bounds are the mean ± the obtained value.
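For readers without Excel, the calculation described for this function can be mirrored in Python. The sketch below is not Excel's implementation; it only follows the description in the text (a normal quantile times the standard deviation divided by the root of n), and the input values are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def confidence_norm(alpha, standard_dev, size):
    """Half-width of the interval, with the argument order described above."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # quantile for significance level alpha
    return z * standard_dev / sqrt(size)     # quantile times the standard error


# Illustrative (made-up) inputs: alpha = 0.05, standard deviation 10, 100 observations
half = confidence_norm(0.05, 10, 100)
print(round(half, 3))  # 1.96
```

The lower and upper bounds are then the sample mean minus and plus `half`.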

Thus, it is possible to build a universal algorithm for calculating confidence intervals for the arithmetic mean that does not depend on the distribution of the original data. The price of universality is its asymptotic nature, i.e. the need to use relatively large samples. However, in the age of modern technology, collecting the right amount of data is usually not difficult.

Testing Statistical Hypotheses Using a Confidence Interval


One of the main problems solved in statistics is hypothesis testing. In a nutshell, its essence is this. An assumption is made, for example, that the expectation of the general population is equal to some value. Then the distribution of sample means that could be observed under that expectation is constructed. Next, we look at where the actual sample mean falls in this conditional distribution. If it lies beyond the allowable limits, then such a mean is very unlikely to appear, and in a single run of the experiment it is practically impossible; this contradicts the hypothesis, which is therefore rejected. If the mean does not go beyond the critical level, the hypothesis is not rejected (but it is not proven either!).

So confidence intervals, in our case for the expectation, can also be used to test some hypotheses. It's very easy to do. Suppose the arithmetic mean of some sample is 100, and we test the hypothesis that the expected value is, say, 90. Put crudely, the question is: can the observed mean be 100 if the true mean is 90?

To answer this question, we also need the standard deviation and the sample size. Let's say the standard deviation is 30 and the number of observations is 64 (so that the root is easy to extract). Then the standard error of the mean is 30/8, or 3.75. To calculate the 95% confidence interval, we set aside two standard errors on each side of the mean (more precisely, 1.96). The confidence interval is approximately 100 ± 7.5, or from 92.5 to 107.5.

The reasoning then goes as follows. If the tested value falls within the confidence interval, then it does not contradict the data, since it fits within the limits of random fluctuations (with a probability of 95%). If the tested value is outside the confidence interval, then the probability of such an event is very small, in any case below the acceptable level. Hence, the hypothesis is rejected as contradicting the observed data. In our case the hypothesized expectation lies outside the confidence interval (the tested value of 90 is not within 100 ± 7.5), so the hypothesis should be rejected. Answering the crude question above: no, it cannot, or at least it happens extremely rarely. Often a specific probability of erroneously rejecting the hypothesis (the p-value) is reported instead of the preset level at which the confidence interval was built, but more on that another time.
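The check just described takes only a few lines of Python. The sketch uses the numbers from the text (mean 100, standard deviation 30, 64 observations, tested value 90) and the exact multiplier of about 1.96 rather than the rounded value 2, so the half-width comes out as about 7.35 instead of 7.5; the variable names are ours.

```python
from math import sqrt
from statistics import NormalDist

x_bar, s, n = 100, 30, 64      # sample mean, standard deviation, sample size from the text
mu0 = 90                       # hypothesised expected value
gamma = 0.95                   # confidence level

se = s / sqrt(n)                             # standard error: 30 / 8 = 3.75
z = NormalDist().inv_cdf((1 + gamma) / 2)    # about 1.96
low, high = x_bar - z * se, x_bar + z * se

# The hypothesis is rejected when the tested value lies outside the interval
rejected = not (low <= mu0 <= high)
print(round(low, 2), round(high, 2), rejected)  # 92.65 107.35 True
```

The conclusion is the same as with the rounded interval: the value 90 lies outside, so the hypothesis is rejected.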

As you can see, it is not difficult to build a confidence interval for the mean (or mathematical expectation). The main thing is to grasp the idea, and then things will go smoothly. In practice, most people use the 95% confidence interval, which extends about two standard errors on either side of the mean.

That's all for now. All the best!




Confidence intervals: theory and problems

Understanding Confidence Intervals

Let us briefly introduce the concept of a confidence interval, which
1) estimates some parameter of a numerical sample directly from the data of the sample itself,
2) covers the value of this parameter with probability γ.

A confidence interval for the parameter X (with probability γ) is an interval of the form (X1, X2) such that P(X1 < X < X2) = γ, where the bounds X1 and X2 are computed in some way from the sample.

In applied problems the confidence level is usually taken to be γ = 0.9, 0.95 or 0.99.

Consider a sample of size n drawn from a general population that is presumably normally distributed. Let us show the formulas used to find confidence intervals for the distribution parameters: the mathematical expectation and the variance (standard deviation).

Confidence interval for mathematical expectation

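Case 1 The distribution variance is known and equal to $\sigma ^2$. Then the confidence interval for the parameter $a$ looks like:
\[\overline{x}-\frac{t\sigma }{\sqrt{n}}<a<\overline{x}+\frac{t\sigma }{\sqrt{n}}\]
where $\overline{x}$ is the sample mean and $t$ is determined from the table of the Laplace function by the relation $2\Phi \left(t\right)=\gamma $.

Case 2 The distribution variance is unknown; a point estimate $s^2$ of the variance is calculated from the sample. Then the confidence interval for the parameter $a$ looks like:
\[\overline{x}-\frac{ts}{\sqrt{n}}<a<\overline{x}+\frac{ts}{\sqrt{n}}\]
where $\overline{x}$ is the sample mean calculated from the sample and the parameter $t$ is determined from the table of Student's distribution.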

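Example. Based on 7 measurements of a certain quantity, the mean of the measurement results was found to be 30 and the sample variance 36. Find the bounds that contain the true value of the measured quantity with a reliability of 0.99.

Solution. From the table of Student's distribution for $\gamma =0.99$ and $n=7$ we find $t=3.71$. Then the confidence limits for the interval containing the true value of the measured quantity can be found by the formula:
\[\overline{x}-\frac{ts}{\sqrt{n}}<a<\overline{x}+\frac{ts}{\sqrt{n}}\]
where $\overline{x}$ is the sample mean and $s^2$ is the sample variance. Plugging in all the values (taking $s=\sqrt{36}=6$), we get:
\[30-3.71\cdot \frac{6}{\sqrt{7}}<a<30+3.71\cdot \frac{6}{\sqrt{7}}\]
that is, approximately $21.59<a<38.41$.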

Confidence interval for variance

We assume that, generally speaking, the mathematical expectation is unknown and only an unbiased point estimate of the variance is known. Then the confidence interval looks like:
, where - distribution quantiles determined from tables.

Example. Based on the data of 7 tests, the value of the estimate for the standard deviation was found s=12. Find with a probability of 0.9 the width of the confidence interval built to estimate the variance.

Solution. The confidence interval for the unknown population variance can be found using the formula:

Substitute and get:


Then the width of the confidence interval is 465.589-71.708=393.881.

Confidence interval for probability (percentage)

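Case 1 Let the sample size $n$ and the sample fraction (relative frequency) $w$ be known in the problem. Then the confidence interval for the general fraction (true probability) $p$ is:
\[w-t\sqrt{\frac{w\left(1-w\right)}{n}}<p<w+t\sqrt{\frac{w\left(1-w\right)}{n}}\]
where the parameter $t$ is determined from the table of the Laplace function by the relation $2\Phi \left(t\right)=\gamma $.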

Case 2 If the problem additionally knows the total size of the population from which the sample was taken, the confidence interval for the general fraction (true probability) can be found using the adjusted formula:
.

Example. It is known that Find the boundaries in which the general share is concluded with probability.

Solution. We use the formula:

Let's find the parameter from the condition , we get Substitute in the formula:



Let the random variable $X$ form the general population and let $\beta $ be an unknown parameter of $X$. If the statistical estimate $\beta ^*$ is consistent, then the larger the sample size, the more accurately we obtain the value of $\beta $. In practice, however, samples are not very large, so we cannot guarantee high accuracy.

Let $\beta ^*$ be a statistical estimate of $\beta $. The quantity $\left|\beta ^*-\beta \right|$ is called the accuracy of the estimate. Clearly, the accuracy is a random variable, since $\beta ^*$ is a random variable. Let us fix a small positive number $\delta $ and require that the accuracy of the estimate be less than $\delta $, i.e. $\left|\beta ^*-\beta \right|<\delta $.

The reliability $\gamma $, or confidence level, of the estimate of $\beta $ by $\beta ^*$ is the probability $\gamma $ with which the inequality $\left|\beta ^*-\beta \right|<\delta $ holds, i.e.

\[P\left(\left|\beta ^*-\beta \right|<\delta \right)=\gamma \]

Usually the reliability $\gamma $ is set in advance, and $\gamma $ is taken to be a number close to 1 (0.9; 0.95; 0.99; ...).

Since the inequality $\left|\beta ^*-\beta \right|<\delta $ is equivalent to the double inequality $\beta ^*-\delta <\beta <\beta ^*+\delta $, we obtain:

\[P\left(\beta ^*-\delta <\beta <\beta ^*+\delta \right)=\gamma \]

The interval $(\beta ^*-\delta ,\beta ^*+\delta )$ is called the confidence interval, i.e. the confidence interval covers the unknown parameter $\beta $ with probability $\gamma $. Note that the ends of the confidence interval are random and vary from sample to sample, so it is more accurate to say that the interval $(\beta ^*-\delta ,\beta ^*+\delta )$ covers the unknown parameter $\beta $ than that $\beta $ belongs to this interval.
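The remark that the ends of the interval are random, while the parameter is fixed, can be illustrated by simulation. The Python sketch below is only an illustration under our own assumptions: the unknown parameter is taken to be the mean of a normal distribution with known standard deviation, the true values 10 and 2 are made up, and the half-width is computed as the normal quantile times the standard deviation divided by the root of n.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(0)  # fixed seed so that the run is repeatable

A, SIGMA = 10.0, 2.0     # assumed true mean and known standard deviation (illustrative)
N, GAMMA = 25, 0.95      # sample size and reliability
TRIALS = 2000            # number of simulated samples

t = NormalDist().inv_cdf((1 + GAMMA) / 2)
delta = t * SIGMA / sqrt(N)          # accuracy (half-width) of the estimate

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(A, SIGMA) for _ in range(N)]
    estimate = mean(sample)          # the point estimate differs from sample to sample
    if estimate - delta < A < estimate + delta:
        covered += 1                 # this particular interval covers the fixed parameter

coverage = covered / TRIALS
print(f"share of intervals covering the parameter: {coverage:.3f}")
```

The printed share should be close to the chosen reliability: it is the interval that moves from sample to sample, and it covers the fixed parameter in roughly that fraction of cases.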

Let the general population be given by a random variable $X$ distributed according to the normal law, and let the standard deviation $\sigma $ be known. The mathematical expectation $a=M(X)$ is unknown. It is required to find a confidence interval for $a$ with a given reliability $\gamma $.

The sample mean

\[\overline{x}_B=\frac{x_1+x_2+...+x_n}{n}\]

is a statistical estimate of the general mean $a$.

Theorem. The random variable $\overline{x}_B$ is normally distributed if $X$ is normally distributed, and

\[M\left(\overline{x}_B\right)=a,\qquad \sigma \left(\overline{x}_B\right)=\frac{\sigma }{\sqrt{n}}\]

where $\sigma =\sqrt{D(X)}$, $a=M(X)$.

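The confidence interval for $a$ has the form:

\[\overline{x}_B-\delta <a<\overline{x}_B+\delta \]

Let us find $\delta $.

Using the relation

\[P\left(\left|X-a\right|<\delta \right)=2\Phi \left(\frac{\delta }{\sigma }\right)\]

where $\Phi \left(t\right)$ is the Laplace function, we have:

\[P\left(\left|\overline{x}_B-a\right|<\delta \right)=2\Phi \left(\frac{\delta \sqrt{n}}{\sigma }\right)\]

Denoting

\[\frac{\delta \sqrt{n}}{\sigma }=t\]

we get $2\Phi \left(t\right)=\gamma $, i.e. $\Phi \left(t\right)=\frac{\gamma }{2}$, and we find the value of $t$ in the table of values of the Laplace function.

From the equality $t=\frac{\delta \sqrt{n}}{\sigma }$ we find $\delta =\frac{t\sigma }{\sqrt{n}}$, the accuracy of the estimate.

So the confidence interval for $a$ has the form:

\[\overline{x}_B-\frac{t\sigma }{\sqrt{n}}<a<\overline{x}_B+\frac{t\sigma }{\sqrt{n}}\]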

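If a sample is given from the general population $X$

$x_i$: $x_1$, $x_2$, ..., $x_m$
$n_i$: $n_1$, $n_2$, ..., $n_m$

with $n=n_1+...+n_m$, then the confidence interval will be:

\[\frac{1}{n}\sum _{i=1}^{m}x_in_i-\frac{t\sigma }{\sqrt{n}}<a<\frac{1}{n}\sum _{i=1}^{m}x_in_i+\frac{t\sigma }{\sqrt{n}}\]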

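Example 6.35. Find the confidence interval for estimating the expectation $a$ of a normal distribution with a reliability of 0.95, given the sample mean $\overline{x}_B=10.43$, the sample size $n=100$, and the standard deviation $\sigma =5$.

Let's use the formula

\[\overline{x}_B-\frac{t\sigma }{\sqrt{n}}<a<\overline{x}_B+\frac{t\sigma }{\sqrt{n}}\]

From $\Phi \left(t\right)=\frac{0.95}{2}=0.475$ we find $t=1.96$, so $\delta =\frac{1.96\cdot 5}{\sqrt{100}}=0.98$ and the confidence interval is $9.45<a<11.41$.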

Let's build a confidence interval in MS EXCEL for estimating the mean value of the distribution in the case of a known value of the variance.

Of course, the choice of confidence level depends entirely on the task at hand. For example, an air passenger's confidence in the reliability of an aircraft should certainly be higher than a buyer's confidence in the reliability of a light bulb.

Task Formulation

Let's assume that a sample of size n has been taken from a population. It is assumed that the standard deviation of this distribution is known. Based on this sample, we need to estimate the unknown mean of the distribution (μ) and construct the corresponding two-sided confidence interval.

Point Estimation

As is known, the statistic called the sample mean (we denote it X̄) is an unbiased estimate of the mean of this population and, for a normal population, has the distribution N(μ; σ²/n).

Note: What if you need to build a confidence interval for a distribution that is not normal? In this case the Central Limit Theorem comes to the rescue: for a sufficiently large sample size n from a non-normal distribution, the sampling distribution of the statistic X̄ is approximately normal with parameters N(μ; σ²/n).

So, the point estimate of the mean of the distribution is the sample mean, X̄. Now let's turn to the confidence interval.

Building a confidence interval

Usually, knowing a distribution and its parameters, we can calculate the probability that a random variable takes a value in a specified interval. Now let's do the opposite: find the interval into which the random variable falls with a given probability. For example, from the properties of the normal distribution it is known that, with a probability of 95%, a normally distributed random variable falls within approximately +/- 2 standard deviations of the mean. This interval will serve as the prototype for our confidence interval.

Now let's see whether we know the distribution well enough to calculate this interval. To answer the question, we must specify the form of the distribution and its parameters.

We know the form of the distribution: it is the normal distribution (remember that we are talking about the sampling distribution of the statistic X̄, the sample mean).

The parameter μ is unknown to us (it is exactly what we need to estimate with the confidence interval), but we have its estimate X̄, calculated from the sample, which can be used.

The second parameter, the standard deviation of the sample mean, is known: it is equal to σ/√n.

Because we do not know μ, we will build the interval of +/- 2 standard deviations not around the mean but around its known estimate X̄. That is, when calculating the confidence interval we will NOT assume that X̄ falls within +/- 2 standard deviations of μ with a probability of 95%; instead we will assume that the interval of +/- 2 standard deviations around X̄ covers μ, the mean of the population from which the sample was taken, with a probability of 95%. These two statements are equivalent, but the second one allows us to construct the confidence interval.

In addition, let us refine the interval: a normally distributed random variable falls, with a 95% probability, within +/- 1.960 standard deviations, not +/- 2 standard deviations. This can be calculated using the formula =NORM.S.INV((1+0.95)/2); see the example file, sheet Spacing.

Now we can formulate the probabilistic statement that will serve as the basis for the confidence interval:
"The probability that the population mean lies within 1.960 standard deviations of the sample mean from the sample mean is 95%."

The probability mentioned in the statement has a special name, the confidence level, which is related to the significance level α (alpha) by a simple expression: confidence level = 1 - α. In our case the significance level is α = 1 - 0.95 = 0.05.

Now, based on this probabilistic statement, we write the expression for calculating the confidence interval:

X̄ - Zα/2 · σ/√n ≤ μ ≤ X̄ + Zα/2 · σ/√n

where X̄ is the sample mean and Zα/2 is the upper α/2-quantile of the standard normal distribution (the value of the random variable z such that P(z >= Zα/2) = α/2).

Note: The upper α/2-quantile determines the width of the confidence interval in standard deviations of the sample mean. The upper α/2-quantile of the standard normal distribution is always greater than 0, which is very convenient.

In our case, at α=0.05, the upper α/2-quantile equals 1.960. For other significance levels α (10%; 1%) the upper α/2-quantile Zα/2 can be calculated using the formula =NORM.S.INV(1-α/2) or, if the confidence level is known, =NORM.S.INV((1+confidence level)/2).

Usually, when building confidence intervals for the mean, only the upper α/2-quantile is used and the lower α/2-quantile is not. This is possible because the standard normal distribution is symmetric about zero (its density is symmetric about the mean, i.e. 0). Therefore there is no need to calculate the lower α/2-quantile (it is simply called the α/2-quantile), because it is equal to the upper α/2-quantile with a minus sign.
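The symmetry of the two quantiles is easy to check numerically. The short Python sketch below is our own illustration (it is not part of the example file); it computes the upper and lower α/2-quantiles of the standard normal distribution for the three significance levels mentioned in the text.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal distribution: mean 0, standard deviation 1

for alpha in (0.10, 0.05, 0.01):
    upper = std_normal.inv_cdf(1 - alpha / 2)   # upper alpha/2-quantile
    lower = std_normal.inv_cdf(alpha / 2)       # lower alpha/2-quantile
    print(f"alpha={alpha:.2f}  upper={upper:.3f}  lower={lower:.3f}")
```

For each α the two printed values differ only in sign, which is why a single quantile is enough to build the interval.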

Recall that, regardless of the shape of the distribution of x, the corresponding sample mean is approximately normally distributed, N(μ; σ²/n). Therefore, in general, the above expression for the confidence interval is only approximate. If x itself is normally distributed, N(μ; σ²), then the expression for the confidence interval is exact.

Calculation of confidence interval in MS EXCEL

Let's solve the problem.
The response time of an electronic component to an input signal is an important characteristic of a device. An engineer wants to construct a confidence interval for the mean response time at a confidence level of 95%. From previous experience, the engineer knows that the standard deviation of the response time is 8 ms. The engineer made 25 measurements to estimate the response time, and the mean value was 78 ms.

Solution: The engineer wants to know the response time of the electronic device, but he understands that the response time is not a fixed value but a random variable with its own distribution. So the best he can hope for is to determine the parameters and shape of this distribution.

Unfortunately, the problem statement does not tell us the form of the response-time distribution (it does not have to be normal). The mean of this distribution is also unknown. Only its standard deviation σ=8 is known. Therefore, for the moment we cannot calculate probabilities and construct a confidence interval.

However, although we do not know the distribution of an individual response time, we know that, according to the CLT, the sampling distribution of the mean response time is approximately normal (we will assume that the conditions of the CLT are satisfied, because the sample size is large enough (n=25)).

Furthermore, the mean of this distribution is equal to the mean of the distribution of an individual response, i.e. μ. And the standard deviation of this distribution (σ/√n) can be calculated using the formula =8/SQRT(25).

It is also known that the engineer obtained a point estimate of the parameter μ equal to 78 ms (the sample mean X̄). Therefore, we can now calculate probabilities, because we know the form of the distribution (normal) and its parameters (X̄ and σ/√n).

The engineer wants to know the expected value μ of the response-time distribution. As stated above, this μ is equal to the expected value of the sampling distribution of the mean response time. If we use the normal distribution N(X̄; σ/√n), then the desired μ will lie within +/-2*σ/√n of X̄ with a probability of approximately 95%.

The significance level equals 1-0.95=0.05.

Finally, find the left and right borders of the confidence interval.
Left border: =78-NORM.S.INV(1-0.05/2)*8/SQRT(25) = 74.864
Right border: =78+NORM.S.INV(1-0.05/2)*8/SQRT(25) = 81.136

Or, equivalently:
Left border: =NORM.INV(0.05/2, 78, 8/SQRT(25))
Right border: =NORM.INV(1-0.05/2, 78, 8/SQRT(25))

Answer: the confidence interval at the 95% confidence level and σ=8 ms is 78+/-3.136 ms.
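The same answer can be reproduced outside the spreadsheet. Here is a minimal Python sketch with the engineer's numbers; it takes the quantiles of a normal distribution centred on the estimate, which is our stand-in for the spreadsheet's inverse normal function with three arguments. The variable names are ours.

```python
from math import sqrt
from statistics import NormalDist

x_bar, sigma, n = 78, 8, 25    # sample mean (ms), known sigma (ms), number of measurements
alpha = 0.05                   # significance level

# Sampling distribution of the mean, centred on the estimate, with std sigma / sqrt(n)
sampling = NormalDist(mu=x_bar, sigma=sigma / sqrt(n))

left = sampling.inv_cdf(alpha / 2)        # left border of the interval
right = sampling.inv_cdf(1 - alpha / 2)   # right border of the interval
print(round(left, 3), round(right, 3))  # 74.864 81.136
```

The half-width `right - x_bar` is about 3.136 ms, in agreement with the answer above.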

In the example file, on the sheet Sigma known, there is a form for calculating and constructing a two-sided confidence interval for arbitrary samples with a given σ and significance level.

CONFIDENCE.NORM() function

If the sample values are in the range B20:B79 and the significance level is 0.05, then the MS EXCEL formula:
=AVERAGE(B20:B79)-CONFIDENCE.NORM(0.05,σ, COUNT(B20:B79))
will return the left border of the confidence interval.

The same border can be calculated using the formula:
=AVERAGE(B20:B79)-NORM.S.INV(1-0.05/2)*σ/SQRT(COUNT(B20:B79))

Note: The CONFIDENCE.NORM() function appeared in MS EXCEL 2010. Earlier versions of MS EXCEL used the CONFIDENCE() function.