Constructing an interval variation series. Statistical summary and grouping

If the random variable under study is continuous, ranking and grouping the observed values often fails to reveal the characteristic features of its variation. Individual values of a continuous random variable can differ from one another by an arbitrarily small amount, so identical values rarely occur in the observed data and the frequencies of the variants differ little from one another.

It is also impractical to construct a discrete series for a discrete random variable with a large number of possible values. In such cases, an interval variation series should be constructed.

To construct such a series, the entire range of variation of the observed values is divided into a number of partial intervals, and the frequency with which values fall into each partial interval is counted.

An interval variation series is an ordered set of intervals of values of a random variable together with the corresponding frequencies or relative frequencies with which the values fall into each interval.

To build an interval series you need to:

determine the number of partial intervals;
determine the width of the intervals;
set the upper and lower boundary of each interval;
group the observation results.

1. The number and width of the grouping intervals have to be chosen in each specific case based on the goals of the study, the sample size, and the degree of variation of the characteristic in the sample.

The approximate number of intervals $k$ can be estimated from the sample size $n$ alone in one of the following ways:

by Sturges' formula: $k = 1 + 3.32 \lg n$, where $\lg$ is the decimal logarithm;
using Table 1.

Table 1

2. Intervals of equal width are generally preferred. To determine the interval width $h$, calculate:

the range of variation of the sample values, $R = x_{\max} - x_{\min}$,

where $x_{\max}$ and $x_{\min}$ are the maximum and minimum sample values;

the width of each interval, $h = R/k$.

3. The lower boundary of the first interval, $x_{\text{low},1}$, is chosen so that the minimum sample value $x_{\min}$ falls approximately in the middle of this interval: $x_{\text{low},1} = x_{\min} - 0.5h$.

The boundaries of the subsequent intervals are obtained by adding the width $h$ of the partial interval to the boundary of the previous interval:

$x_{\text{low},i} = x_{\text{low},i-1} + h$.

The interval scale is extended in this way as long as the boundary $x_{\text{low},i}$ satisfies the relation

$x_{\text{low},i} < x_{\max} + 0.5h$.

4. The values of the characteristic are grouped according to the interval scale: for each partial interval, the sum of the frequencies $n_i$ of the variants falling into the $i$-th interval is calculated. An interval includes the values of the random variable that are greater than or equal to its lower boundary and less than its upper boundary.
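The four steps above can be sketched in Python. This is only an illustration under stated assumptions: the function name `interval_series` and the sample values are hypothetical, the number of intervals is taken from Sturges' formula rounded to the nearest integer, and the sample must contain at least two distinct values. Because the first interval is centred on the minimum and the scale is extended until the maximum is covered, the sketch produces k + 1 intervals.

```python
import math

def interval_series(values):
    """Return a list of (lower, upper, frequency) for half-open intervals [lower, upper)."""
    n = len(values)
    k = round(1 + 3.32 * math.log10(n))   # step 1: Sturges' estimate of the number of intervals
    x_min, x_max = min(values), max(values)
    h = (x_max - x_min) / k               # step 2: interval width h = R / k
    first_low = x_min - 0.5 * h           # step 3: x_min lies in the middle of the first interval
    # The stopping rule x_low < x_max + 0.5h yields k + 1 intervals;
    # the last one is centred on x_max.
    counts = [0] * (k + 1)
    for x in values:                      # step 4: lower boundary included, upper excluded
        i = min(int((x - first_low) // h), k)
        counts[i] += 1
    return [(first_low + i * h, first_low + (i + 1) * h, c)
            for i, c in enumerate(counts)]

# Hypothetical sample, used only to exercise the function.
sample = [12, 15, 17, 18, 20, 21, 21, 23, 24, 25, 26, 28, 30, 31, 33, 36]
series = interval_series(sample)
for low, high, freq in series:
    print(f"[{low:.1f}, {high:.1f}): {freq}")
```

The frequencies always add up to the sample size, which is a quick check that no value was lost during grouping.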

Polygon and histogram

For clarity, various graphs of the statistical distribution are constructed.

For a discrete variation series, a polygon of frequencies or of relative frequencies is constructed.

A frequency polygon is a broken line whose segments connect the points $(x_1; n_1), (x_2; n_2), \ldots, (x_k; n_k)$. To construct it, the variants $x_i$ are plotted on the abscissa axis and the corresponding frequencies $n_i$ on the ordinate axis. The points $(x_i; n_i)$ are connected by straight segments, giving the frequency polygon (Fig. 1).

A polygon of relative frequencies is a broken line whose segments connect the points $(x_1; W_1), (x_2; W_2), \ldots, (x_k; W_k)$. To construct it, the variants $x_i$ are plotted on the abscissa axis and the corresponding relative frequencies $W_i$ on the ordinate axis. The points $(x_i; W_i)$ are connected by straight segments, giving the polygon of relative frequencies.

For a continuous characteristic, it is advisable to construct a histogram.

A frequency histogram is a stepped figure consisting of rectangles whose bases are the partial intervals of length $h$ and whose heights are equal to the ratio $n_i/h$ (the frequency density).

To construct a frequency histogram, the partial intervals are marked on the abscissa axis, and above each of them a segment parallel to the abscissa axis is drawn at a height of $n_i/h$.
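As a small illustration of the histogram heights, the sketch below computes the frequency density n_i / h for each interval. The frequencies and the width are hypothetical values chosen for the example, and `frequency_density` is an illustrative helper name. With heights defined this way, the total area of the rectangles equals the sample size.

```python
def frequency_density(frequencies, h):
    """Heights of the histogram rectangles: n_i / h for each partial interval."""
    return [n_i / h for n_i in frequencies]

frequencies = [3, 7, 12, 6, 2]   # hypothetical interval frequencies n_i
h = 5.0                          # hypothetical common interval width

heights = frequency_density(frequencies, h)
area = sum(height * h for height in heights)   # area of all rectangles

print(heights)
print(area, sum(frequencies))
```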

The number of groups (intervals) is determined approximately by Sturges' formula:

$m = 1 + 3.322 \lg n$

where $n$ is the total number of observation units (the total number of elements in the population, etc.) and $\lg n$ is the decimal logarithm of $n$.

The value obtained from Sturges' formula is usually rounded to the nearest whole number, since the number of groups cannot be fractional.

If an interval series with this number of groups is unsatisfactory by some criterion, another interval series can be built by rounding $m$ down to the nearest smaller integer, and the more suitable of the two series is then chosen.

The number of groups should not exceed 15.

If the decimal logarithm cannot be calculated, the following table can be used instead.
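The rule for the number of groups can be sketched as follows. The function name `number_of_groups` and its parameters are illustrative; the cap of 15 and the option of rounding down come from the text above.

```python
import math

def number_of_groups(n, max_groups=15, round_down=False):
    """Number of groups by Sturges' formula m = 1 + 3.322 * lg(n), capped at max_groups."""
    m = 1 + 3.322 * math.log10(n)
    m = math.floor(m) if round_down else round(m)
    return min(m, max_groups)

print(number_of_groups(100))                    # nearest whole number
print(number_of_groups(100, round_down=True))   # alternative with the smaller integer
print(number_of_groups(10**6))                  # limited by the cap of 15
```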

Determining the width of the interval

The interval width for an interval variation series with equal intervals is determined by the formula:

$i = \dfrac{X_{\max} - X_{\min}}{m}$,

where $X_{\max}$ is the maximum of the values $x_i$, $X_{\min}$ is the minimum of the values $x_i$, and $m$ is the number of groups (intervals).

The interval width $i$ is usually rounded to the nearest whole number. The only exceptions are cases in which very small fluctuations of a characteristic are studied (for example, when parts are grouped by the size of their deviations from the nominal value, measured in fractions of a millimeter).

The following rule is often used:

Number of digits before the decimal point	Number of digits after the decimal point	Example of interval width from the formula	To which digit do we round?	Example of rounded interval width

Determining the boundaries of the intervals

The lower limit of the first interval is taken equal to the minimum value of the characteristic; most often it is first rounded down to the same decimal place as the interval width. For example, if $x_{\min} = 15$ and $i = 130$, the lower limit of the first interval is 10.

$x_{\text{low},1} \approx x_{\min}$

The upper limit of the first interval corresponds to the value $X_{\min} + i$.

The lower limit of the second interval is always equal to the upper limit of the first interval. For subsequent groups the boundaries are determined in the same way, i.e., the interval width is added successively. For the $j$-th interval:

$x_{\text{up},j} = x_{\text{low},j} + i$

$x_{\text{low},j} = x_{\text{up},j-1}$

Determine the frequencies of the intervals.

We count how many values fall into each interval. Remember that if a unit has a value equal to the upper limit of an interval, it is assigned to the next interval.

We build an interval series in the form of a table.

Determine the midpoints of the intervals.

For further analysis of the interval series, one value of the characteristic must be chosen for each interval. This value is shared by all units of observation that fall into the interval. That is, individual elements "lose" their individual values and are assigned one common value. This common value is the midpoint of the interval, denoted $x'_i$.

Let us consider, using the heights of children as an example, how to build an interval series with equal intervals.

Initial data:

90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 92, 93, 94, 95, 96, 98, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 100, 101, 102, 104, 110, 112, 114, 116, 117, 120, 122, 123, 124, 129, 110, 111, 113, 115, 116, 117, 121, 125, 126, 127, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 111, 113, 116, 127, 123, 122, 130, 131, 132, 133, 134, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 131, 133, 135, 136, 138, 139, 140, 141, 142, 143, 145, 146, 147, 148
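A minimal Python sketch of the procedure applied to the data above. It assumes that the interval width is the range divided by the number of groups, rounded half up to a whole number, and that the lower limit of the first interval is simply the minimum value; the variable names are illustrative. A value equal to an upper limit goes to the next interval, as described earlier.

```python
import math

heights = [
    90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 92, 93, 94, 95, 96, 98,
    100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 100, 101, 102, 104,
    110, 112, 114, 116, 117, 120, 122, 123, 124, 129,
    110, 111, 113, 115, 116, 117, 121, 125, 126, 127,
    110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
    120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
    111, 113, 116, 127, 123, 122,
    130, 131, 132, 133, 134, 136, 137, 138, 139, 140,
    141, 142, 143, 144, 145, 146, 147, 148, 149, 150,
    131, 133, 135, 136, 138, 139, 140, 141, 142, 143, 145, 146, 147, 148,
]

n = len(heights)
m = round(1 + 3.322 * math.log10(n))            # number of groups (Sturges)
x_min, x_max = min(heights), max(heights)
width = math.floor((x_max - x_min) / m + 0.5)   # interval width, rounded half up

rows = []                                       # (lower, upper, midpoint, frequency)
low = x_min
while low <= x_max:                             # extend until the maximum is covered
    high = low + width
    freq = sum(low <= x < high for x in heights)  # upper limit belongs to the next interval
    rows.append((low, high, (low + high) / 2, freq))
    low = high

for low, high, mid, freq in rows:
    print(f"{low}-{high}  midpoint {mid}  frequency {freq}")
```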

In many cases, when the statistical population includes a large or, all the more so, an infinite number of variants, as most often happens with continuous variation, it is practically impossible and impractical to form a group of units for each variant. In such cases, statistical units can be combined into groups only on the basis of an interval, i.e., a group bounded by certain values of the varying characteristic. These limits are given by two numbers, the lower and upper boundaries of each group. The use of intervals leads to the formation of an interval distribution series.

An interval series is a variation series whose variants are presented in the form of intervals.

An interval series can be formed with equal or unequal intervals; the choice of principle depends mainly on the degree of representativeness and convenience of the statistical population. If the population is large enough (representative) in number of units and completely homogeneous in composition, it is advisable to form the interval series with equal intervals. This principle is usually applied to populations in which the range of variation is relatively small, i.e., the maximum and minimum variants differ by no more than a factor of a few. In this case, the size of the equal interval is calculated as the ratio of the range of variation of the characteristic to the given number of intervals. To determine the size of the equal interval, the Sturges formula can be used (usually when the variation of the characteristic is small and the number of units in the population is large):

$x_i = \dfrac{X_{\max} - X_{\min}}{1 + 3.322 \lg n}$, (5.1)

where $x_i$ is the size of the equal interval; $X_{\max}$ and $X_{\min}$ are the maximum and minimum variants in the statistical population; $n$ is the number of units in the population.

Example. It is required to calculate the size of the equal interval for the density of radioactive contamination with cesium-137 in 100 settlements of the Krasnopolsky district of the Mogilev region, given that the initial (minimum) variant is 1 Ci/km² and the final (maximum) variant is 65 Ci/km². Using formula 5.1 we get:

$x_i = \dfrac{65 - 1}{1 + 3.322 \lg 100} = \dfrac{64}{7.644} \approx 8.4$ Ci/km².

Consequently, to form an interval series with equal intervals for the density of cesium-137 contamination of settlements in the Krasnopolsky district, the size of the equal interval can be taken as 8 Ci/km².
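The arithmetic of this example can be checked with a short sketch. It assumes that formula 5.1 has the usual Sturges form, the range divided by 1 + 3.322 lg n; the function name `equal_interval` is illustrative.

```python
import math

def equal_interval(x_max, x_min, n):
    """Size of the equal interval: range divided by Sturges' number of groups."""
    return (x_max - x_min) / (1 + 3.322 * math.log10(n))

# Values from the example: 100 settlements, minimum 1 and maximum 65 Ci/km^2.
size = equal_interval(65, 1, 100)
print(round(size, 2))   # about 8.37, rounded to 8 in the text
```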

Under conditions of uneven distribution, i.e., when the maximum and minimum variants differ by hundreds of times, the principle of unequal intervals can be applied when forming an interval series. Unequal intervals usually widen toward larger values of the characteristic.

Intervals can be closed or open. Closed intervals have both a lower and an upper boundary. Open intervals have only one boundary: the first interval has only an upper boundary, and the last has only a lower boundary.

It is advisable to evaluate interval series, especially those with unequal intervals, using the distribution density, which is most simply calculated as the ratio of the local frequency (or relative frequency) to the size of the interval.
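A minimal sketch of the distribution density for a series with unequal intervals. The intervals and frequencies are hypothetical and serve only to show why raw frequencies can mislead when the widths differ.

```python
# Hypothetical interval series with unequal widths: (lower, upper, frequency).
series = [(0, 5, 10), (5, 15, 30), (15, 40, 50), (40, 100, 10)]

# Distribution density: local frequency divided by the size of the interval.
density = [freq / (high - low) for low, high, freq in series]

for (low, high, freq), d in zip(series, density):
    print(f"{low}-{high}: frequency {freq}, density {d:.3f}")
```

In this illustration the third interval has the largest frequency, but the second has the highest density, because the third interval is much wider.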

In practice, an interval series can be formed using the layout of Table 5.3.

Table 5.3. The procedure for forming an interval series of settlements in the Krasnopolsky district according to the density of radioactive contamination with cesium-137

The main advantage of an interval series is its maximum compactness. At the same time, in an interval distribution series the individual variants of the characteristic are hidden within the corresponding intervals.

When an interval series is plotted in rectangular coordinates, the upper boundaries of the intervals are plotted on the abscissa axis and the local frequencies of the series on the ordinate axis. The graph of an interval series differs from a distribution polygon in that each interval has a lower and an upper boundary, so two abscissas correspond to one ordinate value. Therefore, the graph of an interval series shows not a point, as in a polygon, but a line connecting two points. These horizontal lines are joined by vertical lines, producing a stepped figure that is commonly called the distribution histogram (Fig. 5.3).

When an interval series is plotted for a sufficiently large statistical population, the histogram approaches a symmetrical form of distribution. When the statistical population is small, the histogram is, as a rule, asymmetrical.

In some cases it is advisable to form a series of accumulated frequencies, i.e., a cumulative series. A cumulative series can be formed from a discrete or an interval distribution series. When a cumulative series is plotted in rectangular coordinates, the variants are plotted on the abscissa axis and the accumulated frequencies (or relative frequencies) on the ordinate axis. The resulting curve is usually called the cumulate of the distribution (Fig. 5.4).
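A cumulative series is a running total of the frequencies. The sketch below uses hypothetical frequencies and the standard library function `itertools.accumulate`.

```python
from itertools import accumulate

frequencies = [4, 9, 15, 8, 4]               # hypothetical interval frequencies
cumulative = list(accumulate(frequencies))   # accumulated frequencies
n = sum(frequencies)
cumulative_relative = [c / n for c in cumulative]   # accumulated relative frequencies

print(cumulative)            # [4, 13, 28, 36, 40]
print(cumulative_relative)
```

The last accumulated frequency equals the size of the population, and the last accumulated relative frequency equals 1.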

The formation and graphical representation of various types of variation series simplifies the calculation of the main statistical characteristics, which are discussed in detail in topic 6, and helps to better understand the distribution laws of the statistical population. Analysis of a variation series is particularly important when it is necessary to identify and trace the relationship between variants and frequencies (or relative frequencies). This dependence shows itself in the fact that the number of cases per variant is related in a certain way to the size of that variant: as the values of the varying characteristic increase, the frequencies of these values change systematically. In other words, the numbers in the frequency column do not fluctuate chaotically but change in a certain direction, order, and sequence.

If the frequencies change systematically, we are on the way to identifying a pattern. The system, order, and sequence in the changes of the frequencies reflect general causes and general conditions characteristic of the entire population.

It should not be assumed that the distribution pattern is always given in ready-made form. There are quite a few variation series in which the frequencies jump erratically, now increasing, now decreasing. In such cases it is advisable to find out what kind of distribution the researcher is dealing with: either the distribution has no inherent pattern at all, or its nature has not yet been revealed. The first case is rare; the second is quite common.

For example, when an interval series is formed, the total number of statistical units may be small, so that each interval contains only a few variants (for example, 1-3 units). In such cases one cannot expect any pattern to appear. For a regular result to emerge from random observations, the law of large numbers must come into play, i.e., each interval must contain not a few but tens or hundreds of statistical units. To this end, the number of observations should be increased as much as possible; this is the surest way to detect patterns in mass processes. If the number of observations cannot be increased, a pattern can be brought out by reducing the number of intervals in the distribution series, which increases the frequency in each interval. The random fluctuations of the individual statistical units are then superimposed on one another and "smoothed out", turning into a pattern.

The formation and construction of variation series gives only a general, approximate picture of the distribution of the statistical population. For example, a histogram expresses the relationship between the values of a characteristic and its frequencies (or relative frequencies) only in rough form. Therefore, variation series are essentially only the basis for a further, in-depth study of the internal regularity of the statistical distribution.

TEST QUESTIONS FOR TOPIC 5

1. What is variation? What causes variation in a trait in a statistical population?

2. What types of varying characteristics can occur in statistics?

3. What is a variation series? What types of variation series can there be?

4. What is a ranked series? What are its advantages and disadvantages?

5. What is a discrete series and what are its advantages and disadvantages?

6. What is the procedure for forming an interval series, what are its advantages and disadvantages?

7. What is a graphical representation of ranked, discrete, interval distribution series?

8. What is the cumulate of distribution and what does it characterize?

Mathematical statistics - a branch of mathematics devoted to mathematical methods for processing, systematizing, and using statistical data to draw scientific and practical conclusions.

3.1. BASIC CONCEPTS OF MATHEMATICAL STATISTICS

In medical and biological problems, it is often necessary to study the distribution of a particular characteristic over a very large number of individuals. The characteristic takes different values for different individuals, so it is a random variable. For example, any therapeutic drug has a different effectiveness in different patients. However, to get an idea of the effectiveness of a drug there is no need to administer it to every patient. It is enough to follow the results of its use in a relatively small group of patients and, based on the data obtained, identify the essential features (efficacy, contraindications) of the treatment process.

General population - a set of homogeneous elements characterized by some attribute to be studied. This attribute is a continuous random variable with distribution density f(x).

For example, if we are interested in the prevalence of a disease in a certain region, then the general population is the entire population of the region. If we want to find out the susceptibility of men and women to this disease separately, then we should consider two general populations.

To study the properties of a general population, a certain part of its elements is selected.

Sample - the part of the general population selected for examination (treatment).

When it does not cause confusion, the term sample is used both for the set of objects selected for examination and for the set of values of the studied characteristic obtained during the examination. These values can be represented in several ways.

Simple statistical series - the values of the characteristic being studied, recorded in the order in which they were obtained.

An example of a simple statistical series obtained by measuring the surface wave velocity (m/s) in the skin of the forehead in 20 patients is given in Table 3.1.

Table 3.1. Simple statistical series

A simple statistical series is the primary and most complete way of recording survey results. It can contain hundreds of elements, and such a set is very difficult to take in at a glance. Therefore, large samples are usually divided into groups. To do this, the range of the characteristic is divided into several (N) intervals of equal width, and the relative frequencies ($n_i/n$) with which the characteristic falls into these intervals are calculated. The width of each interval is:

$d = \dfrac{x_{\max} - x_{\min}}{N}$.

The interval boundaries have the following values:

$x_{\min},\ x_{\min} + d,\ x_{\min} + 2d,\ \ldots,\ x_{\min} + Nd = x_{\max}$.

If a sample element lies on the boundary between two adjacent intervals, it is assigned to the left interval. Data grouped in this way are called an interval statistical series.

An interval statistical series is a table that shows the intervals of values of the characteristic and the relative frequencies with which the characteristic falls within these intervals.

In our case, we can form, for example, the following interval statistical series (N = 5, d = 4), Table 3.2.

Table 3.2. Interval statistical series

Here, the interval 28-32 includes two values equal to 28 (Table 3.1), and the interval 32-36 includes values 32, 33, 34 and 35.

An interval statistical series can be depicted graphically. To do this, the intervals of values of the characteristic are plotted along the abscissa axis, and on each of them, as on a base, a rectangle is built with a height equal to the relative frequency. The resulting bar chart is called a histogram.

Fig. 3.1. Histogram

The statistical patterns of the distribution of the characteristic are clearly visible in the histogram.

With a large sample size (several thousand) and narrow columns, the shape of the histogram is close to the graph of the distribution density of the characteristic.

The number of histogram columns can be selected using the following formula:

Constructing a histogram manually is a long process. Therefore, computer programs have been developed to automatically construct them.

3.2. NUMERIC CHARACTERISTICS OF STATISTICAL SERIES

Many statistical procedures use sample estimates of the population expectation and variance (or standard deviation).

Sample mean ($\bar{X}$) is the arithmetic mean of all elements of a simple statistical series:

$\bar{X} = \dfrac{1}{n} \sum_{i=1}^{n} x_i$.

For our example, $\bar{X} = 37.05$ (m/s).

The sample mean is the best estimate of the general mean $M$.

Sample variance $s^2$ is equal to the sum of the squared deviations of the elements from the sample mean, divided by $n - 1$:

$s^2 = \dfrac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{X})^2$.

In our example, $s^2 = 25.2$ (m/s)$^2$.

Note that when the sample variance is calculated, the denominator of the formula is not the sample size $n$ but $n - 1$. This is because, when calculating the deviations in formula (3.3), the unknown mathematical expectation is replaced by its estimate, the sample mean.

The sample variance is the best estimate of the general variance ($\sigma^2$).

Sample standard deviation ($s$) is the square root of the sample variance:

$s = \sqrt{s^2}$.

For our example, $s = 5.02$ (m/s).

The sample standard deviation is the best estimate of the general standard deviation ($\sigma$).

With an unlimited increase in sample size, all sample characteristics tend to the corresponding characteristics of the general population.

Sample characteristics are usually calculated by computer. In Excel, these calculations are performed by the statistical functions AVERAGE, VAR, and STDEV.
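As an alternative to a spreadsheet, the same sample characteristics can be computed with Python's standard `statistics` module, whose `variance` and `stdev` functions use the n - 1 denominator. The data below are hypothetical, since Table 3.1 is not reproduced here.

```python
import math
import statistics

sample = [30, 32, 35, 36, 38, 40, 41, 44]   # hypothetical measurements

mean = statistics.mean(sample)       # sample mean
var = statistics.variance(sample)    # sample variance, denominator n - 1
std = statistics.stdev(sample)       # sample standard deviation

# The same variance computed directly from the definition.
manual_var = sum((x - mean) ** 2 for x in sample) / (len(sample) - 1)

print(mean, var, std)
```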

3.3. INTERVAL ASSESSMENT

All sample characteristics are random variables. This means that for another sample of the same size the values of the sample characteristics will be different. Thus, sample characteristics are only estimates of the corresponding characteristics of the population.

The shortcomings of a sample (point) estimate are compensated for by interval estimation, which gives a numerical interval within which the true value of the estimated parameter lies with a given probability $P_d$.

Let $U_r$ be some parameter of the general population (general mean, general variance, etc.).

An interval estimate of the parameter $U_r$ is an interval $(U_1, U_2)$ satisfying the condition:

$P(U_1 < U_r < U_2) = P_d$. (3.5)

The probability $P_d$ is called the confidence probability.

Confidence probability $P_d$ - the probability that the true value of the estimated quantity lies inside the specified interval.

The interval $(U_1, U_2)$ is then called the confidence interval for the parameter being estimated.

Often, instead of the confidence probability, the related quantity $\alpha = 1 - P_d$ is used, which is called the significance level.

Significance level - the probability that the true value of the estimated parameter lies outside the confidence interval.

Sometimes $\alpha$ and $P_d$ are expressed as percentages, for example 5% instead of 0.05 and 95% instead of 0.95.

In interval estimation, a suitable confidence probability is chosen first (usually 0.95 or 0.99), and then the corresponding range of values of the estimated parameter is found.

Let us note some general properties of interval estimates.

1. The lower the significance level (the higher $P_d$), the wider the interval estimate. For example, if at a significance level of 0.05 the interval estimate of the general mean is $34.7 < M < 39.4$, then at the 0.01 level it is much wider: $33.85 < M < 40.25$.

2. The larger the sample size $n$, the narrower the interval estimate at the chosen significance level. Let, for example, $34.7 < M < 39.4$ be the 5% estimate of the general mean ($\alpha = 0.05$) obtained from a sample of 20 elements.

By increasing the sample size to 80, we get a more accurate estimate at the same significance level: $35.5 < M < 38.6$.

In general, constructing reliable confidence estimates requires knowing the law according to which the estimated random characteristic is distributed in the population. Let us look at how an interval estimate is constructed for the general mean of a characteristic that is distributed in the population according to the normal law.

3.4. INTERVAL ESTIMATION OF THE GENERAL AVERAGE FOR THE NORMAL DISTRIBUTION LAW

The construction of an interval estimate of the general mean $M$ for a population with a normal distribution law is based on the following property. For a sample of size $n$, the ratio

$\dfrac{\bar{X} - M}{s / \sqrt{n}}$

obeys the Student distribution with $\nu = n - 1$ degrees of freedom.

Here $\bar{X}$ is the sample mean and $s$ is the sample standard deviation.

Using Student distribution tables or their computer equivalent, one can find a boundary value $t$ such that, with a given confidence probability, the following inequality holds:

$\left| \dfrac{\bar{X} - M}{s / \sqrt{n}} \right| < t$.

This inequality corresponds to the following inequality for $M$:

$\bar{X} - \varepsilon < M < \bar{X} + \varepsilon$,

where $\varepsilon = t s / \sqrt{n}$ is the half-width of the confidence interval.

Thus, a confidence interval for $M$ is constructed in the following sequence.

1. Choose a confidence probability $P_d$ (usually 0.95 or 0.99) and, using the Student distribution table, find the corresponding parameter $t$.

2. Calculate the half-width of the confidence interval $\varepsilon$:

$\varepsilon = \dfrac{t s}{\sqrt{n}}$. (3.6)

3. Obtain the interval estimate of the general mean with the chosen confidence probability:

$\bar{X} - \varepsilon < M < \bar{X} + \varepsilon$.

Briefly, this is written as:

$M = \bar{X} \pm \varepsilon$, $P_d$.

Computer procedures have been developed to find interval estimates.

Let us explain how to use the Student distribution table. The table has two inputs: the left column gives the number of degrees of freedom $\nu = n - 1$, and the top row gives the significance level $\alpha$. The Student coefficient $t$ is found at the intersection of the corresponding row and column.

Let's apply this method to our sample. A fragment of the Student distribution table is presented below.

Table 3.3. Fragment of the Student distribution table

A simple statistical series for a sample of 20 people ($n = 20$, $\nu = 19$) is presented in Table 3.1. For this series, calculations using formulas (3.1-3.3) give: $\bar{X} = 37.05$; $s = 5.02$.

Let us choose $\alpha = 0.05$ ($P_d = 0.95$). At the intersection of row "19" and column "0.05" we find $t = 2.09$.

Let us calculate the accuracy of the estimate using formula (3.6): $\varepsilon = 2.09 \cdot 5.02 / \sqrt{20} \approx 2.34$.

Let us construct the interval estimate: with a probability of 95%, the unknown general mean satisfies the inequality

$37.05 - 2.34 < M < 37.05 + 2.34$, or $M = 37.05 \pm 2.34$ (m/s), $P_d = 0.95$.
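The calculation in this example can be reproduced with a short sketch. It assumes that the half-width is the Student coefficient times s divided by the square root of n, which is how the numbers in the example combine; the function name `confidence_interval` is illustrative, and the coefficient t = 2.09 is taken from the text rather than computed.

```python
import math

def confidence_interval(mean, s, n, t):
    """Interval estimate of the general mean: mean ± t * s / sqrt(n)."""
    eps = t * s / math.sqrt(n)       # half-width of the confidence interval
    return mean - eps, mean + eps, eps

# Values from the example: n = 20, sample mean 37.05, s = 5.02, t = 2.09.
low, high, eps = confidence_interval(37.05, 5.02, 20, 2.09)
print(f"eps = {eps:.3f}, {low:.1f} < M < {high:.1f}")
```

The computed half-width is about 2.35, close to the 2.34 quoted in the text, and the bounds round to 34.7 and 39.4.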

3.5. METHODS FOR TESTING STATISTICAL HYPOTHESES

Statistical hypotheses

Before formulating what a statistical hypothesis is, consider the following example.

To compare two methods of treating a certain disease, two groups of 20 patients each were selected and treated using these methods. For each patient, the number of procedures after which a positive effect was achieved was recorded. From these data, the sample mean ($\bar{X}$), sample variance ($s^2$), and sample standard deviation ($s$) were found for each group.

The results are presented in table. 3.4.

Table 3.4

The number of procedures required to obtain a positive effect is a random variable, all information about which is currently contained in the given sample.

Table 3.4 shows that the sample mean in the first group is smaller than in the second. Does this mean that the same relationship holds for the general means: $M_1 < M_2$? Are the statistical data sufficient for such a conclusion? These questions are answered by the statistical testing of hypotheses.

Statistical hypothesis - an assumption about the properties of populations.

We will consider hypotheses about the properties of two general populations.

If the populations have a known, identical distribution of the quantity being estimated, and the assumptions concern the values of some parameter of this distribution, the hypotheses are called parametric. For example, the samples are drawn from populations with a normal distribution law and equal variances, and it is required to find out whether the general means of these populations are the same.

If nothing is known about the distribution laws of the general populations, hypotheses about their properties are called nonparametric. For example: are the distribution laws of the general populations from which the samples are drawn the same?

Null and alternative hypotheses.

The task of testing hypotheses. Significance level

Let's get acquainted with the terminology used when testing hypotheses.

$H_0$ - the null hypothesis (the skeptic's hypothesis) is the hypothesis that there are no differences between the compared samples. The skeptic believes that the differences between the sample estimates obtained from the research results are random;

$H_1$ - the alternative hypothesis (the optimist's hypothesis) is the hypothesis that there are differences between the compared samples. The optimist believes that the differences between the sample estimates have objective causes and correspond to differences between the general populations.

Statistical hypotheses can be tested only when it is possible to construct some quantity (a criterion) whose distribution law is known when $H_0$ is true. For this quantity we can then specify a confidence interval into which its value falls with a given probability $P_d$. This interval is called the critical region. If the criterion value falls into the critical region, hypothesis $H_0$ is accepted. Otherwise, hypothesis $H_1$ is accepted. Note that this text uses "critical region" for the interval in which $H_0$ is accepted; many textbooks reserve the term for the complementary rejection region.

In medical research, $P_d = 0.95$ or $P_d = 0.99$ is used. These values correspond to the significance levels $\alpha = 0.05$ and $\alpha = 0.01$.

When testing statistical hypotheses, the significance level ($\alpha$) is the probability of rejecting the null hypothesis when it is true.

Note that, at its core, the hypothesis testing procedure is aimed at detecting differences, not at confirming their absence. When the criterion value goes beyond the critical region, we can say to the "skeptic" with a clear conscience: what more do you want? If there were no differences, then with a probability of 95% (or 99%) the calculated value would lie within the specified limits. But it does not.

If, on the other hand, the criterion value falls into the critical region, there is no reason to consider hypothesis $H_0$ incorrect. This most likely points to one of two possible reasons.

1. Sample sizes are not large enough to detect differences. It is likely that continued experimentation will bring success.

2. There are differences. But they are so small that they have no practical significance. In this case, continuing the experiments does not make sense.

Let's move on to consider some statistical hypotheses used in medical research.

3.6. TESTING HYPOTHESES ABOUT EQUALITY OF VARIANCES, FISHER'S F-TEST

In some clinical studies, a positive effect is indicated not so much by the magnitude of the parameter under study as by its stabilization, i.e., a reduction in its fluctuations. In this case, the question arises of comparing two general variances based on the results of a sample survey. This problem can be solved using Fisher's test.

Formulation of the problem

Two samples have been drawn from general populations with a normal distribution law. The sample sizes are $n_1$ and $n_2$, and the sample variances are $s_1^2$ and $s_2^2$. It is required to compare the general variances.

Testable hypotheses:

$H_0$ - the general variances are the same;

$H_1$ - the general variances are different.

It has been shown that if the samples are drawn from populations with a normal distribution law, then, when hypothesis $H_0$ is true, the ratio of the sample variances follows the Fisher distribution. Therefore, the quantity $F$ is taken as the criterion for testing $H_0$, calculated by the formula:

$F = \dfrac{s_1^2}{s_2^2}$,

where $s_1^2$ and $s_2^2$ are the sample variances.

This ratio follows the Fisher distribution with ν₁ = n₁ − 1 degrees of freedom for the numerator and ν₂ = n₂ − 1 degrees of freedom for the denominator. The boundaries of the critical region are found from tables of the Fisher distribution or with the spreadsheet function FРАСПОБР (FINV in English-language versions of Excel).

For the example presented in Table 3.4, we get: ν₁ = ν₂ = 20 − 1 = 19; F = 2.16/4.05 = 0.53. At α = 0.05, the left and right boundaries of the critical region are 0.40 and 2.53, respectively.

The criterion value lies between these boundaries, so the hypothesis H₀ is accepted: the population variances are the same.
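The arithmetic of this example can be checked with a minimal sketch. The function names `variance_ratio` and `h0_accepted` are illustrative and do not come from the text; the critical boundaries are passed in as the tabulated values quoted above rather than computed.

```python
def variance_ratio(s1_sq: float, s2_sq: float) -> float:
    """Return the criterion F as the ratio of two sample variances."""
    if s1_sq <= 0 or s2_sq <= 0:
        raise ValueError("sample variances must be positive")
    return s1_sq / s2_sq


def h0_accepted(f_value: float, left: float, right: float) -> bool:
    """True when F lies between the left and right critical boundaries."""
    return left <= f_value <= right


# Values quoted in the text for the Table 3.4 example (alpha = 0.05).
f_value = variance_ratio(2.16, 4.05)
accepted = h0_accepted(f_value, left=0.40, right=2.53)
print(f"F = {f_value:.2f}, H0 accepted: {accepted}")
```

Taking the boundaries from a table keeps the sketch free of third-party dependencies; a statistics library could supply them instead.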

3.7. TESTING HYPOTHESES REGARDING EQUALITY OF MEANS, STUDENT'S t-CRITERION

The task of comparing the means of two populations arises when it is the magnitude of the characteristic being studied that matters, for example, when comparing the duration of treatment with two different methods or the number of complications arising from their use. In this case, Student's t-test can be used.

Formulation of the problem

Two samples (X₁) and (X₂) were obtained, drawn from populations with normal distributions and identical variances. The sample sizes are n₁ and n₂, the sample means are X̄₁ and X̄₂, and the sample variances are s₁² and s₂², respectively. The population means need to be compared.

Testable hypotheses:

H₀: the population means are the same;

H₁: the population means are different.

It has been shown that if the hypothesis H₀ is true, the value t calculated by the formula

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\nu_1 s_1^2 + \nu_2 s_2^2}{\nu_1 + \nu_2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \qquad (3.10)$$

follows Student's distribution with ν = ν₁ + ν₂ = n₁ + n₂ − 2 degrees of freedom.

Here ν₁ = n₁ − 1 is the number of degrees of freedom for the first sample, and ν₂ = n₂ − 1 is the number of degrees of freedom for the second sample.

The boundaries of the critical region are found from t-distribution tables or with the spreadsheet function СТЬЮДРАСПОБР (TINV in English-language versions of Excel). The Student distribution is symmetric about zero, so the left and right boundaries of the critical region are equal in magnitude and opposite in sign.

For the example presented in Table 3.4, we get:

ν₁ = ν₂ = 20 − 1 = 19; ν = 38; t = −2.51. At α = 0.05, the boundaries of the critical region are −2.02 and 2.02.

The criterion value lies beyond the left boundary of the critical region (−2.51 < −2.02), so we accept the hypothesis H₁: the population means are different. Moreover, the population mean for the first sample is the smaller one.
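As a rough illustration, the sketch below computes a two-sample t statistic under the equal-variance assumption stated above, using the standard pooled-variance form. The sample variances and sizes are those quoted in the text. The two sample means are HYPOTHETICAL placeholders, because Table 3.4 is not reproduced here; they were chosen only so that the result lands near the quoted t = −2.51. The function name `pooled_t` is likewise an assumption.

```python
import math


def pooled_t(mean1, mean2, var1, var2, n1, n2):
    """Two-sample Student t statistic assuming equal population variances.

    Returns the statistic and the total degrees of freedom nu1 + nu2.
    """
    nu1, nu2 = n1 - 1, n2 - 1
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = (nu1 * var1 + nu2 * var2) / (nu1 + nu2)
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error, nu1 + nu2


# Variances (2.16, 4.05) and sizes (20, 20) come from the text;
# the means 10.0 and 11.4 are hypothetical stand-ins for Table 3.4.
t_value, dof = pooled_t(10.0, 11.4, 2.16, 4.05, 20, 20)
critical = 2.02  # tabulated boundary for alpha = 0.05 and 38 degrees of freedom
reject_h0 = abs(t_value) > critical
print(f"t = {t_value:.2f}, degrees of freedom = {dof}, reject H0: {reject_h0}")
```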

Applicability of Student's t-test

Student's t-test applies only to samples from normal populations with the same population variances. If at least one of these conditions is violated, the applicability of the criterion is doubtful. The requirement of normality of the population is usually ignored, with reference to the central limit theorem. Indeed, the difference between the sample means in the numerator of (3.10) can be considered normally distributed for ν > 30. Equality of the variances, however, cannot be verified, and the fact that Fisher's test did not detect differences cannot be taken as proof of equality. Nevertheless, the t-test is widely used to detect differences in population means, although without sufficient justification.

A nonparametric criterion is discussed below; it is successfully used for the same purposes and requires neither normality nor equality of variances.

3.8. NONPARAMETRIC COMPARISON OF TWO SAMPLES: MANN-WHITNEY CRITERION

Nonparametric tests are designed to detect differences in the distribution laws of two populations. Criteria that are sensitive to differences in population means are called shift criteria. Criteria that are sensitive to differences in population variances are called scale criteria. The Mann-Whitney test is a shift criterion and is used to detect differences in the means of two populations whose samples are presented on a rank scale. The measured values are arranged on this scale in ascending order and then numbered with the integers 1, 2, ... These numbers are called ranks. Equal values are assigned equal ranks. What matters is not the value of the attribute itself, but only the ordinal place it occupies among the other values.

In Table 3.5, the first group from Table 3.4 is presented in expanded form (row 1) and ranked (row 2), and then the ranks of identical values are replaced by their arithmetic means. For example, the two items equal to 4 in the first row were given ranks 2 and 3, which were then both replaced with the value 2.5.
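The ranking rule described above, with tied values sharing the arithmetic mean of their ranks, might be sketched as follows. The helper name `average_ranks` and the short input list are hypothetical; the list is not the data of Table 3.5, only a small case that reproduces the "two values of 4 receive rank 2.5" situation.

```python
def average_ranks(values):
    """Rank values in ascending order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend `end` over the run of values equal to the one at `pos`.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = ((pos + 1) + (end + 1)) / 2  # ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


sample = [3, 4, 4, 6]  # hypothetical values, not taken from Table 3.5
ranks = average_ranks(sample)
print(ranks)
```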

Table 3.5

Formulation of the problem

Independent samples (X₁) and (X₂) are drawn from populations with unknown distribution laws. The sample sizes are n₁ and n₂, respectively. The values of the sample elements are presented on a rank scale. It is necessary to check whether these populations differ from each other.

Testable hypotheses:

H₀: the samples belong to the same population;

H₁: the samples belong to different populations.

To test such hypotheses, the Mann-Whitney U-test is used.

First, a combined sample (X) is compiled from the two samples, and its elements are ranked. Then the sum of the ranks corresponding to the elements of the first sample is found. This sum is the criterion for testing the hypotheses.

U = sum of the ranks of the first sample. (3.11)

For independent samples whose sizes are greater than 20, the value U follows the normal distribution, with mathematical expectation and standard deviation equal to:

$$\mu = \frac{n_1 (n_1 + n_2 + 1)}{2}, \qquad \sigma = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}.$$

Therefore, the boundaries of the critical region are found from normal distribution tables.

For the example presented in Table 3.4, we get: n₁ = n₂ = 20, U = 339, μ = 410, σ = 37. For α = 0.05, the left boundary is 338 and the right boundary is 482.

The criterion value lies beyond the left boundary of the critical region, so the hypothesis H₁ is accepted: the populations have different distribution laws. Moreover, the population mean for the first sample is the smaller one.
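The quoted values μ = 410, σ = 37 and the boundaries 338 and 482 can be reproduced with the sketch below. It assumes the standard mean and standard deviation of a rank sum under H₀ and takes the normal quantile from Python's standard library. The function name `rank_sum_normal_approx` is illustrative.

```python
import math
from statistics import NormalDist


def rank_sum_normal_approx(n1, n2, alpha=0.05):
    """Mean, standard deviation and two-sided bounds for the rank sum of sample 1."""
    total = n1 + n2
    mu = n1 * (total + 1) / 2
    sigma = math.sqrt(n1 * n2 * (total + 1) / 12)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return mu, sigma, mu - z * sigma, mu + z * sigma


# Sample sizes from the Table 3.4 example.
mu, sigma, left, right = rank_sum_normal_approx(20, 20)
print(f"mu = {mu:.0f}, sigma = {sigma:.0f}, left = {left:.0f}, right = {right:.0f}")
```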

When constructing an interval distribution series, three questions are resolved:

1. How many intervals should be taken?
2. What should the length of the intervals be?
3. What is the procedure for including population units within the interval boundaries?

1. The number of intervals can be determined by the Sturges formula:

$$k = 1 + 3.322 \lg n, \qquad (2.5)$$

where n is the number of units in the population.

2. The interval length, or interval step, is usually determined by the formula

$$h = \frac{R}{k},$$

where R is the range of variation.

3. The procedure for including population units within the interval boundaries may vary, but for a given interval distribution series it must be strictly defined.

For example: [ ), where a unit equal to the lower boundary is included in the interval, while a unit equal to the upper boundary is not included and is transferred to the next interval. The exception to this rule is the last interval, whose upper boundary includes the last number of the ranked series.

Interval boundaries can be:

closed, with both extreme values of the attribute specified;
open, with only one extreme value of the attribute specified (up to a certain number, or above a certain number).
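A minimal sketch of the first two steps: the number of intervals from the Sturges formula and the interval width from the range of variation. Rounding the Sturges value to the nearest integer is an assumption, and the function names are illustrative. The inputs (30 units, minimum 15, maximum 45) are those of the worked task that appears later in this text.

```python
import math


def sturges_intervals(n: int) -> int:
    """Number of intervals by the Sturges formula, rounded to the nearest integer."""
    return round(1 + 3.322 * math.log10(n))


def interval_edges(x_min: float, x_max: float, k: int):
    """Return the interval width h = R / k and the k + 1 equally spaced boundaries."""
    h = (x_max - x_min) / k
    return h, [x_min + i * h for i in range(k + 1)]


k = sturges_intervals(30)
h, edges = interval_edges(15, 45, k)
print(k, h, edges)
```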

To help consolidate the theoretical material, we introduce the initial information for solving a cross-cutting task.

There are hypothetical data on the average number of sales managers, the quantity of a similar good sold by them, the individual market price of this good, and the sales volume of 30 firms in one of the regions of the Russian Federation in the first quarter of the reporting year (Table 2.1).

Table 2.1

Initial information for a cross-cutting task

Number of managers, people	Quantity of goods sold, pcs.	Price, thousand rubles	Sales volume, million rubles

Based on the initial information, as well as additional information, we will set up individual tasks. Then we will present the methodology for solving them and the solutions themselves.

Cross-cutting task. Task 2.1

Using the initial data from Table 2.1, construct a discrete distribution series of firms by quantity of goods sold (Table 2.2).

Solution:

Table 2.2

Discrete series of distribution of firms by quantity of goods sold in one of the regions of the Russian Federation in the first quarter of the reporting year

Cross-cutting task. Task 2.2

Construct a ranked series of the 30 firms by the average number of managers.

Solution:

15; 17; 18; 20; 20; 20; 22; 22; 24; 25; 25; 25; 27; 27; 27; 28; 29; 30; 32; 32; 33; 33; 33; 34; 35; 35; 38; 39; 39; 45.

Cross-cutting task. Task 2.3

Using the initial data from Table 2.1:

1. Construct an interval series of distribution of firms by number of managers.
2. Calculate the frequencies of the distribution series of firms.
3. Draw conclusions.

Solution:

Using the Sturges formula (2.5), we calculate the number of intervals for the 30 firms:

$$k = 1 + 3.322 \lg 30 = 1 + 3.322 \cdot 1.477 \approx 5.9.$$

Thus, we take 6 intervals (groups).

The interval length, or interval step, is calculated using the formula

$$h = \frac{R}{k} = \frac{45 - 15}{6} = 5.$$

Note. The procedure for including population units within the interval boundaries is as follows: [ ), where a unit equal to the lower boundary is included in the interval, while a unit equal to the upper boundary is not included and is transferred to the next interval. The exception to this rule is the last interval, [ ], whose upper boundary includes the last number of the ranked series.

We build an interval series (Table 2.3).

Interval distribution series of firms by the average number of managers in one of the regions of the Russian Federation in the first quarter of the reporting year

Conclusion. The largest group of firms is the one with an average number of managers of 25-30 people, which includes 8 firms (27%); the smallest group, with an average number of managers of 40-45 people, includes only one firm (3%).
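The group frequencies behind this conclusion can be recomputed from the ranked series of Task 2.2. The sketch assumes 6 intervals of width 5 starting at 15 and applies the inclusion rule from the note: lower boundary included, upper boundary excluded, except in the last interval. The function name `interval_series` is illustrative.

```python
ranked = [15, 17, 18, 20, 20, 20, 22, 22, 24, 25, 25, 25, 27, 27, 27,
          28, 29, 30, 32, 32, 33, 33, 33, 34, 35, 35, 38, 39, 39, 45]


def interval_series(values, lower, h, k):
    """Count values per interval [a, b); the last interval also includes its upper bound."""
    upper = lower + h * k
    counts = [0] * k
    for v in values:
        if not lower <= v <= upper:
            raise ValueError(f"value {v} lies outside [{lower}, {upper}]")
        idx = min(int((v - lower) // h), k - 1)  # the maximum goes to the last interval
        counts[idx] += 1
    return counts


counts = interval_series(ranked, lower=15, h=5, k=6)
shares = [round(100 * c / len(ranked)) for c in counts]
for i, (c, s) in enumerate(zip(counts, shares)):
    print(f"{15 + 5 * i}-{20 + 5 * i}: {c} firms ({s}%)")
```

The output agrees with the conclusion above: 8 firms (27%) in the 25-30 group and one firm (3%) in the 40-45 group.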

Using the initial data from Table 2.1, as well as the interval distribution series of firms by number of managers (Table 2.3), build an analytical grouping of the relationship between the number of managers and the sales volume of the firms and, based on it, draw a conclusion about the presence (or absence) of a relationship between these characteristics.

Solution:

An analytical grouping is built on the factor characteristic. In our problem, the factor characteristic (x) is the number of managers, and the resultant characteristic (y) is the sales volume (Table 2.4).

Let us now build the analytical grouping (Table 2.5).

Conclusion. Based on the constructed analytical grouping, we can say that as the number of sales managers increases, the average sales volume per firm in the group also increases, which indicates a direct relationship between these characteristics.
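The idea of an analytical grouping, averaging the resultant characteristic y within each interval of the factor characteristic x, can be sketched as follows. The (x, y) pairs are HYPOTHETICAL and are not the data of Table 2.1; the function name `group_means` and the choice of three intervals are also assumptions made only for illustration.

```python
def group_means(pairs, lower, h, k):
    """Average y within each of k intervals of x (last interval closed on the right)."""
    sums = [0.0] * k
    counts = [0] * k
    for x, y in pairs:
        idx = min(int((x - lower) // h), k - 1)
        sums[idx] += y
        counts[idx] += 1
    # None marks an interval that contains no units.
    return [s / c if c else None for s, c in zip(sums, counts)]


# Hypothetical (number of managers, sales volume in million rubles) pairs.
pairs = [(15, 9.0), (18, 9.4), (22, 10.0), (24, 10.2), (27, 10.5), (29, 10.9)]
means = group_means(pairs, lower=15, h=5, k=3)
print([round(m, 2) for m in means])
```

If the group averages rise from one interval to the next, as they do for these illustrative numbers, the grouping suggests a direct relationship between x and y.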

Table 2.4

Auxiliary table for constructing an analytical grouping

Number of managers, people, x	Company number	Sales volume, million rubles, y

[Table body lost in extraction. Surviving average sales volumes: 9.97, 10.22 and 10.61 for individual groups, and 10.31 million rubles overall for the 30 firms.]

Table 2.5

Dependence of sales volumes on the number of company managers in one of the regions of the Russian Federation in the first quarter of the reporting year

CONTROL QUESTIONS

1. What is the essence of statistical observation?
2. Name the stages of statistical observation.
3. What are the organizational forms of statistical observation?
4. Name the types of statistical observation.
5. What is a statistical summary?
6. Name the types of statistical summaries.
7. What is statistical grouping?
8. Name the types of statistical groupings.
9. What is a distribution series?
10. Name the structural elements of a distribution series.
11. What is the procedure for constructing a distribution series?