Residual variance in econometrics: solution and analysis. Assessing model accuracy (approximation error)

In statistics, variance (dispersion) is the mean of the squared deviations of the individual values of a characteristic from their mean. Depending on the initial data, it is calculated using the simple or the weighted variance formula:

1. Simple variance (for ungrouped data) is calculated using the formula:

σ² = Σ(x - x̄)² / n

2. Weighted variance (for a variation series):

σ² = Σ(x - x̄)²·n / Σn

where n is the frequency (the number of times the value of x is repeated).
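Both formulas are easy to verify numerically. A minimal sketch in Python, assuming NumPy is available; the sample values and frequencies below are hypothetical:

```python
import numpy as np

# Simple variance for ungrouped data: mean of squared deviations from the mean.
x = np.array([3.0, 7.0, 8.0, 10.0, 12.0])   # hypothetical raw observations
simple_var = np.mean((x - x.mean()) ** 2)    # sigma^2 = sum((x - mean)^2) / n

# Weighted variance for a variation series: deviations weighted by frequency.
values = np.array([3.0, 7.0, 8.0, 10.0, 12.0])  # distinct values of the characteristic
freq = np.array([2, 5, 8, 4, 1])                # n = frequency of each value
mean_w = np.average(values, weights=freq)
weighted_var = np.average((values - mean_w) ** 2, weights=freq)

print(simple_var, weighted_var)
```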

An example of finding variance


Example 1. The following data are available for a group of 20 correspondence-department (part-time) students. It is required to construct an interval distribution series of the characteristic, calculate the mean value of the characteristic, and study its dispersion.

Let's build an interval grouping. We determine the width of the interval using the formula:

h = (X max - X min) / n,

where X max is the maximum value of the grouping characteristic;
X min – minimum value of the grouping characteristic;
n – the number of intervals.

We take n = 5. The step is h = (192 - 159)/5 = 6.6.

Let's create an interval grouping

For further calculations, we will build an auxiliary table:

X'ᵢ is the midpoint of the interval (for example, the midpoint of the interval 159–165.6 is 162.3).

We determine the average height of the students using the weighted arithmetic mean formula:

x̄ = Σ X'ᵢ·fᵢ / Σfᵢ, where fᵢ is the frequency of interval i.

Let's determine the variance using the formula:

σ² = Σ(X'ᵢ - x̄)²·fᵢ / Σfᵢ

The variance formula can be transformed as follows:

σ² = Σx²·f / Σf - x̄²

From this formula it follows that variance equals the difference between the mean of the squared values and the square of the mean.
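The whole grouped-data calculation can be reproduced programmatically. A Python sketch with hypothetical interval frequencies (the original table for the 20 students is not reproduced above); it also checks the shortcut form of the variance formula:

```python
import numpy as np

# Interval boundaries with width h = 6.6 starting at 159 (159 ... 192).
edges = 159 + 6.6 * np.arange(6)
mid = (edges[:-1] + edges[1:]) / 2       # X'_i, the interval midpoints
f = np.array([2, 5, 7, 4, 2])            # hypothetical frequencies, sum = 20

mean = np.average(mid, weights=f)                      # weighted arithmetic mean
var = np.average((mid - mean) ** 2, weights=f)         # weighted variance
# Shortcut form: mean of squares minus square of the mean.
var_shortcut = np.average(mid ** 2, weights=f) - mean ** 2
assert np.isclose(var, var_shortcut)
print(mean, var)
```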

For a variation series with equal intervals, the variance can be calculated by the method of moments, using the second property of variance (dividing all values by the interval width). Calculating the variance by the method of moments is less labor-intensive and uses the following formula:

σ² = i²·(m₂ - m₁²),

where i is the interval width;
A is the conventional zero, for which it is convenient to take the midpoint of the interval with the highest frequency;
m₁ = Σ((x - A)/i)·f / Σf is the first-order moment;
m₂ = Σ((x - A)/i)²·f / Σf is the second-order moment.
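A sketch of the method of moments in Python, reusing the hypothetical frequencies from the previous example; it confirms that the coded calculation agrees with the direct weighted variance:

```python
import numpy as np

mid = np.array([162.3, 168.9, 175.5, 182.1, 188.7])  # interval midpoints (h = 6.6)
f = np.array([2, 5, 7, 4, 2])                        # hypothetical frequencies
h = 6.6                                              # i, the interval width
A = 175.5                                            # conventional zero: midpoint of the modal interval

u = (mid - A) / h                    # coded (transformed) values
m1 = np.average(u, weights=f)        # first-order moment
m2 = np.average(u ** 2, weights=f)   # second-order moment
var = h ** 2 * (m2 - m1 ** 2)        # variance by the method of moments

# Agrees with the direct weighted-variance calculation:
mean = np.average(mid, weights=f)
assert np.isclose(var, np.average((mid - mean) ** 2, weights=f))
print(var)
```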

The variance of an alternative characteristic (if, in a statistical population, a characteristic varies in such a way that only two mutually exclusive values occur, such variability is called alternative) can be calculated using the formula:

σ² = p·q.

Substituting q = 1 - p into this variance formula, we get:

σ² = p·(1 - p).

Types of variance

Total variance measures the variation of a characteristic across the entire population under the influence of all the factors that cause this variation. It is equal to the mean square of the deviations of the individual values of the characteristic x from the overall mean x̄ and can be calculated as a simple or a weighted variance.

Within-group variance characterizes random variation, i.e., the part of the variation caused by the influence of unaccounted-for factors and not depending on the factor attribute underlying the grouping. It is equal to the mean square of the deviations of the individual values of the attribute within group X from the group's arithmetic mean and can be calculated as a simple or a weighted variance.

Thus, within-group variance measures the variation of a trait within a group and is determined by the formula:

σᵢ² = Σ(x - x̄ᵢ)² / nᵢ,

where x̄ᵢ is the group mean;
nᵢ is the number of units in the group.

For example, the within-group variances that need to be determined when studying the influence of workers' qualifications on the level of labor productivity in a workshop show the variation in output within each group caused by all possible factors (technical condition of equipment, availability of tools and materials, age of workers, labor intensity, etc.), except for differences in qualification category (within a group, all workers have the same qualifications).

The average of the within-group variances reflects random variation, i.e., the part of the variation that arose under the influence of all factors other than the grouping factor. It is calculated using the formula:

σ̄² = Σσᵢ²·nᵢ / Σnᵢ

Between-group (intergroup) variance characterizes the systematic variation of the resulting characteristic, which is due to the influence of the factor attribute underlying the grouping. It is equal to the mean square of the deviations of the group means from the overall mean. Between-group variance is calculated using the formula:

δ² = Σ(x̄ᵢ - x̄)²·nᵢ / Σnᵢ

The rule for adding variances in statistics

According to the rule for adding variances, the total variance equals the sum of the average within-group variance and the between-group variance:

σ² = σ̄² + δ²

The meaning of this rule is that the total variance, which arises under the influence of all factors, equals the sum of the variance arising under the influence of all factors other than the grouping factor and the variance arising due to the grouping factor.

Using the formula for adding variances, you can determine the third unknown variance from two known variances, and also judge the strength of the influence of the grouping characteristic.
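The addition rule is easy to verify numerically. A Python sketch with hypothetical output figures for two qualification groups:

```python
import numpy as np

# Hypothetical output figures for workers in two qualification groups.
groups = [np.array([10.0, 12.0, 11.0]), np.array([15.0, 14.0, 16.0, 17.0])]
all_values = np.concatenate(groups)
n = np.array([len(g) for g in groups])

total_var = np.mean((all_values - all_values.mean()) ** 2)        # total variance
within = np.array([np.mean((g - g.mean()) ** 2) for g in groups]) # within-group variances
avg_within = np.average(within, weights=n)                        # average within-group variance
group_means = np.array([g.mean() for g in groups])
between = np.average((group_means - all_values.mean()) ** 2, weights=n)  # between-group variance

assert np.isclose(total_var, avg_within + between)  # the rule for adding variances
print(total_var, avg_within, between)
```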

Properties of variance

1. If all values of a characteristic are decreased (increased) by the same constant amount, the variance does not change.
2. If all values of a characteristic are divided (multiplied) by the same number n, the variance decreases (increases) by a factor of n² (both properties are verified in the sketch below).
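A couple of lines of Python with hypothetical values are enough to check both properties:

```python
import numpy as np

x = np.array([2.0, 4.0, 7.0, 9.0])
var = np.var(x)                                # population variance (divides by n)

c, k = 5.0, 3.0
assert np.isclose(np.var(x + c), var)          # property 1: shifting by a constant
assert np.isclose(np.var(k * x), k**2 * var)   # property 2: scaling by k multiplies variance by k^2
```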

1. The essence of correlation-regression analysis and its tasks.

2. Definition of regression and its types.

3. Features of the model specification. Reasons for the existence of a random variable.

4. Methods for selecting paired regression.

5. The least squares method.

6. Indicators measuring the closeness and strength of the relationship.

7. Estimates of statistical significance.

8. Predicted value of the variable y and confidence intervals of the forecast.

1. The essence of correlation-regression analysis and its tasks. Economic phenomena, being very diverse, are characterized by many features that reflect certain properties of these processes and phenomena and are subject to interdependent change. In some cases the relationship between characteristics is very close (for example, an employee's hourly output and his wage), while in other cases such a relationship is not expressed at all or is extremely weak (for example, the gender of students and their academic performance). The closer the connection between these features, the more accurate the decisions based on them.

There are two types of dependencies between phenomena and their characteristics:

    functional (deterministic, causal) dependence. It is specified in the form of a formula that associates each value of one variable with a strictly defined value of another variable (the influence of random factors is neglected). In other words, a functional dependence is a relationship in which each value of the independent variable x corresponds to a precisely defined value of the dependent variable y. In economics, functional relationships between variables are exceptions to the general rule;

    statistical (stochastic, non-deterministic) dependence is a relationship between variables that is influenced by random factors, i.e., a relationship in which each value of the independent variable x corresponds to a set of values of the dependent variable y, and it is not known in advance which value y will take.

A special case of statistical dependence is correlation dependence.

Correlation dependence is a relationship in which each value of the independent variable x corresponds to a certain mathematical expectation (average value) of the dependent variable y.

Correlation dependence is an "incomplete" dependence that does not appear in each individual case, but only on average, over a sufficiently large number of cases. For example, it is known that improving an employee's qualifications leads to an increase in labor productivity. This statement is often confirmed in practice, but it does not mean that two or more workers of the same category engaged in a similar process will have the same labor productivity.

Correlation dependence is studied using the methods of correlation and regression analysis.

Correlation and regression analysis makes it possible to establish the closeness and direction of the relationship, as well as its form, i.e., its analytical expression.

The main task of correlation analysis is to quantify the closeness of the connection between two characteristics (in a pairwise relationship) or between a resulting characteristic and several factor characteristics (in a multifactor relationship), and to statistically assess the reliability of the established connection.

2. Definition of regression and its types. Regression analysis is the main mathematical-statistical tool of econometrics. Regression is customarily defined as the dependence of the average value of a quantity (y) on some other quantity or on several quantities (xᵢ).

Depending on the number of factors included in the regression equation, it is customary to distinguish between simple (paired) and multiple regression.

Simple (pairwise) regression is a model in which the average value of the dependent (explained) variable y is considered as a function of one independent (explanatory) variable x. Implicitly, pairwise regression is a model of the form:

ŷ = f(x).

Explicitly:

ŷ = a + b·x,

where a and b are estimates of the regression coefficients.

Multiple regression is a model in which the average value of the dependent (explained) variable y is considered as a function of several independent (explanatory) variables x₁, x₂, …, xₙ. Implicitly, multiple regression is a model of the form:

ŷ = f(x₁, x₂, …, xₙ).

Explicitly:

ŷ = a + b₁x₁ + b₂x₂ + … + bₙxₙ,

where a and b₁, b₂, …, bₙ are estimates of the regression coefficients.

An example of such a model is the dependence of an employee’s salary on his age, education, qualifications, length of service, industry, etc.

Regarding the form of dependence, there are:

      linear regression;

      nonlinear regression, which assumes the existence of nonlinear relationships between factors expressed by the corresponding nonlinear function.

Often, models that are nonlinear in appearance can be reduced to a linear form, which allows them to be classified as linear.

3. Features of the model specification. Reasons for the existence of a random variable. Any econometric study begins with model specification, i.e., with formulating the type of model based on the corresponding theory of the relationships between variables.

First of all, from the entire range of factors influencing the resulting attribute, it is necessary to single out the most significant ones. Pairwise regression is sufficient if there is a dominant factor, which is then used as the explanatory variable. A simple regression equation characterizes the relationship between two variables, which manifests itself as a regularity only on average across the set of observations. In the regression equation, the correlation relationship is represented as a functional dependence expressed by the corresponding mathematical function. In almost every individual case, the value of y consists of two terms:

y = ŷ + ε,

where y is the actual value of the resulting characteristic;

ŷ is the theoretical value of the resulting characteristic, found from the regression equation;

ε is a random value characterizing the deviation of the real value of the resulting characteristic from the theoretical value found from the regression equation.

The random value ε is also called the disturbance. It includes the influence of factors not taken into account in the model, random errors, and measurement peculiarities. The presence of a random component in the model stems from three sources:

    model specification,

    the sample nature of the source data,

    features of measuring variables.

Specification errors include not only the wrong choice of mathematical function but also the omission of a significant factor from the regression equation (using pairwise regression instead of multiple regression).

Along with specification errors, sampling errors may occur, since the researcher most often deals with sample data when establishing regular relationships between characteristics. Sampling errors also arise due to heterogeneity of the data in the original statistical population, which usually happens when studying economic processes. If the population is heterogeneous, the regression equation has no practical meaning. To obtain a good result, units with anomalous values of the characteristics under study are usually excluded from the population. Even so, the regression results represent sample characteristics.

However, the greatest danger in the practical use of regression methods is measurement errors. If specification errors can be reduced by changing the form of the model (the type of mathematical function), and sampling errors by increasing the volume of initial data, then measurement errors practically nullify all efforts to quantify the relationship between characteristics.

4. Methods for selecting paired regression. Assuming that measurement errors are minimized, the focus of econometric research is on model specification errors. In pairwise regression, the type of mathematical function can be chosen in three ways:

    graphic;

    analytical, i.e. based on the theory of the relationship being studied;

    experimental.

When studying the relationship between two characteristics, the graphical method of choosing the type of regression equation is quite intuitive. It is based on the correlation field (scatter plot).

[Figure: basic types of curves used in quantifying relationships]
The class of mathematical functions for describing the relationship between two variables is quite wide, and other types of curves are also used.

The analytical method of choosing the type of regression equation is based on studying the material nature of the connection between the characteristics under study, as well as on a visual assessment of the nature of the connection. For example, if we are talking about the Laffer curve, which shows the relationship between tax progressivity and budget revenues, then we are dealing with a parabolic curve, while in microanalysis isoquants are hyperbolas.


Let's assume that we have found these estimates and we can write the equation:

ŷ = a + bX,

where a is the regression constant, the point of intersection of the regression line with the OY axis;

b is the regression coefficient, the slope of the regression line, characterizing the ratio Δy/Δx;

ŷ is the theoretical value of the explained variable.

As is known, in pairwise regression the type of mathematical model can be chosen in three ways:

1. Graphic.

2. Analytical.

3. Experimental.

The graphical method can be used to select a function that describes the observed values. The initial data are plotted on the coordinate plane: the values of the factor characteristic are plotted on the abscissa axis, and the values of the resulting characteristic on the ordinate axis. The location of the points shows the approximate form of the relationship. As a rule, this relationship is curvilinear. If the curvature of the line is small, we can accept the hypothesis of a linear relationship.

Let us depict the consumption function as a scatter diagram. In the coordinate system, we plot income on the abscissa axis and expenditure on consumption of a conditional product on the ordinate axis. The location of the points corresponding to the pairs of values "income – consumption expenditure" shows the approximate form of the relationship (Figure 1).

Visually, according to the diagram, it is almost never possible to unambiguously name the best dependence.

Let's move on to estimating the parameters a and b of the selected function by the least squares method.

The estimation problem can be reduced to the "classical" problem of finding a minimum. The variables are now the estimates a and b of the unknown parameters of the proposed relationship between y and x. To find the smallest value of a function, we first find the first-order partial derivatives, then set each of them equal to zero and solve the resulting system of equations for the variables. In our case, the function is the sum of squared deviations S, and the variables are a and b. That is, we must find ∂S/∂a = 0 and ∂S/∂b = 0 and solve the resulting system of equations for a and b.

Let us derive the parameter estimates using the least squares method, assuming that the relationship has the form ŷ = a + bx. Then the function S looks like:

S = Σ(y - a - bx)².

Differentiating S with respect to a gives the first normal equation; differentiating with respect to b gives the second:

∂S/∂a = -2·Σ(y - a - bx) = 0,

∂S/∂b = -2·Σ(y - a - bx)·x = 0.

After appropriate transformations we get:

Σy = n·a + b·Σx,
Σxy = a·Σx + b·Σx².   (*)
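The system (*) can be solved directly in code. A minimal Python sketch with hypothetical observations (not the data of Table 1); it also cross-checks the result against the closed-form solution:

```python
import numpy as np

# Hypothetical observations; the goal is to solve system (*) for a and b.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

# Normal equations in matrix form:
#   n*a      + sum(x)*b   = sum(y)
#   sum(x)*a + sum(x^2)*b = sum(x*y)
A = np.array([[n,       x.sum()],
              [x.sum(), (x**2).sum()]])
rhs = np.array([y.sum(), (x * y).sum()])
a, b = np.linalg.solve(A, rhs)

# Cross-check against the closed-form solution:
b_check = ((x * y).mean() - x.mean() * y.mean()) / ((x**2).mean() - x.mean()**2)
a_check = y.mean() - b_check * x.mean()
assert np.isclose(a, a_check) and np.isclose(b, b_check)
print(f"yhat = {a:.3f} + {b:.3f} x")
```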

There are simplified rules for constructing the system of normal equations. Let's apply them to the linear function:

1) Multiply each term of the equation ŷ = a + bx by the coefficient of the first parameter (a), that is, by one.

2) Before each variable we put a summation sign.

3) Multiply the free term of the equation by n.

4) We obtain the first normal equation: Σy = n·a + b·Σx.

5) Multiply each term of the original equation by the coefficient of the second parameter (b), that is, by x.

6) Before each variable we put a summation sign.

7) We obtain the second normal equation: Σxy = a·Σx + b·Σx².

Using these rules, a system of normal equations is compiled for any linear function. The rules were first formulated by the English economist R. Pearl.

The parameters of the equation are calculated using the following formulas:

b = (n·Σxy - Σx·Σy) / (n·Σx² - (Σx)²),  a = ȳ - b·x̄.

Using the initial data in Table 1, let's construct the system of normal equations (*) and solve it for the unknowns a and b:


1677 = 11a + 4950b

790 400 = 4950a + 2 502 500b

Solving the system gives a = -3309, b = 7.6923.

The regression equation is:

ŷ = -3309 + 7.6923·x.

Let's compare the actual and estimated costs of consumption of product A (Table 2).

Table 2. Comparison of actual and calculated values of expenses for consumption of good A with a linear relationship:

Group number    Actual expenses (y)    Calculated expenses (ŷ)    Deviation (y - ŷ)
1               120                    -1770.54                   1890.54
2               129                    -1385.92                   1514.92
3               135                    -1001.31                   1136.31
4               140                    -616.45                    756.45
5               145                    -232.08                    377.08
6               151                    152.53                     -1.53
7               155                    537.15                     -382.15
8               160                    921.76                     -761.76
9               171                    1306.38                    -1135.38
10              182                    1690.99                    -1508.99
11              189                    2075.61                    -1886.61
Total           —                      —                          0

Let's plot the resulting function ŷ and a scatterplot using the actual values (y) and the calculated values (ŷ).

The calculated values ​​deviate from the actual ones due to the fact that the relationship between the characteristics is correlational.

The correlation coefficient is used as a measure of the closeness of the relationship:

r = b·(σx / σy).

Using the initial data from Table 1, we obtain:

σx = 158;

σy = 20.76;

r = 0.990.

The linear correlation coefficient can take any value ranging from minus 1 to plus 1. The closer the correlation coefficient in absolute value is to 1, the closer the relationship between the characteristics. The sign of the linear correlation coefficient indicates the direction of the relationship - the direct relationship corresponds to a plus sign, and the inverse relationship corresponds to a minus sign.
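A Python sketch of this calculation with hypothetical data; it confirms that r expressed through the slope, r = b·σx/σy, matches the direct definition of the correlation coefficient:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

sigma_x, sigma_y = x.std(), y.std()            # population standard deviations
b = ((x * y).mean() - x.mean() * y.mean()) / x.var()
r = b * sigma_x / sigma_y                      # r expressed through the slope

assert np.isclose(r, np.corrcoef(x, y)[0, 1])  # matches the direct definition
print(r, r**2)                                 # correlation and determination coefficients
```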

Conclusion: the relationship between the values of x and the corresponding values of y is a close, direct dependence.

In our example, d = r² = 0.9801.

This means that 98.01% of the variation in expenses on product A can be explained by changes in income.

The remaining 1.99% may result from:

1) an insufficiently well-chosen form of the relationship;

2) the influence of any other unaccounted factors on the dependent variable.

Statistical testing of hypotheses.

We put forward a null hypothesis that the regression coefficient is statistically insignificant:

H₀: b = 0.

The statistical significance of the regression coefficient is tested using Student's t-test. To do this, we first determine the residual sum of squares

s²_resid = Σ(yᵢ - ŷᵢ)²,

s²_resid = 1.3689,

and its standard deviation:

s = 0.39, se(b) = 0.018.

The actual value of Student's t-test for the regression coefficient:

t_b = b / se(b),

t_b = 427.35.

The value |t_b| > t_cr (t_cr = 2.26 at the 95% confidence level) allows us to conclude that the regression coefficient differs from zero (at the corresponding significance level) and, therefore, that x influences (is connected with) y.

Conclusion: the actual value of Student's t-test exceeds the table value, which means the null hypothesis is rejected and, with a probability of 95%, the alternative hypothesis about the statistical significance of the regression coefficient is accepted.

[b - t_cr·se(b), b + t_cr·se(b)] is the 95% confidence interval for b.

The confidence interval covers the true value of the parameter b with the given probability (in this case, 95%):

7.6516 < b < 7.7329.
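The same test can be reproduced in Python with SciPy; the data below are hypothetical, not those of Table 1:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

b = ((x * y).mean() - x.mean() * y.mean()) / x.var()
a = y.mean() - b * x.mean()
resid = y - (a + b * x)

s2 = (resid ** 2).sum() / (n - 2)                  # residual variance
se_b = np.sqrt(s2 / ((x - x.mean()) ** 2).sum())   # standard error of b
t_b = b / se_b                                     # t statistic for H0: b = 0

t_cr = stats.t.ppf(0.975, df=n - 2)                # two-sided critical value, alpha = 0.05
print(t_b, t_cr, (b - t_cr * se_b, b + t_cr * se_b))  # test and 95% CI for b
```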

Let's move on to checking the statistical significance of the correlation and determination coefficients:

r = 0.990;

d = r² = 0.9801.

We put forward a null hypothesis that the regression equation as a whole is statistically insignificant:

H₀: r² = 0.

The statistical significance of the constructed regression model as a whole is assessed using Fisher's F-test. The actual value of the F-test for a paired regression equation that is linear in its parameters is defined as:

F = s²_fact / s²_resid = [r² / (1 - r²)]·(n - 2),

where s²_fact is the variance of the theoretical values ŷ (the explained variation);

s²_resid is the residual sum of squares per degree of freedom;

r² is the coefficient of determination.

The actual value of Fisher's F-test:

F_fact = 443.26.

Conclusion: we reject the null hypothesis and, with a probability of 95%, accept the alternative hypothesis about the statistical significance of the regression equation.
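The F statistic and its critical value are easy to reproduce; a short sketch using the numbers of this example, with SciPy providing the critical value:

```python
from scipy import stats

n = 11          # number of observations, as in the example above
r2 = 0.9801     # coefficient of determination

F = r2 / (1 - r2) * (n - 2)                 # F statistic for paired linear regression
F_cr = stats.f.ppf(0.95, dfn=1, dfd=n - 2)  # critical value at alpha = 0.05
print(F, F_cr, F > F_cr)                    # F = 443.26 > F_cr, so H0 is rejected
```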

    Correlation dependence between the factor x (average per capita subsistence level per day of one able-bodied person) and the resulting characteristic y (average daily wage). Parameters of the linear regression equation, economic interpretation of the regression coefficient.

y = f(x) + E, where y_t = f(x) is the theoretical function and E = y - y_t.

y_t = a + bx is the correlation dependence of the average daily wage (y) on the average per capita subsistence level per day of one able-bodied person (x).

The least squares method gives the system of normal equations:

a + b·x̄ = ȳ,

a·x̄ + b·mean(x²) = mean(x·y),

from which

b = (mean(x·y) - x̄·ȳ) / (mean(x²) - x̄²) is the regression coefficient.

It shows by how many units the average wage (y) changes when the per capita subsistence level per day of one able-bodied person (x) increases by 1 unit.

b = 0.937837482.

This means that with an increase in the average per capita subsistence level per day of one able-bodied person (x) by 1 unit, the average daily wage will increase by an average of 0.937 units.

a = ȳ - b·x̄ = 135.4166667 - 0.937837482 × 86.75 = 54.05926511.

3) Coefficient of variation

The coefficient of variation shows what proportion of the mean value of a random variable its average spread constitutes.

v_x = σx / x̄ = 0.144982838, v_y = σy / ȳ = 0.105751299.

4) Correlation coefficient

The correlation coefficient is used to assess the closeness of the linear relationship between the average per capita subsistence level per day of one able-bodied person and the average daily wage.

r_xy = b·σx/σy = 0.823674909; since r_xy > 0, the correlation between the variables is direct.

All this shows the dependence of the average daily wage on the average per capita subsistence level per day of one able-bodied person.

5) Coefficient of determination

The coefficient of determination is used to assess the quality of the fit of linear regression equations.

The coefficient of determination characterizes the proportion of the variance of the effective attribute Y (average daily wage) explained by regression in the total variance of the effective attribute.

R²_xy = Σ(y_t - ȳ)² / Σ(y - ȳ)² = 0.678440355; since 0.5 < R² < 0.7,

This means that the strength of the connection is noticeable, close to high, and the regression equation is well chosen.

6) Assessment of model accuracy, or assessment of approximation.

Ā = (1/n)·Σ|(yᵢ - y_t)/yᵢ| × 100% is the average approximation error.

An error of less than 5-7% indicates a good fit of the model.

If the error is greater than 10%, you should consider choosing a different type of model equation.

The approximation error = 0.015379395 × 100% ≈ 1.54%, which indicates a good fit of the model to the original data.
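A sketch of the average approximation error in Python, with hypothetical actual and fitted values:

```python
import numpy as np

y = np.array([120.0, 129.0, 135.0, 140.0])    # hypothetical actual values
y_t = np.array([118.5, 131.0, 133.9, 141.2])  # hypothetical fitted values

# Average approximation error: mean absolute relative deviation, in percent.
A = np.mean(np.abs((y - y_t) / y)) * 100
print(f"{A:.2f}% -> {'good fit' if A < 7 else 'consider another model form'}")
```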

7) Analysis of variance scheme.

Σ(y - ȳ)² = Σ(y_t - ȳ)² + Σ(yᵢ - y_t)², where n is the number of observations and m is the number of parameters for the variable x.

Variance component    Sum of squares    Degrees of freedom    Dispersion per degree of freedom
Total                 Σ(y - ȳ)²         n - 1                 S²_total = Σ(y - ȳ)² / (n - 1)
Factorial             Σ(y_t - ȳ)²       m                     S²_fact = Σ(y_t - ȳ)² / m
Residual              Σ(yᵢ - y_t)²      n - m - 1             S²_resid = Σ(yᵢ - y_t)² / (n - m - 1)
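The table can be filled in programmatically. A Python sketch with hypothetical actual and fitted values (for a least-squares fit, the total sum of squares equals the sum of the factorial and residual ones):

```python
import numpy as np

y = np.array([120.0, 129.0, 135.0, 140.0, 146.0, 151.0])    # hypothetical actuals
y_t = np.array([121.0, 127.5, 134.8, 141.1, 147.0, 150.2])  # hypothetical fitted values
n, m = len(y), 1                                            # m = 1 parameter for x

ss_total = ((y - y.mean()) ** 2).sum()
ss_fact = ((y_t - y.mean()) ** 2).sum()
ss_resid = ((y - y_t) ** 2).sum()

s2_total = ss_total / (n - 1)        # dispersion per degree of freedom
s2_fact = ss_fact / m
s2_resid = ss_resid / (n - m - 1)
print(s2_total, s2_fact, s2_resid)
```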


8) Checking the adequacy of the model using Fisher's F-test (α = 0.05).

The statistical significance of the regression equation as a whole is assessed using Fisher's F-test.

H₀: the regression equation is statistically insignificant.

H₁: the regression equation is statistically significant.

F_calc is determined as the ratio of the factor and residual variances, each calculated per degree of freedom.

F_calc = S²_fact / S²_resid = ((Σ(y_t - ȳ)²)/m) / ((Σ(yᵢ - y_t)²)/(n - m - 1)) = 1669.585177 / 79.13314895 = 21.09842966.

F_table is the maximum possible value of the test statistic that could arise under the influence of random factors for the given degrees of freedom, k₁ = m and k₂ = n - m - 1, and significance level α (α = 0.05).

F_table(0.05; 1; n - 2) = F_table(0.05; 1; 10) = 4.964602701.

If F_table < F_calc, then the hypothesis H₀ about the random nature of the estimated characteristics is rejected, and their statistical significance and the reliability of the regression equation are recognized. Otherwise, H₀ is not rejected, and the regression equation is recognized as statistically insignificant and unreliable. In our case, F_table < F_calc; therefore, the statistical significance and reliability of the regression equation are recognized.
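A quick check of this F-test in Python with SciPy, using the values computed above:

```python
from scipy import stats

s2_fact, s2_resid = 1669.585177, 79.13314895   # values from the example above
n, m = 12, 1

F_calc = s2_fact / s2_resid                            # 21.098...
F_table = stats.f.ppf(1 - 0.05, dfn=m, dfd=n - m - 1)  # 4.9646...
print(F_calc > F_table)   # True: H0 is rejected, the equation is significant
```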

9) Assessment of the statistical significance of the regression and correlation coefficients using Student's t-test (α = 0.05).

Assessing the significance of the regression coefficient with Student's t-test. Let's check the statistical significance of the parameter b.

Hypothesis H₀: b = 0. t_b(calc) = |b| / m_b, where m_b = S_resid / (σx·√n), S_resid = √(S²_resid), and n is the number of observations.

m_b = √79.13314895 / (12.57726123 × √12) = 0.204174979.

t_b(calc) = 0.937837482 / 0.204174979 = 4.593302697.

t_table is the maximum possible value of the test statistic under the influence of random factors for the given degrees of freedom (k = n - 2) and significance level α (α = 0.05); t_table = 2.2281. If t(calc) > t_table, the hypothesis H₀ is rejected and the parameter of the equation is recognized as significant.

In our case, t_b(calc) > t_table; therefore, the hypothesis H₀ is rejected and the statistical significance of the parameter b is recognized.

Let's check the statistical significance of parameter a.

Hypothesis H₀: a = 0. t_a(calc) = |a| / m_a, where m_a = S_resid·√(Σx²) / (n·σx).

m_a = 17.89736655, t_a(calc) = 54.05926511 / 17.89736655 = 3.020515055.

t_a(calc) > t_table; therefore, the hypothesis H₀ is rejected and the statistical significance of the parameter a is recognized.

Assessing the significance of the correlation coefficient. Let's check the statistical significance of the correlation coefficient.

m_r = √((1 - R²) / (n - 2)) = √((1 - 0.678440355) / 10) = 0.179320842; t_r = 0.823674909 / 0.179320842 = 4.593302697.

t_r = t_b; t_r > t_table; therefore, the statistical significance of the correlation coefficient is recognized.
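All three standard errors (m_b, m_a, m_r) and the corresponding t statistics can be computed together. A Python sketch with hypothetical paired data standing in for the subsistence-level and wage series:

```python
import numpy as np

# Hypothetical paired data standing in for (subsistence level, daily wage).
x = np.array([70.0, 75.0, 80.0, 84.0, 88.0, 92.0, 95.0, 100.0])
y = np.array([110.0, 118.0, 124.0, 131.0, 135.0, 142.0, 147.0, 155.0])
n = len(x)

b = ((x * y).mean() - x.mean() * y.mean()) / x.var()
a = y.mean() - b * x.mean()
y_t = a + b * x
r = b * x.std() / y.std()

s2_resid = ((y - y_t) ** 2).sum() / (n - 2)              # residual variance per degree of freedom
m_b = np.sqrt(s2_resid) / (x.std() * np.sqrt(n))         # standard error of b
m_a = np.sqrt(s2_resid * (x**2).sum()) / (n * x.std())   # standard error of a
m_r = np.sqrt((1 - r**2) / (n - 2))                      # standard error of r

print(abs(b) / m_b, abs(a) / m_a, r / m_r)  # t statistics; for paired regression t_b equals t_r
```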
