Generalized least squares in excel. Linear pairwise regression analysis. Statement of the problem using a specific example

Method least squares(LSM) is based on minimizing the sum of squared deviations of the selected function from the data under study. In this article we will approximate the available data using a linear functiony = a x + b .

Least square method(English) Ordinary Least Squares , O.L.S.) is one of the basic methods of regression analysis in terms of estimating unknown parameters regression models according to sample data.

Let's consider approximation by functions that depend only on one variable:

Linear: y=ax+b (this article)
: y=a*Ln(x)+b
: y=a*x m
: y=a*EXP(b*x)+с
: y=ax 2 +bx+c

Note: Cases of approximation by a polynomial from the 3rd to the 6th degree are considered in this article. Approximation by a trigonometric polynomial is considered here.

Linear dependence

We are interested in the connection between 2 variables X And y. There is an assumption that y depends on X according to linear law y = ax + b. To determine the parameters of this relationship, the researcher made observations: for each value of x i, a measurement of y i was made (see example file). Accordingly, let there be 20 pairs of values (x i; y i).

Note: If the change step is X is constant, then to build scatter plots can be used, if not, then you need to use the chart type Spot .

It is obvious from the diagram that the relationship between the variables is close to linear. To understand which of the many straight lines most “correctly” describes the relationship between variables, it is necessary to determine the criterion by which the lines will be compared.

As such a criterion we use the expression:

Where ŷ i = a * x i + b ; n – number of pairs of values (in our case n=20)

The above expression is the sum of the squared distances between the observed values of y i and ŷ i and is often denoted as SSE ( Sum of Squared Errors (Residuals), sum of squared errors (residuals)) .

Least square method is to select such a line ŷ = ax + b, for which the above expression takes the minimum value.

Note: Any line in two-dimensional space is uniquely determined by the values of 2 parameters: a (slope) and b (shift).

It is believed that the smaller the sum of squared distances, the better the corresponding line approximates the available data and can be further used to predict the values of y from the x variable. It is clear that even if in reality there is no relationship between the variables or the relationship is nonlinear, then OLS will still select the “best” line. Thus, the least squares method does not say anything about the presence of a real relationship between variables; the method simply allows you to select such function parameters a And b , for which the above expression is minimal.

By performing not very complex mathematical operations (for more details, see), you can calculate the parameters a And b :

As can be seen from the formula, the parameter a represents the ratio of covariance and , therefore in MS EXCEL to calculate the parameter A You can use the following formulas (see Linear sheet example file):

= KOVAR(B26:B45;C26:C45)/ DISP.G(B26:B45) or

= COVARIANCE.B(B26:B45;C26:C45)/DISP.B(B26:B45)

Also to calculate the parameter A you can use the formula = TILT(C26:C45;B26:B45). For parameter b use the formula = LEG(C26:C45;B26:B45) .

Finally, the LINEST() function allows you to calculate both parameters at once. To enter a formula LINEST(C26:C45;B26:B45) You need to select 2 cells in a row and click CTRL + SHIFT + ENTER(see article about). The value will be returned in the left cell A , on the right – b .

Note: To avoid messing with the input array formulas you will need to additionally use the INDEX() function. Formula = INDEX(LINEST(C26:C45,B26:B45),1) or just = LINEST(C26:C45;B26:B45) will return the parameter responsible for the slope of the line, i.e. A . Formula = INDEX(LINEST(C26:C45,B26:B45),2) will return the parameter responsible for the intersection of the line with the Y axis, i.e. b .

Having calculated the parameters, scatter diagram you can draw the corresponding line.

Another way to plot a straight line using the least squares method is the graph tool Trend line. To do this, select the diagram, select from the menu Layout tab, V group Analysis click Trend line, then Linear approximation .

By checking the “show equation in diagram” box in the dialog box, you can make sure that the parameters found above match the values in the diagram.

Note: In order for the parameters to match, the diagram type must be . The point is that when constructing a diagram Schedule X-axis values cannot be specified by the user (the user can only specify labels that do not affect the location of the points). Instead of X values, the sequence 1 is used; 2; 3; ... (for numbering categories). Therefore, if you build trend line on a type diagram Schedule, then instead of the actual values of X the values of this sequence will be used, which will lead to an incorrect result (unless, of course, actual values X do not match sequence 1; 2; 3; ...).

It has many uses as it allows for approximate representation given function others are simpler. LSM can be extremely useful in processing observations, and it is actively used to estimate some quantities based on the results of measurements of others containing random errors. In this article, you will learn how to implement least squares calculations in Excel.

Statement of the problem using a specific example

Suppose there are two indicators X and Y. Moreover, Y depends on X. Since OLS interests us from the point of view of regression analysis (in Excel its methods are implemented using built-in functions), we should immediately move on to considering a specific problem.

So, let X be the retail space of a grocery store, measured in square meters, and Y is the annual turnover, determined in millions of rubles.

It is required to make a forecast of what turnover (Y) the store will have if it has this or that retail space. Obviously, the function Y = f (X) is increasing, since the hypermarket sells more goods than the stall.

A few words about the correctness of the initial data used for prediction

Let's say we have a table built using data for n stores.

According to mathematical statistics, the results will be more or less correct if data on at least 5-6 objects is examined. In addition, “anomalous” results cannot be used. In particular, an elite small boutique can have a turnover that is several times greater than the turnover of large retail outlets of the “masmarket” class.

The essence of the method

The table data can be depicted on a Cartesian plane in the form of points M 1 (x 1, y 1), ... M n (x n, y n). Now the solution to the problem will be reduced to the selection of an approximating function y = f (x), which has a graph passing as close as possible to the points M 1, M 2, .. M n.

Of course you can use a polynomial high degree, but this option is not only difficult to implement, but also simply incorrect, since it will not reflect the main trend that needs to be detected. The most reasonable solution is to search for the straight line y = ax + b, which best approximates the experimental data, or more precisely, the coefficients a and b.

Accuracy assessment

With any approximation, assessing its accuracy is of particular importance. Let us denote by e i the difference (deviation) between the functional and experimental values for point x i, i.e. e i = y i - f (x i).

Obviously, to assess the accuracy of the approximation, you can use the sum of deviations, i.e., when choosing a straight line for an approximate representation of the dependence of X on Y, you should give preference to the one with the smallest value of the sum e i at all points under consideration. However, not everything is so simple, since along with positive deviations there will also be negative ones.

The issue can be solved using deviation modules or their squares. The last method is the most widely used. It is used in many areas including regression analysis(in Excel its implementation is carried out using two built-in functions), and has long proven its effectiveness.

Least square method

Excel, as you know, has a built-in AutoSum function that allows you to calculate the values of all values located in the selected range. Thus, nothing will prevent us from calculating the value of the expression (e 1 2 + e 2 2 + e 3 2 + ... e n 2).

In mathematical notation this looks like:

Since the decision was initially made to approximate using a straight line, we have:

Thus, the task of finding the straight line that best describes the specific dependence of the quantities X and Y comes down to calculating the minimum of a function of two variables:

To do this, you need to equate the partial derivatives with respect to the new variables a and b to zero, and solve a primitive system consisting of two equations with 2 unknowns of the form:

After some simple transformations, including division by 2 and manipulation of sums, we get:

Solving it, for example, using Cramer’s method, we obtain a stationary point with certain coefficients a * and b *. This is the minimum, i.e. to predict what turnover a store will have for a certain area, the straight line y = a * x + b * is suitable, which is a regression model for the example in question. Of course, it will not allow you to find the exact result, but it will help you get an idea of whether purchasing a specific area on store credit will pay off.

How to Implement Least Squares in Excel

Excel has a function for calculating values using least squares. It has the following form: “TREND” (known Y values; known X values; new X values; constant). Let's apply the formula for calculating OLS in Excel to our table.

To do this, enter the “=” sign in the cell in which the result of the calculation using the least squares method in Excel should be displayed and select the “TREND” function. In the window that opens, fill in the appropriate fields, highlighting:

range of known values for Y (in in this case data for trade turnover);
range x 1 , …x n , i.e. the size of retail space;
both known and unknown values of x, for which you need to find out the size of the turnover (for information about their location on the worksheet, see below).

In addition, the formula contains the logical variable “Const”. If you enter 1 in the corresponding field, this will mean that you should carry out the calculations, assuming that b = 0.

If you need to find out the forecast for more than one x value, then after entering the formula you should not press “Enter”, but you need to type the combination “Shift” + “Control” + “Enter” on the keyboard.

Some features

Regression analysis can be accessible even to dummies. The Excel formula for predicting the value of an array of unknown variables—TREND—can be used even by those who have never heard of least squares. It is enough just to know some of the features of its work. In particular:

If you arrange the range of known values of the variable y in one row or column, then each row (column) with known values of x will be perceived by the program as a separate variable.
If a range with known x is not specified in the TREND window, then when using the function in Excel, the program will treat it as an array consisting of integers, the number of which corresponds to the range with the given values of the variable y.
To output an array of “predicted” values, the expression for calculating the trend must be entered as an array formula.
If new values of x are not specified, then the TREND function considers them equal to the known ones. If they are not specified, then array 1 is taken as an argument; 2; 3; 4;…, which is commensurate with the range with already given parameters y.
The range containing the new x values must have the same or more rows or columns as the range containing the given y values. In other words, it must be proportional to the independent variables.
An array with known x values can contain multiple variables. However, if we are talking about only one, then it is required that the ranges with the given values of x and y be proportional. In the case of several variables, it is necessary that the range with the given y values fit in one column or one row.

PREDICTION function

Implemented using several functions. One of them is called “PREDICTION”. It is similar to “TREND”, i.e. it gives the result of calculations using the least squares method. However, only for one X, for which the value of Y is unknown.

Now you know formulas in Excel for dummies that allow you to predict the future value of a particular indicator according to a linear trend.

Least squares method (LSM)

System m linear equations with n unknowns has the form:

Three cases are possible: m n. The case when m=n was considered in the previous paragraphs. When m

If m>n and the system is consistent, then matrix A has at least m - n linearly dependent rows. Here the solution can be obtained by selecting n any linearly independent equations (if they exist) and applying the formula X = A -1 CV, that is, reducing the problem to a previously solved one. In this case, the resulting solution will always satisfy the remaining m - n equations.

However, when using a computer, it is more convenient to use a more general approach - the least squares method.

Algebraic least squares method

The algebraic least squares method is a method for solving systems of linear equations

by minimizing the Euclidean norm

Ax? b? >inf. (1.2)

Analysis of experimental data

Let us consider some experiment during which at moments of time

For example, temperature Q(t) is measured. Let the measurement results be specified by an array

Let us assume that the experimental conditions are such that measurements are carried out with a known error. In these cases, the law of temperature change Q(t) is sought using a certain polynomial

P(t) = + + + ... +,

determining the unknown coefficients, ..., from the considerations that the value E(, ...,), defined by the equality

gauss algebraic exel approximation

took the minimum value. Since the sum of squares is minimized, this method is called least squares approximation to data.

If we replace P(t) with its expression, we get

Let us set the task of defining an array so that the value is minimal, i.e. Let's define the array using the least squares method. To do this, we equate the partial derivatives to zero:

If you enter the m × n matrix A = (), i = 1, 2..., m; j = 1, 2, ..., n, where

I = 1, 2..., m; j = 1, 2, ..., n,

then the written equality will take the form

Let us rewrite the written equality in terms of operations with matrices. By the definition of multiplying a matrix by a column, we have

For a transposed matrix, a similar relationship looks like this

Let us introduce the notation: we will denote the i-th component of the vector Ax In accordance with the written matrix equalities, we will have

In matrix form this equality can be rewritten as

A T x=A T B (1.3)

Here A is a rectangular m×n matrix. Moreover, in data approximation problems, as a rule, m > n. Equation (1.3) is called a normal equation.

It was possible from the very beginning, using the Euclidean norm of vectors, to write the problem in equivalent matrix form:

Our goal is to minimize this function in x. In order for a minimum to be reached at a solution point, the first derivatives with respect to x at this point must be equal to zero. The derivatives of this function are

2A T B + 2A T Ax

and therefore the solution must satisfy the system of linear equations

(A T A)x = (A T B).

These equations are called normal equations. If A is an m× n matrix, then A>A - n × n is a matrix, i.e. The matrix of a normal equation is always a square symmetric matrix. Moreover, it has the property of positive definiteness in the sense that (A>Ax, x) = (Ax, Ax) ? 0.

Comment. Sometimes the solution to an equation of the form (1.3) is called a solution to the system Ax = B, where A is a rectangular m × n (m > n) matrix using the least squares method.

The least squares problem can be interpreted graphically as minimizing the vertical distances from data points to a model curve (see Figure 1.1). This idea is based on the assumption that all errors in the approximation correspond to errors in the observations. If there are also errors in the independent variables, then it may be more appropriate to minimize the Euclidean distance from the data to the model.

MNC in Excel

The algorithm below for implementing OLS in Excel assumes that all initial data are already known. We multiply both sides of the matrix equation AЧX=B of the system on the left by the transposed matrix of the system А Т:

A T AX=A T B

Then we multiply both sides of the equation on the left by the matrix (A T A) -1. If this matrix exists, then the system is defined. Considering that

(A T A) -1 *(A T A)=E, we get

X=(A T A) -1 A T B.

The resulting matrix equation is a solution to a system of m linear equations with n unknowns for m>n.

Let's consider the application of the above algorithm on specific example.

Example. Let it be necessary to solve the system

In Excel, the solution sheet in the formula display mode for this problem looks like this:

Calculation results:

The required vector X is located in the range E11:E12.

When solving a given system of linear equations, the following functions were used:

1. MOBR - returns inverse matrix for a matrix stored in an array.

Syntax: MOBR(array).

Array is a numeric array with an equal number of rows and columns.

2. MULTIPULT - returns the product of matrices (matrices are stored in arrays). The result is an array with the same number of rows as array1 and the same number of columns as array2.

Syntax: MULTIPLE(array1,array2).

Array1, array2 are multiplyable arrays.

After entering a function in the top left cell of an array range, select the array, starting with the cell containing the formula, press F2, and then press CTRL+SHIFT+ENTER.

3. TRANSPORT - converts a vertical set of cells into a horizontal one, or vice versa. As a result of using this function, an array appears with the number of rows equal to the number of columns of the original array, and the number of columns equal to the number of rows of the initial array.

4.1. Using built-in functions

Calculation regression coefficients carried out using the function

LINEST(Values_y; x-values; Const; statistics),

Values_y- array of y values,

x-values- optional array of values x, if array X is omitted, it is assumed that this is an array (1;2;3;...) of the same size as Values_y,

Const- a boolean value that indicates whether the constant is required b was equal to 0. If Const has the meaning TRUE or omitted, then b is calculated in the usual way. If the argument Const is FALSE, then b is assumed to be 0 and the values a are selected so that the relation is fulfilled y=ax.

Statistics is a boolean value that indicates whether additional regression statistics are required to be returned. If the argument Statistics has the meaning TRUE, then the function LINEST returns additional regression statistics. If the argument Statistics has the meaning LIE or omitted, then the function LINEST returns only the coefficient a and constant b.

It must be remembered that the result of the functions LINEST() is a set of values – an array.

For calculation correlation coefficient function is used

CORREL(Array1;Array2),

returning the values of the correlation coefficient, where Array1- array of values y, Array2- array of values x. Array1 And Array2 must be the same size.

EXAMPLE 1. Addiction y(x) is presented in the table. Build regression line and calculate correlation coefficient.

y		0.5		1.5		2.5		3.5
x		2.39	2.81	3.25	3.75	4.11	4.45	4.85	5.25

Let's enter a table of values into a MS Excel sheet and build a scatter plot. The worksheet will take the form shown in Fig. 2.

In order to calculate the values of regression coefficients A And b select the cells A7:B7, Let's go to the function wizard and in the category Statistical select a function LINEST. Let's fill in the dialog box that appears as shown in Fig. 3 and press OK.

As a result, the calculated value will appear only in the cell A6(Fig. 4). In order for the value to appear in the cell B6 you need to enter edit mode (key F2), and then press the key combination CTRL+SHIFT+ENTER.

To calculate the value of the correlation coefficient in a cell C6 was introduced following formula:

C7=CORREL(B3:J3;B2:J2).

Knowing the regression coefficients A And b let's calculate the function values y=ax+b for given x. To do this, we introduce the formula

B5=$A$7*B2+$B$7

and copy it to the range C5:J5(Fig. 5).

Let's plot the regression line on the diagram. Select the experimental points on the graph, right-click and select the command Initial data. In the dialog box that appears (Fig. 5), select the tab Row and click on the button Add. Let's fill in the input fields as shown in Fig. 6 and press the button OK. A regression line will be added to the experimental data graph. By default, its graph will be drawn as points not connected by smoothing lines.

To change the appearance of the regression line, perform the following steps. Right-click on the points depicting the line graph and select the command Chart type and set the type of scatter diagram, as shown in Fig. 7.

The line type, color and thickness can be changed as follows. Select a line on the diagram, right-click and select the command in the context menu Data series format... Next, make settings, for example, as shown in Fig. 8.

As a result of all transformations, we obtain a graph of experimental data and a regression line in one graphical area (Fig. 9).

4.2. Using a trend line.

The construction of various approximating dependencies in MS Excel is implemented as a chart property - trend line.

EXAMPLE 2. As a result of the experiment, a certain table dependence was determined.

0.15	0.16	0.17	0.18	0.19	0.20
4.4817	4.4930	5.4739	6.0496	6.6859	7.3891

Select and construct an approximating dependence. Construct graphs of tabular and selected analytical dependences.

Solving the problem can be divided into the following stages: entering initial data, constructing a scatter plot and adding a trend line to this graph.

Let's look at this process in detail. Let's enter the initial data into the worksheet and plot the experimental data. Next, select the experimental points on the graph, right-click and use the command Add l trend line(Fig. 10).

The dialog box that appears allows you to build an approximating relationship.

The first tab (Fig. 11) of this window indicates the type of approximating dependence.

On the second (Fig. 12) the construction parameters are determined:

· name of the approximating dependence;

· forecast forward (backward) by n units (this parameter determines how many units forward (backward) the trend line needs to be extended);

whether to show the point of intersection of a curve with a straight line y=const;

· show the approximating function on the diagram or not (the option to show the equation on the diagram);

· whether to place the value of the standard deviation on the diagram or not (the option to place the value of the approximation reliability on the diagram).

Let us choose a polynomial of the second degree as an approximating dependence (Fig. 11) and display the equation that describes this polynomial on a graph (Fig. 12). The resulting diagram is shown in Fig. 13.

Similarly using trend lines you can select the parameters of such dependencies as

linear y=a∙x+b,

logarithmic y=a∙ln(x)+b,

· exponential y=a∙e b,

· sedate y=a∙x b,

polynomial y=a∙x 2 +b∙x+c, y=a∙x 3 +b∙x 2 +c∙x+d and so on, up to a polynomial of the 6th degree inclusive,

· linear filtration.

4.3. Using the Options Analysis Tool: Finding a Solution.

Of significant interest is the implementation in MS Excel of selecting the parameters of a functional relationship using the least squares method using the option analysis tool: Search for a solution. This technique allows you to select the parameters of a function of any type. Let's consider this possibility using the following problem as an example.

EXAMPLE 3. As a result of the experiment, the dependence z(t) was obtained, presented in the table

0,66	0,9	1,17	1,47	1,7	1,74	2,08	2,63	3,12
38,9	68,8	64,4	66,5	64,95	59,36	82,6	90,63	113,5

Select dependence coefficients Z(t)=At 4 +Bt 3 +Ct 2 +Dt+K least squares method.

This problem is equivalent to the problem of finding the minimum of a function of five variables

Let's consider the process of solving the optimization problem (Fig. 14).

Let the values A, IN, WITH, D And TO stored in cells A7:E7. Let's calculate the theoretical values of the function Z(t)=At 4 +Bt 3 +Ct 2 +Dt+K for given t(B2:J2). To do this, in the cell B4 enter the value of the function at the first point (cell B2):

B4=$A$7*B2^4+$B$7*B2^3+$C$7*B2^2+$D$7*B2+$E$7.

Let's copy this formula into the range C4:J4 and get the expected value of the function at the points whose abscissas are stored in the cells B2:J2.

To cell B5 Let's introduce a formula that calculates the square of the difference between the experimental and calculated points:

B5=(B4-B3)^2,

and copy it to the range C5:J5. In a cell F7 we will store the total squared error (10). To do this, enter the formula:

F7 = SUM(B5:J5).

Let's use the command Service®Search for a solution and solve the optimization problem without restrictions. Let's fill in the input fields in the dialog box shown in Fig. accordingly. 14 and press the button Execute. If a solution is found, the window shown in Fig. 15.

The result of the decision block will be output to cells A7:E7parameter values functions Z(t)=At 4 +Bt 3 +Ct 2 +Dt+K. In cells B4:J4 we get expected function value at the starting points. In a cell F7 will be stored total square error.

You can display experimental points and a fitted line in one graphic area by selecting a range B2:J4, call Chart Wizard, and then format the appearance of the resulting graphs.

Rice. 17 displays the MS Excel worksheet after the calculations have been performed.