Bivariate linear regression



Least squares line

  • Let the n observed values of x and y be termed xi and yi, where i = 1, 2, 3, ... , n.
  • The sum of squared errors, ∑ε², is minimized when b0 and b1 take on the following values:

    b1 = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)²

    b0 = ȳ − b1x̄
  Province               Income   Alcohol
  Newfoundland             26.8       8.7
  Prince Edward Island     27.1       8.4
  Nova Scotia              29.5       8.8
  New Brunswick            28.4       7.6
  Quebec                   30.8       8.9
  Ontario                  36.4      10.0
  Manitoba                 30.4       9.7
  Saskatchewan             29.8       8.9
  Alberta                  35.1      11.1
  British Columbia         32.5      10.9
  • Income is family income in thousands of dollars per capita, 1986 (independent variable).
  • Alcohol is litres of alcohol consumed per person 15 years of age or over, 1985–86 (dependent variable).
  • Is alcohol a superior good?
  • Sources: Saskatchewan Alcohol and Drug Abuse Commission, Fast Factsheet, Regina, 1988; Statistics Canada, Economic Families – 1986 [machine-readable data file], 1988.
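
A minimal sketch of the least squares formulas above in Python (assuming NumPy is available; names are illustrative), applied to the ten provincial observations in the table:

  import numpy as np

  # Income (x, thousands of dollars per capita) and alcohol consumption
  # (y, litres per person aged 15 or over) from the table above.
  x = np.array([26.8, 27.1, 29.5, 28.4, 30.8, 36.4, 30.4, 29.8, 35.1, 32.5])
  y = np.array([8.7, 8.4, 8.8, 7.6, 8.9, 10.0, 9.7, 8.9, 11.1, 10.9])

  x_bar, y_bar = x.mean(), y.mean()    # 30.68 and 9.3

  # Least squares slope and intercept.
  b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
  b0 = y_bar - b1 * x_bar

  print(b1, b0)    # roughly 0.2759 and 0.8348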

Hypotheses

  • H0: β1 = 0. Income has no effect on alcohol consumption.
  • H1: β1 > 0. Income has a positive effect on alcohol consumption.
  Province             x       y     x − x̄   y − ȳ   (x − x̄)(y − ȳ)   (x − x̄)²
  Newfoundland        26.8    8.7    -3.88    -0.6         2.328       15.0544
  PEI                 27.1    8.4    -3.58    -0.9         3.222       12.8164
  Nova Scotia         29.5    8.8    -1.18    -0.5         0.59         1.3924
  New Brunswick       28.4    7.6    -2.28    -1.7         3.876        5.1984
  Quebec              30.8    8.9     0.12    -0.4        -0.048        0.0144
  Ontario             36.4   10.0     5.72     0.7         4.004       32.7184
  Manitoba            30.4    9.7    -0.28     0.4        -0.112        0.0784
  Saskatchewan        29.8    8.9    -0.88    -0.4         0.352        0.7744
  Alberta             35.1   11.1     4.42     1.8         7.956       19.5364
  British Columbia    32.5   10.9     1.82     1.6         2.912        3.3124
  Sum                306.8   93.0     0        0          25.08        90.896
  Mean                30.68   9.3

  b1 = 25.08 / 90.896 = 0.275919732
  b0 = 9.3 − b1 × 30.68 = 0.834782609
  SUMMARY OUTPUT

  Regression Statistics
  Multiple R           0.790288
  R Square             0.624555
  Adjusted R Square    0.577624
  Standard Error       0.721104
  Observations        10

  ANOVA
                     df        SS        MS         F   Significance F
  Regression          1  6.920067  6.920067  13.30803         0.006513
  Residual            8  4.159933  0.519992
  Total               9  11.08

                 Coefficients  Standard Error    t Stat   P-value
  Intercept          0.834783        2.331675  0.358018  0.729592
  X Variable 1       0.27592         0.075636  3.648018  0.006513
  • Analysis: b1 = 0.276 and its standard error is 0.076, for a t value of 3.648. At α = 0.01, the null hypothesis can be rejected (under H0, the probability of a t statistic at least this large in absolute value is 0.0065) and the alternative hypothesis accepted. At the 0.01 level of significance, there is evidence that alcohol is a superior good, i.e. that income has a positive effect on alcohol consumption.
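
As a cross-check of the Excel output, a hedged sketch assuming SciPy is installed (scipy.stats.linregress reports a two-sided p-value, which matches the P-value column above):

  from scipy import stats

  x = [26.8, 27.1, 29.5, 28.4, 30.8, 36.4, 30.4, 29.8, 35.1, 32.5]
  y = [8.7, 8.4, 8.8, 7.6, 8.9, 10.0, 9.7, 8.9, 11.1, 10.9]

  res = stats.linregress(x, y)
  print(res.slope, res.intercept)    # about 0.2759 and 0.8348
  print(res.stderr)                  # standard error of the slope, about 0.0756
  print(res.slope / res.stderr)      # t statistic, about 3.648
  print(res.pvalue)                  # two-sided p-value, about 0.0065
  print(res.rvalue ** 2)             # R squared, about 0.625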

Uses of regression line

  • Draw the line: select two x values (e.g. 26 and 36) and compute the predicted y values (8.0 and 10.8, respectively). Plot these two points and draw the line through them.
  • Interpolation: if a city had a mean income of $32,000, the expected level of alcohol consumption would be 9.7 litres per capita, as in the sketch below.
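
A minimal sketch of these predictions in Python (the rounded coefficient values are taken from the estimates above):

  b0, b1 = 0.8348, 0.2759    # estimated intercept and slope

  def predict(income):
      # Predicted litres of alcohol per capita at a given family income ($000s).
      return b0 + b1 * income

  for income in (26, 36, 32):
      print(income, round(predict(income), 1))    # about 8.0, 10.8 and 9.7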

Extrapolation

  • Suppose a city had a mean income of $50,000 in 1986. From the equation, expected alcohol consumption would be 14.6 litres per capita.
  • Cautions:
    • Model was tested over the range of income values from 26 to 36 thousand dollars. While it appears to be close to a straight line over this range, there is no assurance that a linear relation exists outside this range.
    • Model does not fit all points – only 62% of the variation in alcohol consumption is explained by this linear model.
    • Confidence intervals for prediction become larger the further the independent variable x is from its mean.

Change in y resulting from change in x

  • The estimate of the change in y resulting from a one-unit change in x is b1.
  • For the alcohol consumption example, b1 = 0.276.
  • A 10.0 thousand dollar increase in income is therefore associated with a 2.76 litre increase in annual alcohol consumption per capita, at least over the range estimated.
  • This can be used to calculate the income elasticity of alcohol consumption, as illustrated below.
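  • For illustration, assuming the point elasticity is evaluated at the sample means: elasticity ≈ b1 × (x̄ / ȳ) = 0.276 × (30.68 / 9.3) ≈ 0.91, so a 1% increase in income is associated with roughly a 0.9% increase in alcohol consumption over this range.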

Goodness of fit (ASW, 12.3)

  • y is the dependent variable, or the variable to be explained.
  • How much of y is explained statistically from the regression model, in this case the line?
  • Total variation in y is termed the total sum of squares, or SST.
  • The common measure of goodness of fit of the line is the coefficient of determination, the proportion of the variation or SST that is “explained” by the line.

SST or total variation of y

  • The difference of any observed value of y from the mean is the difference between the observed and predicted values plus the difference of the predicted value from the mean of y:

    yi − ȳ = (yi − ŷi) + (ŷi − ȳ)

    where yi − ȳ is the difference from the mean, yi − ŷi is the "error" of prediction, and ŷi − ȳ is the value of y "explained" by the line.
  • From this, it can be proved that the sums of squares decompose the same way:

    ∑(yi − ȳ)² = ∑(yi − ŷi)² + ∑(ŷi − ȳ)²
    SST = SSE + SSR

  • SST = total variation of y
  • SSE = "unexplained" or "error" variation of y
  • SSR = "explained" variation of y

Variation in y

  [Figure: scatter diagram of y against x with the least squares line ŷ = b0 + b1x, marking an observed value yi and its predicted value ŷi at xi.]

Variation in y “explained” by the line

  [Figure: the same diagram, highlighting the "explained" portion ŷi − ȳ of the deviation, the part accounted for by the line ŷ = b0 + b1x.]

Variation in y that is “unexplained” or error

  [Figure: the same diagram, highlighting the "unexplained" or error portion yi − ŷi, the vertical distance from the observed point to the line ŷ = b0 + b1x.]

Coefficient of determination

  • The coefficient of determination, r² or R² (the notation used in many texts), is defined as the ratio of the "explained" or regression sum of squares, SSR, to the total variation or sum of squares, SST: R² = SSR / SST.
  • The coefficient of determination is the square of the correlation coefficient r. As noted by ASW (483), r is the square root of the coefficient of determination, with the same sign (positive or negative) as b1.
  • Calculations for SSE, SSR, SST and R²:
  Province    x       y    Predicted y (ŷ)   Residual (y − ŷ)   (y − ŷ)²   (ŷ − ȳ)²   (y − ȳ)²
  Nfld       26.8    8.7       8.229431           0.470569      0.221435   1.146117     0.36
  PEI        27.1    8.4       8.312207           0.087793      0.007708   0.975734     0.81
  NS         29.5    8.8       8.974415          -0.17441       0.03042    0.106006     0.25
  NB         28.4    7.6       8.670903          -1.0709        1.146833   0.395763     2.89
  Que        30.8    8.9       9.33311           -0.43311       0.187585   0.001096     0.16
  Ont        36.4   10.0      10.87826           -0.87826       0.771342   2.490907     0.49
  Man        30.4    9.7       9.222742           0.477258      0.227775   0.005969     0.16
  SK         29.8    8.9       9.057191          -0.15719       0.024709   0.058956     0.16
  Alb        35.1   11.1      10.51957            0.580435      0.336905   1.487339     3.24
  BC         32.5   10.9       9.802174           1.097826      1.205222   0.252179     2.56
  Sum                                                     SSE = 4.159933  SSR = 6.920067  SST = 11.08

  R squared = SSR / SST = 6.920067 / 11.08 = 0.624555

  These column sums match the SS column of the ANOVA table in the regression output above (Regression SS = SSR = 6.920067, Residual SS = SSE = 4.159933, Total SS = SST = 11.08).
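
A minimal sketch (assuming NumPy) that reproduces the SSE, SSR, SST and R squared figures in this table:

  import numpy as np

  x = np.array([26.8, 27.1, 29.5, 28.4, 30.8, 36.4, 30.4, 29.8, 35.1, 32.5])
  y = np.array([8.7, 8.4, 8.8, 7.6, 8.9, 10.0, 9.7, 8.9, 11.1, 10.9])

  b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
  b0 = y.mean() - b1 * x.mean()
  y_hat = b0 + b1 * x                       # predicted values

  sse = np.sum((y - y_hat) ** 2)            # "unexplained" variation, about 4.16
  ssr = np.sum((y_hat - y.mean()) ** 2)     # "explained" variation, about 6.92
  sst = np.sum((y - y.mean()) ** 2)         # total variation, 11.08 (= SSE + SSR)
  print(sse, ssr, sst, ssr / sst)           # R squared about 0.625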

Interpretation of R2

  • Proportion, or percentage if multiplied by 100, of the variation in the dependent variable that is statistically explained by the regression line.
  • 0 ≤ R² ≤ 1.
  • Large R2 means the line fits the observed points well and the line explains a lot of the variation in the dependent variable, at least in statistical terms.
  • Small R2 means the line does not fit the observed points very well and the line does not explain much of the variation in the dependent variable.
    • Random or error component dominates.
    • Missing variables.
    • Relationship between x and y may not be linear.

How large is a large R2?

  • Extent of relationship: a weak relationship is associated with a low value of R² and a strong relationship with a large value.
  • Type of data
    • Micro/survey data associated with small values of R2. For schooling/earnings example, R2 = 0.253. Much individual variation.
    • Grouped data associated with larger values of R2. In income/alcohol example, R2 = 0.625. Grouping averages out individual variation.
    • Time series data often results in very high R2. In consumption function example (next slide), R2 = 0.988. Trends often move together.
  [Figure: consumption (y) plotted against GDP (x), Canada, 1995 to 2004, quarterly data.]

Beware of R2

  • Difficult to compare across equations, especially with different types of data and forms of relationships.
  • Adding more variables to the model can increase R². Adjusted R² can correct for this (ASW, Chapter 13); a worked check follows this list.
  • Grouped or averaged observations can result in larger values of R2.
  • Need to test for statistical significance.
  • We want good estimates of β0 and β1, rather than high R2.
  • At the same time, for similar types of data and issues, a model with a larger value of R2 may be preferable to one with a smaller value.
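  • As a check on the regression output above, using the standard adjusted R² formula: adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1) = 1 − (1 − 0.624555)(9/8) ≈ 0.5776, which matches the 0.577624 reported by Excel (n = 10 observations, k = 1 independent variable).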

Next day

  • Assumptions of regression model.
  • Testing for statistical significance.
