UCSC Lab 17
Poli 101
LAB 17
Multiple Regression
PURPOSE
- To learn how to perform and interpret multiple regression analysis.
- To understand the meaning and use of Beta.
- To learn about different regression procedures.
MAIN POINTS
Multiple Regression
- Regression analysis can be performed with more than one independent variable. Regression involving two or more independent variables is called multiple regression.
- Hence, the multiple regression equation takes the form:
- Y = a + b1X1 + b2X2 + b3X3 + … + bnXn
- Dependent Variable = Constant + (Coefficient1 X Independent Variable1) + (Coefficient2 X Independent Variable2) + (Coefficient3 X Independent Variable3) + … etc.
- Y is the predicted value of the dependent variable given the values of the independent variables (X1, X2, X3…etc.).
- What is unique about multiple regression is that for each of the independent variables, the analysis controls for the effect of the other independent variables. This means that the effect of any independent variable is estimated apart from the effect of all the other independent variables. In this way, it accomplishes the same sort of controlled comparisons as control tables.
- Interpretation of the unstandardized coefficients (b) is much as it is in bivariate regression: a one-unit change in the independent variable changes the predicted value of the dependent variable by the amount of the regression coefficient (b). In multiple regression, however, each b is estimated while controlling for the effects of the other independent variables (a brief hypothetical example follows this list).
- The R2 value for multiple regression is similar to the r2 in bivariate regression: it indicates the proportion of variation in the dependent variable explained collectively by the set of independent variables taken together. R2 increases with each independent variable added to the regression model, even when the added variables have no real effect on the dependent variable. We therefore also report an Adjusted R2, which corrects for this artificial inflation of R2 in multiple regression models.
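For instance, a purely hypothetical two-predictor equation (illustrative numbers only, not from any dataset used in this lab) predicting annual income from education and age might be: Income = 15,000 + 2,500(Years of Education) + 300(Age). Here each additional year of education is associated with $2,500 more predicted income, holding age constant, and each additional year of age with $300 more, holding education constant. The adjustment to R2 mentioned above uses the standard formula Adjusted R2 = 1 − (1 − R2)(n − 1) ÷ (n − k − 1), where n is the number of cases and k is the number of independent variables.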
b & Beta
- The relative size of two b values is not necessarily a good indication of which independent variable is a better predictor of the dependent variable since the magnitude of b depends on its particular units of measurement. We can often make better comparisons among independent variables by standardizing the variables. This is done by converting the values of an independent variable into units of standard deviation from its mean.
- Beta is the standardized version of b. It indicates the effect that a one-standard-deviation change in the independent variable has on the dependent variable, also measured in standard deviation units. Because all Betas are expressed in the same standardized units, they facilitate comparisons among independent variables (the formula relating b and Beta appears after this list).
- Values of b and Beta are both calculated in a multiple regression analysis. Values of b are used in formulating a multiple regression equation. However, b values do not have a common benchmark for comparison since the b values depend on how the variables are coded.
- Betas from the same multiple regression analysis can be readily compared to one another. The higher the Beta value, the stronger the relationship the respective independent variable has with the dependent variable.
- Comparing Betas from equations based on different samples can be misleading, however, since the variance of the standard errors for the samples may differ substantially. In such cases it is best to report the unstandardized b value.
- Values of b allow us to understand the theoretical importance of an independent variable. When variables are measured in concrete units like dollars, years, or percentages, b is relatively easy to interpret because it expresses the potential effect of an independent variable on the dependent variable in their original units of measurement. The meaning of Beta is less intuitive and cannot be interpreted concretely, but when independent variables are measured in different units, only Betas allow us to compare directly the effects of the different independent variables on the dependent variable.
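For reference, the conversion between the two coefficients is straightforward: for each independent variable, Beta = b × (standard deviation of that independent variable ÷ standard deviation of the dependent variable). SPSS reports both coefficients in the same output table, so the conversion never has to be done by hand.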
Multicollinearity
- When some of the independent variables are very closely related to one another, multiple regression cannot reliably separate out their independent effects on the dependent variable. This is referred to as multicollinearity (several independent variables being nearly collinear, i.e., highly correlated with one another).
- Multicollinearity usually occurs when two or more independent variables measuring the same concept are entered into the regression equation. It often results in strong, but statistically insignificant, regression coefficients (due to large standard errors). We look for multicollinearity either by examining the correlations among our independent variables or, more rigorously, by requesting tolerance measures (tol) as part of our regression analysis, using the statistics subcommand.
- Tolerance indicates the extent to which an independent variable is related to the other independent variables in the model. Its values range from zero (.00) to one (1.0). A tolerance of 1.0 means the variable is unrelated to the other independent variables; a tolerance near .00 means the variable is almost completely accounted for by the other independent variables (see the note after this list for how tolerance is calculated).
- Multicollinearity only becomes a problem as tolerance approaches zero. As a general rule, a tolerance score of .20 or less indicates that collinearity is a problem. When this is found to be the case, eliminate one of the variables involved, or combine them into an index.
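For reference, the tolerance for a given independent variable is computed as 1 − R2 from an auxiliary regression of that variable on all of the other independent variables, and the VIF that SPSS reports alongside it is simply 1 ÷ tolerance. A tolerance of .20 therefore corresponds to a VIF of 5.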
Example #1: Multiple Regression Using Public Opinion Data
- Dataset: ANES 2012
- Dependent Variable: Economic Equality Index, RawEqIndex (Alpha = .70)
- Indicators: EcEq1 (cses_govtact), EcEq2 (ineqinc_ineqreduc), EcEq3 (guarpr_self)
- Independent Variables:
- Feeling toward Democratic Party (ft_dem)
- Feeling toward Republican Party (ft_rep)
- Improved Econ (econ_ecpast_x)
- Expected Improvement in Econ (econ_ecnext_x)
- Hypothesis Arrow Diagram:
- Positive Feeling toward Democrats –> ↑ RawEqIndex
- Positive Feeling toward Republicans –> ↓ RawEqIndex
- Improved Economy –> ↓ RawEqIndex
- Improving Economy –> ↓ RawEqIndex
Syntax
weight by weight_full.

*Coding the DV's indicators*.
missing values cses_govtact (-9 thru -6).
recode cses_govtact (1=1) (2=.75) (3=.5) (4=.25) (5=0) into eceq1.
missing values ineqinc_ineqreduc (-9 thru -6).
recode ineqinc_ineqreduc (1=1) (2=0) (3=.5) into eceq3.
missing values guarpr_self (-9 thru -2).
recode guarpr_self (1=1) (2=.832) (3=.666) (4=.5) (5=.332) (6=.166) (7=0) into eceq5.

*Constructing the DV's Index*.
compute RawEqIndex = eceq1 + eceq3 + eceq5.

*Coding the IVs*.
*Partisan feeling thermometers*.
missing values ft_dem ft_rep (-2, -8, -9).

*Econ-past & future*.
missing values econ_ecpast_x (-9 thru -1).
missing values econ_ecnext_x (-9 thru -1).

*Correlation matrix - DV & IVs*.
correlations RawEqIndex ft_dem ft_rep econ_ecpast_x econ_ecnext_x.

*Regression of EconEq Index on Party Feeling Therms & Economy-Past & Future*.
regression variables = RawEqIndex ft_dem ft_rep econ_ecpast_x econ_ecnext_x
 /statistics anova coeff r tol
 /descriptives = n
 /dependent = RawEqIndex
 /method = enter.
Syntax Legend
- The relevant variables are recoded into new variable names and missing values are declared.
- A correlation matrix is run to examine the relationships between the DV and the IVs, as well as among the IVs.
- The regression procedure identifies the variables to be used in the equation.
- The statistics subcommand asks for the output to include anova, basic regression and correlation coefficients as well as the tolerances (tol), a collinearity diagnostic measure.
- The descriptives subcommand asks for output to indicate the number of cases on which the regression is calculated.
- The dependent subcommand indicates that the RawEqIndex is the dependent variable.
- The method subcommand says to enter the variables into the equation.
SPSS Output
Correlation Procedure
Correlations

|              | RawEqIndx | FT Dem | FT Rep | Ec 1 yr – | Ec 1 yr + |
|--------------|-----------|--------|--------|-----------|-----------|
| RawEqIndex   | 1.000     |        |        |           |           |
| Feeling Dem  | .511      | 1.000  |        |           |           |
| Feeling Rep  | -.439     | -.512  | 1.000  |           |           |
| Econ 1 yr –  | -.337     | -.465  | .397   | 1.000     |           |
| Econ 1 yr +  | -.250     | -.396  | .231   | .504      | 1.000     |
Regression Procedure
Correlations (note this table is inaptly named)

|   |                | RawEqIndx | Feel Dem | Feel Rep | Econ yr ago | Econ yr more |
|---|----------------|-----------|----------|----------|-------------|--------------|
| N | RawEqIndex     | 4877      | 4877     | 4877     | 4877        | 4877         |
|   | Feeling Dem    | 4877      | 4877     | 4877     | 4877        | 4877         |
|   | Feeling Rep    | 4877      | 4877     | 4877     | 4877        | 4877         |
|   | Econ 1 yr ago  | 4877      | 4877     | 4877     | 4877        | 4877         |
|   | Econ yr fr now | 4877      | 4877     | 4877     | 4877        | 4877         |
Variables Entered/Removed (a)

| Model | Variables Entered                                | Variables Removed | Method |
|-------|--------------------------------------------------|-------------------|--------|
| 1     | Econ 1 yr +, Econ 1 yr –, Feel Dem, Feel Rep (b) | .                 | Enter  |

a. Dependent Variable: RawEqIndex
b. All requested variables entered.
Model Summary

| Model | R     | R Square | Adjusted R Square | Std. Error of the Estimate |
|-------|-------|----------|-------------------|----------------------------|
| 1     | .552a | .305     | .304              | .70176                     |

a. Predictors: Econ 1 yr +, Econ 1 yr –, Feel Dem, Feel Rep
ANOVA (a)

| Model |            | Sum of Squares | df   | Mean Square | F       | Sig.  |
|-------|------------|----------------|------|-------------|---------|-------|
| 1     | Regression | 1051.085       | 4    | 262.771     | 533.580 | .000b |
|       | Residual   | 2399.481       | 4872 | .492        |         |       |
|       | Total      | 3450.566       | 4876 |             |         |       |

a. Dependent Variable: RawEqIndex
b. Predictors: Econ 1 yr +, Econ 1 yr –, Feel Dem, Feel Rep
(Regression) Coefficients
|            | B (Unstandardized) | Std. Error | Beta (Standardized) | t       | Sig. | Tolerance | VIF   |
|------------|--------------------|------------|---------------------|---------|------|-----------|-------|
| (Constant) | 1.325              | .057       |                     | 23.232  | .000 |           |       |
| Dem Feel   | .010               | .000       | .350                | 23.052  | .000 | .621      | 1.612 |
| Rep Feel   | -.007              | .000       | -.224               | -15.643 | .000 | .695      | 1.439 |
| Econ Yr 1+ | -.054              | .011       | -.072               | -4.775  | .000 | .631      | 1.584 |
| Econ Yr 1- | -.016              | .012       | -.019               | -1.316  | .188 | .705      | 1.418 |

Dependent Variable: RawEqIndex
Interpretation of output
The correlation procedure results show some strong relationships between the DV and the IVs, but the IVs are conceptually distinct from the DV. They also show some strong relationships among the IVs, suggesting caution over whether the two feeling thermometers (r = -.512) are essentially measuring the same thing, and whether the two economic measures (r = .504) are measuring more or less the same thing. These results suggest looking carefully at the tolerance scores in the regression analysis.
The regression procedure produces an inaptly named table entitled correlations. It actually shows the number of cases (N) on which the regression is based.
The model summary table reports an R-square indicating explained variance of approximately 30%.
The ANOVA table is used to assess the significance of the overall model. In this case it is .000, indicating a very small chance of the results being due to sampling error.
In the coefficients table, the b values indicate the direction and amount of change in the dependent variable associated with a one-unit change in each independent variable, controlling for the others. In this example, DemFeel and RepFeel are measured at the interval level, with scores ranging from 0 through 100, whereas the economic indicators are ordinal measures with three values. The regression results show that attitudes supporting economic equality depend to a significant degree on the first three independent variables. The b value for the Democratic feeling thermometer is positive, so higher values on this predictor are associated with more support for economic equality. The other predictors' b values are all negative, so higher values on each of these independent variables are associated with lower values on the dependent variable. Since the independent variables are not measured on the same scale, however, the b values cannot be directly compared.
The Beta values indicate the relative influence of the variables in comparable (standard deviation) units. We can see from the Beta values that feelings toward the Democratic Party have a greater impact on the Economic Equality Index than do any of the other predictors.
The tolerance levels indicate that while the independent variables are correlated with one another, collinearity is not a serious problem here (all tolerances are well above .20).
The significance of each individual independent variable is assessed with a version of the t-test. The t-ratio (or t-score) is calculated by dividing the b value by the standard error of b. Because of rounding in the displayed output, this is less evident for the Democratic and Republican feeling thermometers than for the economic predictors. (The actual b value for DemFeel is .010295 and its standard error is .000447, each of which can be seen by double-clicking on the relevant coefficients in the SPSS output.) As is usual in significance testing, a t-ratio of at least 1.96 in absolute value reaches the .05 level of statistical significance. The first three variables easily exceed this value, so we can be confident that their respective relationships with the DV are not due to chance. The remaining economic measure (Econ Yr 1-) has a t-ratio of -1.32, signifying that its relationship could well be due to chance; it is therefore regarded as statistically insignificant.
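As a quick check using the more precise values just cited: t = b ÷ SE(b) = .010295 ÷ .000447 ≈ 23.0, which matches the reported t of 23.052 apart from rounding of the displayed figures.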
The constant (or y-intercept) indicates the value of ‘a’ in the regression equation.
One can write the regression equation using the a and b values provided in the output. The regression equation here is:
RawEqIndex = 1.33 + .010(DemFeel) – .007(RepFeel) – .054(Econ 1yr+) – .016(Econ Yr1-).
This equation can be used to predict attitudes toward economic equality for different combinations of values on the independent variables (e.g., someone who rates the Democrats at 75 and expects economic improvement). In theoretically oriented social science research, however, such point predictions are rarely the main concern; the focus is usually on estimating the relationships between the independent and dependent variables.
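If one does want predicted values, a minimal sketch in SPSS syntax is shown below. It simply plugs the rounded coefficients into a COMPUTE statement; it assumes that Econ 1yr+ corresponds to econ_ecnext_x and Econ Yr1- to econ_ecpast_x, and the new variable name PredEqIndex is arbitrary.

*Illustrative only: predicted values from the rounded coefficients above*.
*Assumes Econ 1yr+ = econ_ecnext_x and Econ Yr1- = econ_ecpast_x*.
compute PredEqIndex = 1.325 + .010*ft_dem - .007*ft_rep
    - .054*econ_ecnext_x - .016*econ_ecpast_x.
execute.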
INSTRUCTIONS
Multiple Regression
- Select an available public opinion dataset of interest and review the questionnaire.
- Hypothesize a relationship between a dependent variable and at least two independent variables. The variables can be either interval or ordinal, or a nominal variable coded as a dichotomy.
- For example, as in the example above, partisan feelings and perceptions of the economy may both affect support for economic equality.
- Based on a Frequency run, decide how to recode each variable (if necessary) and declare missing values.
- You may wish to create a correlation matrix with your variables to ensure that your independent variables are related to your dependent variable (and to ensure that your independent variables are not so closely related to one another that multi-collinearity will present a problem).
- Create and run the appropriate syntax in SPSS to run a regression analysis with two independent variables.
- Based on the multiple regression output, determine whether the overall equation is significant (Sig.<.05) and if so whether the independent variables have significant effects on the dependent variable.
- Perhaps add a third independent variable to your regression. In the example used here you might try one of the following variables:
*Additional IVs*.
*Demographics*.
missing values dem_agegrp_iwdate (-9 thru -1).
missing values inc_incgroup_pre (-9 thru -1).
missing values dem_edugroup (-9, -2).
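For instance, a minimal sketch of adding education (dem_edugroup) as an additional predictor to the regression from Example #1 might look like the following; it assumes the recodes and missing-value declarations from Example #1 have already been run, and you can substitute whichever additional IV you prefer.

regression variables = RawEqIndex ft_dem ft_rep econ_ecpast_x econ_ecnext_x dem_edugroup
 /statistics anova coeff r tol
 /descriptives = n
 /dependent = RawEqIndex
 /method = enter.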
Example #2: Multiple Regression working with subsets of cases.
Hypotheses:
Partisan attitudes and personal finances have a greater influence on egalitarian attitudes among men than among women.
Syntax:
*Create female indicator*.
recode gender_respondent (1=0) (2=1) into female.

*Re-run the regression within gender groupings*.
temporary.
select if female = 1.
regression variables = RawEqIndex ft_dem ft_rep econ_ecpast_x econ_ecnext_x
 /statistics coeff r tol
 /descriptives = n
 /dependent = RawEqIndex
 /method = enter.

temporary.
select if female = 0.
regression variables = RawEqIndex ft_dem ft_rep econ_ecpast_x econ_ecnext_x
 /statistics coeff r tol
 /descriptives = n
 /dependent = RawEqIndex
 /method = enter.
Syntax Legend
- These commands must be used in conjunction with the recodes used in the prior example.
- The Temporary and Select if commands are used to select subsets of cases. Here, subsetting allows us to run the same regression analysis separately for women and men; respondents’ gender is distinguished using the dichotomous variable Female.
- As in the prior example, the regressions again estimate the relative and joint effects of partisan feelings and perceived personal finances on egalitarian attitudes. In this case, however, separate regressions are run for female and male respondents. The anova keyword has not been included in the statistics subcommand for this example.
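As a side note, the same grouped output can be produced with SPSS’s SPLIT FILE facility instead of Temporary/Select if. A minimal sketch, assuming the female recode above has already been run:

sort cases by female.
split file separate by female.
regression variables = RawEqIndex ft_dem ft_rep econ_ecpast_x econ_ecnext_x
 /statistics coeff r tol
 /descriptives = n
 /dependent = RawEqIndex
 /method = enter.
split file off.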
SPSS Output
Female = 1.
Correlations

|   |                | RawEqIndx | Feel Dem | Feel Rep | Econ yr ago | Econ yr more |
|---|----------------|-----------|----------|----------|-------------|--------------|
| N | RawEqIndex     | 2473      | 2473     | 2473     | 2473        | 2473         |
|   | Feeling Dem    | 2473      | 2473     | 2473     | 2473        | 2473         |
|   | Feeling Rep    | 2473      | 2473     | 2473     | 2473        | 2473         |
|   | Econ 1 yr ago  | 2473      | 2473     | 2473     | 2473        | 2473         |
|   | Econ yr fr now | 2473      | 2473     | 2473     | 2473        | 2473         |
Model Summary

| Model | R     | R Square | Adjusted R Square | Std. Error of the Estimate |
|-------|-------|----------|-------------------|----------------------------|
| 1     | .529a | .280     | .279              | .68417                     |

a. Predictors: Econ 1 yr +, Econ 1 yr –, Feel Dem, Feel Rep
Coefficients

|            | B (Unstandardized) | Std. Error | Beta (Standardized) | t       | Sig. | Tolerance | VIF   |
|------------|--------------------|------------|---------------------|---------|------|-----------|-------|
| (Constant) | 1.428              | .081       |                     | 17.703  | .000 |           |       |
| Dem Feel   | .009               | .001       | .331                | 14.067  | .000 | .589      | 1.697 |
| Rep Feel   | -.006              | .001       | -.221               | -10.490 | .000 | .655      | 1.527 |
| Econ Yr 1+ | -.049              | .015       | -.068               | -3.169  | .002 | .626      | 1.598 |
| Econ Yr 1- | -.034              | .017       | -.041               | -2.022  | .043 | .707      | 1.414 |

Dependent Variable: RawEqIndex
Female = 0
Correlations

|   |                | RawEqIndx | Feel Dem | Feel Rep | Econ yr ago | Econ yr more |
|---|----------------|-----------|----------|----------|-------------|--------------|
| N | RawEqIndex     | 2404      | 2404     | 2404     | 2404        | 2404         |
|   | Feeling Dem    | 2404      | 2404     | 2404     | 2404        | 2404         |
|   | Feeling Rep    | 2404      | 2404     | 2404     | 2404        | 2404         |
|   | Econ 1 yr ago  | 2404      | 2404     | 2404     | 2404        | 2404         |
|   | Econ yr fr now | 2404      | 2404     | 2404     | 2404        | 2404         |
Model Summary

| Model | R     | R Square | Adjusted R Square | Std. Error of the Estimate |
|-------|-------|----------|-------------------|----------------------------|
| 1     | .574a | .329     | .328              | .71620                     |

a. Predictors: Econ 1 yr +, Econ 1 yr –, Feel Dem, Feel Rep
Coefficients

|            | B (Unstandardized) | Std. Error | Beta (Standardized) | t       | Sig. | Tolerance | VIF   |
|------------|--------------------|------------|---------------------|---------|------|-----------|-------|
| (Constant) | 1.243              | .081       |                     | 15.310  | .000 |           |       |
| Dem Feel   | .012               | .001       | .377                | 18.047  | .000 | .624      | 1.559 |
| Rep Feel   | -.008              | .001       | -.238               | -12.128 | .000 | .729      | 1.372 |
| Econ Yr 1+ | -.062              | .015       | -.078               | -3.697  | .000 | .629      | 1.589 |
| Econ Yr 1- | -.007              | .018       | -.008               | -.407   | .684 | .701      | 1.427 |

Dependent Variable: RawEqIndex
The output appears in two sections. The first portion selects cases in which Female = 1, so only female respondents are included (N = 2473). The second portion selects cases in which Female = 0, so only males are included (N = 2404).
Interpretation of Output:
The equation for males accounts for nearly thirty-three percent of the variation in the dependent variable. The signs on all the coefficients are as before, and the first three IVs are again significant while the last one is not. The Beta coefficients again show that partisan feeling is a more important predictor of egalitarianism than economic circumstances.
The equation for females accounts for approximately twenty-eight percent of the variation in egalitarianism. The signs remain as before; in this case, however, all of the coefficients are statistically significant.
Comparing the Beta coefficients for DemFeel and RepFeel across the two gender groups suggests that partisan feeling may have a greater impact on attitudes toward economic equality among males than among females. Checking the b values and their associated standard errors, however, suggests that this difference may well be due to chance, since the estimates overlap when each is considered within +/- 1.96 times its standard error, though RepFeel comes close.
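One common way to put such a comparison on a firmer footing (not part of the lab output) is to compute z = (b for males − b for females) ÷ √(SE for males² + SE for females²) and compare the result with ±1.96. Note, however, that the standard errors displayed in the coefficient tables above are rounded too coarsely to calculate this precisely; the unrounded values can be obtained by double-clicking the relevant coefficients in the SPSS output.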
Example # 3 Multiple Regression with worlddata (Aggregate) Data
REGRESSION variables = IncomeShareTop10 CivilLiberties TransparencyIndex
 /statistics coeff r tol
 /descriptives = n
 /dependent = IncomeShareTop10
 /method = enter.
Comments on Aggregate Data Syntax
The regression command lists both dependent and independent variables.
The statistics subcommand asks for regression coefficients, explained variance (r), and tolerance. Anova has again been omitted but can be reinserted.
The descriptives subcommand asks for the number of cases used in the regression.
The dependent variable is declared with the dependent subcommand.
The method subcommand indicates that all the independent variables should be entered together.
Example # 3 Output
Correlations

|   |                                  | Income share held by highest 10% | Freedom House score | Transparency Index |
|---|----------------------------------|----------------------------------|---------------------|--------------------|
| N | Income share held by highest 10% | 123                              | 123                 | 123                |
|   | Freedom House score              | 123                              | 123                 | 123                |
|   | Transparency Index               | 123                              | 123                 | 123                |
Model Summary

| Model | R     | R Square | Adjusted R Square | Std. Error of the Estimate |
|-------|-------|----------|-------------------|----------------------------|
| 1     | .365a | .133     | .119              | 6.29900                    |

a. Predictors: (Constant), Transparency Index, Freedom House score
Coefficients

| Model |                    | B (Unstandardized) | Std. Error | Beta (Standardized) | Sig. | Tolerance |
|-------|--------------------|--------------------|------------|---------------------|------|-----------|
| 1     | (Constant)         | 35.660             | 1.504      |                     | .000 |           |
|       | Freedom House      | .054               | .055       | .118                | .334 | .487      |
|       | Transparency Index | -1.429             | .396       | -.440               | .000 | .487      |
Interpretation of output
Adjusted R-square indicates explained variance of approximately 12%.
The b values indicate the direction and number of units (as coded) of change in the dependent variable due to a one-unit change in each independent variable. The Freedom House rating of civil liberties in a country is positively related to income inequality (b = .054), and greater transparency of a nation’s government is associated with a less concentrated distribution of income (b = -1.429). Since the independent variables do not use the same measurement scale, the b values cannot be directly compared.
The Beta coefficients indicate the relative influence of the variables in comparable (standard deviation) units. The transparency rating has roughly four times the influence of the freedom rating on the DV.
The tolerance scores indicate that the independent variables are likely correlated but not to such an extent that they measure the same thing.
The influence of the freedom score is no greater than one would expect due to chance. In contrast, the transparency rating is statistically significant.
The constant (or y-intercept) indicates the value of ‘a’ in the regression equation.
With this information one can write the regression equation:
Income inequality = 35.66 + .054(freedom) – 1.429(transparency)
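To illustrate, with purely hypothetical input values (check the actual coding of the two indices in the worlddata file before substituting real scores): a country with a freedom score of 50 and a transparency score of 5 would have a predicted top-decile income share of 35.66 + .054(50) – 1.429(5) ≈ 31.2.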
QUESTIONS FOR REFLECTION
- What is the difference between Pearson’s r analysis and multiple regression?
- Why do the values of the coefficients differ based on the combination of the independent variables that are included in the analysis?
- How can we visualize the results of a multiple regression equation?
DISCUSSION
- Multiple regression is distinct from Pearson’s correlation insofar as it allows us to determine the relative effects of an independent variable upon a given dependent variable while controlling for the effect of all the other variables in the equation. In contrast, correlation analysis only allows us to compare the uncontrolled relationships between two variables.
- There may be some change in the value of the coefficients when different combinations of variables are included in the regression because the analysis controls for the effects of all the other variables included in the equation.
- A three dimensional scatterplot can be created using:
graph /scatterplot(xyz)=IV1 with DV with IV2.
or
graph /scatterplot(xyz)=ft_dem with RawEqIndex with econ_ecnext_x.