POLI 101 LAB 19a
Statistical Control in Regression
PURPOSE
- To learn how to control for a third variable with regression
- To learn how to compare relationships before and after statistical controls.
- To learn how to distinguish among types of relationships: replication, specification, explanation and interpretation.
MAIN POINTS
- Regression allows us to examine the relationship between two variables while controlling for a third variable. We do this to determine what effect the independent variable has after we take into account the influence of a third (control) variable. We add a control variable to a bivariate relationship not only to estimate the influence of the control variable on the dependent variable but also to see what effect the control variable may have on the original relationship.
- For example, if we predict attitudes regarding recreational marijuana (the dependent variable) using the independent variable political party affiliation while controlling for gender, we can see not only the effect of gender on the DV but also whether the relationship between party affiliation and marijuana attitude is affected.
- Generally speaking, a second /method = enter subcommand is used to add a control variable in a separate step or stage of a regression analysis. This is generally called hierarchical regression, though some authors describe it as conducting regression analysis in stages. Irrespective of the label, this technique entails using a fixed order of entry for our predictors.
- The first model produced by the regression analysis will show the relationship between Party Affiliation and Marijuana attitude without any controls.
- The second model will show the relationship between party affiliation (X1) and marijuana attitude (Y) as well as the relationship between gender (X2) and marijuana attitude (Y).
- The most common findings with hierarchical regression are replication, explanation/interpretation and specification. Replication is discussed in the first part of this lab. Explanation and interpretation are discussed in the second part. Specification will be discussed in the next lab.
- With Replication: The regression coefficient (b) remains more or less the same before and after the second predictor is entered into the equation. This implies that the original relationship holds true even when taking into account the control variable.
- For example, if the regression coefficient for Party Affiliation remains much the same after adding a second predictor, then we would conclude that the result is replication.
- Thus we might find that Party Affiliation predicts marijuana attitudes equally well before and after adding a dummy variable for gender such as female.
EXAMPLE 1: replication
Variables, Missing Values & Recodes
- Dataset: PPIC October 2016
- Independent Variable: Party Affiliation
- Dependent Variable: Attitudes regarding Recreational Marijuana
- Control Variable: Gender (coded as Female)
- Preliminary Hypothesis: the relationship between Party Affiliation and Marijuana Attitude is essentially the same before and after controlling for Gender.
Syntax
*Weighting the Data*.
weight by weight.
*Recoding MJ Index Items*.
recode q21 (1=1) (2=0) into MJPropD.
value labels MJPropD 1 'yes' 0 'no'.
recode q36 (1=1) (2=0) into MJLegalD.
value labels MJLegalD 1 'yes' 0 'no'.
recode q36a (1=1) (2=.5) (3=.0) into MJTry.
value labels MJTry 1 'recent' .5 'not recent' 0 'no'.
*Constructing an Index with alpha = .777*.
compute RawMJ3 = (MJPropD + MJLegalD + MJTry).
*Creating IV Indicators of Party Identification & Ideology*.
*Democrat5 (adapted from lab 7)*.
if (q40c = 1) and (q40e = 1) Democrat5 = 0.
if (q40c = 1) and (q40e = 2) Democrat5 = .25.
if (q40c = 3) Democrat5 = .5.
if (q40c = 2) and (q40d = 2) Democrat5 = .75.
if (q40c = 2) and (q40d = 1) Democrat5 = 1.
value labels Democrat5 0 'strRep' .25 'Rep' .5 'Indep' .75 'Dem' 1 'strDem'.
recode q37 (1=1) (2=.75) (3=.5) (4=.25) (5=0) into liberal5.
value labels liberal5 1 'vlib' .75 'liberal' .5 'middle' .25 'conserv' 0 'vcons'.
*Additional control variables*.
recode gender (1=0) (2=1) into female.
recode q38 (1=1) (2=.66) (3=.33) (4=0) into interest.
value labels interest 0 'none' .33 'only a little' .66 'fair amount' 1 'great deal'.
*Hierarchical regression: Democrat5 entered first, then female*.
regression variables=RawMJ3 Democrat5 female
  /statistics anova coeff r tol
  /descriptives = n
  /dependent = RawMJ3
  /method = enter Democrat5
  /method = enter female.
Output
Elaboration Models
Replication
Example 1: Gender
Standardized Coefficients
| | Model 1 | Model 2 |
| democrat5 | .209*** | .230*** |
| female | | -.186*** |
| Adj R2 | .043 | .062 |
| Tol | | .987 |
| N | (963) | (963) |
Unstandardized Coefficients
| | Model 1 | Model 2 |
| constant | 1.108 (.072) | 1.277 (.076) |
| democrat5 | .734 (.111) | .808 (.110) |
| female | | -.429 (.072) |
| Adj R2 | .043 | .062 |
| N | (963) | (963) |
- Interpretation of Results
- The results are presented first using standardized coefficients and then using unstandardized coefficients. In each instance two models are shown explaining variation in attitudes toward recreational marijuana. The first model presents a bivariate regression with only party affiliation as the IV. It explains just over 4% of the variation in views regarding the recreational use of marijuana. The second is a multivariate model explaining 6% of the variation in attitudes toward marijuana through a linear combination of party affiliation and a gender dummy (female). Looking first to the standardized results, the findings show that the more Democratic a respondent is, the greater the support for recreational marijuana. Compared with males, females are less supportive. Moreover, the influence of partisanship is greater than that of gender.
- Controlling for gender does not appreciably affect the influence of partisanship on attitudes. This is evident in several ways. First, the partisan coefficient remains statistically significant after controlling for gender. Second, referring to the rough and ready “rule of thirds,” the partisanship coefficient does not drop by anything approaching a third of its value; in fact it rises slightly, from .209 to .230. Third, the more statistically rigorous technique utilizing the coefficient's standard error suggests the change in partisanship is not beyond what one might attribute to chance. For this we turn to the unstandardized results.
- As you will recall from our discussion of statistical significance, 95% of the cases on a normal distribution fall within about two (more precisely 1.96) standard errors on each side of a point estimate. Therefore, just as with measures of association in analyzing control tables, multiplying the standard error by two (1.96) and adding and subtracting the result to the coefficient can be used to determine significant differences in unstandardized coefficients. In this case, 1.96 × .111 = .218, hence the b value for Democrat5 in Model 2 would have to be greater than .952 (or less than .516) to differ significantly from the .734 of Model 1. The Model 2 value of .808 falls well inside that range.
- By any of these standards, this is a case of replication. So we can conclude that the original relationship between Party Affiliation and support for recreational marijuana holds even after applying our control for gender.
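The two numerical checks used in this interpretation can be reproduced with a few lines of arithmetic. The following is a minimal sketch in Python rather than SPSS, offered only as a way to verify the hand calculation. The helper names `proportional_drop` and `band` are hypothetical, introduced here for illustration, and the coefficients are those reported for Democrat5 in the Example 1 tables.

```python
# Coefficients for Democrat5 as reported in the Example 1 tables.
BETA_MODEL1 = 0.209  # standardized coefficient, Model 1
BETA_MODEL2 = 0.230  # standardized coefficient, Model 2
B_MODEL1 = 0.734     # unstandardized coefficient, Model 1
SE_MODEL1 = 0.111    # standard error of b in Model 1
B_MODEL2 = 0.808     # unstandardized coefficient, Model 2


def proportional_drop(before, after):
    """Share of the original coefficient lost after adding the control.

    A negative value means the coefficient grew instead of shrinking.
    Assumes `before` is not zero.
    """
    return (before - after) / before


def band(b, se, z=1.96):
    """Lower and upper limits of b plus or minus z standard errors."""
    margin = z * se
    return b - margin, b + margin


# Check 1: the rough "rule of thirds" on the standardized coefficients.
drop = proportional_drop(BETA_MODEL1, BETA_MODEL2)

# Check 2: does the Model 2 b fall within 1.96 standard errors of the Model 1 b?
lower, upper = band(B_MODEL1, SE_MODEL1)
inside = lower <= B_MODEL2 <= upper

print(f"proportional drop in Beta: {drop:.3f}")
print(f"band around Model 1 b: {lower:.3f} to {upper:.3f}")
print(f"Model 2 b inside band: {inside}")
```

A negative drop together with a Model 2 coefficient inside the band is what the lab treats as replication. The one-third cutoff is the lab's rough guideline, not a formal test.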
Example 2: Interest
- Preliminary Hypothesis: the relationship between Party Affiliation and Marijuana Attitude is essentially the same before and after controlling for Interest.
Standardized Coefficients
| | Model 1 | Model 2 |
| Democrat5 | .213*** | .219*** |
| interest | | .136*** |
| Adj R2 | .043 | .060 |
| Tol | | .998 |
| N | (950) | (950) |
Interpreting the Regression Results
- In this case, only the standardized results are presented.
- Model 1 again presents the bivariate relationship between Party Affiliation and Attitudes regarding Recreational Marijuana.
- Model 2 adds Political Interest to the analysis. This increases the adjusted R-square from roughly 4% to 6%.
- The Beta coefficients reveal Partisanship to be a better predictor than Interest. Moreover, the effect of Partisanship remains essentially unchanged after the addition of the control variable to the equation. Note also that the relationship remains statistically significant before and after controlling for interest.
- This is an obvious case of replication. More detailed analysis using standard errors is unnecessary.
INSTRUCTIONS – Part 1: attempting control
- Use SPSS to access an appropriate dataset and run a bivariate regression predicting a Dependent with one Independent variable.
- Note the significance of the equation, the adjusted R-square as well as both standardized and unstandardized regression coefficients.
- Edit your regression syntax to add a control variable by adding a second /method = enter subcommand, naming the second IV.
- Examine the strength and significance of both Independent Variables as well as the adjusted R-square.
- With reference to all the relevant statistics, determine whether the results indicate replication or not.
- Carefully explain what factors led you to your conclusion.
- Proceed to Part 2
Part 2:
USING CONTROL VARIABLES TO UNDERSTAND SPURIOUS RELATIONSHIPS: Explanation and Interpretation
- While the most common result using statistical control variables in regression analysis is replication, other more theoretically interesting results are certainly possible.
- If the relationship between an X variable and a Y variable in a regression analysis is substantially weaker after introducing a control variable into the equation, then the relationship may be either partly or wholly spurious. Recall that in a spurious relationship, the original relationship is revealed to be due to the influence of a third variable (Z) used as a control. We can often better understand statistically spurious relationships by analyzing the theoretical relations among the three variables.
- More specifically, in understanding a spurious relationship it is often useful to determine whether the control variable is theoretically antecedent to the other two variables or intervening between them.
- An antecedent variable is logically (or temporally) prior to both of the original X and Y variables. Where the control variable is antecedent to both the independent and dependent variables, the finding of a spurious relationship is termed “explanation.”
Symbolically, Z –> X,Y.
The idea is that the control variable explains why X and Y are related. With explanation, X and Y are not related because they are a cause and an effect; they are related because both X and Y are affected by Z.
- An intervening variable is one that is logically (or temporally) prior to one of the variables, but not both of them: it follows X and precedes Y. Where the control variable is intervening, a finding of a spurious relationship is called “interpretation” because it clarifies (in whole or part) the process through which the relationship between X and Y functions. Many authors call this mediation, as the control variable is theorized to mediate between an independent and dependent variable. This is represented symbolically as:
X –> Z –> Y
- In some cases, certain variables could not plausibly be considered to be antecedent. For example, in the case of a spurious relationship between gender and income, the control variable, education, cannot possibly be antecedent to both gender and income because education cannot plausibly be a cause of gender.
- In other cases, you can determine whether the control variable is intervening or antecedent only through theorizing which variable(s) is causally prior. It can take some very careful thinking to sort through the logical or theoretical ordering of our variables in deciding whether a spurious relationship should be regarded as explanation or interpretation.
- There are, of course, other possible outcomes in using statistical controls such as suppression and distortion as mentioned in lecture. These will not be explored here, but a quick summary is available in Lab 20.
EXAMPLE 3: Interpretation and Explanation
Preliminary Hypothesis: Ideology interprets (mediates) the relationship between Party Affiliation and Attitudes toward Recreational Marijuana.
As before, the focus is on the relationship between Party Identification and attitudes toward Marijuana. This is what some authors would call the focal relationship. The following regression analysis examines this relationship while controlling for Ideology as measured by Liberal5. Here’s the syntax:
regression variables=RawMJ3 Democrat5 liberal5
/statistics anova coeff r tol
/descriptives = n
/dependent = RawMJ3
/method = enter Democrat5
/method = enter liberal5.
And here’s a summary of the output presented in tabular form:
Predicting Attitudes toward Recreational Marijuana
with Party Preference (Democrat5) & Ideology (Liberal5)
(Standardized coefficients)
| | Model 1 | Model 2 |
| democrat5 | .212*** | .054 |
| liberal5 | | .323*** |
| Adj R2 | .044 | .122 |
| N | (950) | (950) |
Once again, there are two models: the first is of the bivariate relationship between Party Affiliation and Attitudes toward Recreational Marijuana. It shows essentially the same finding as summarized by the first model in each of the foregoing analyses. (Note: small variations in the coefficients are due to differing numbers of missing cases on the control variables as reflected in sample sizes.) Here, however, the second model controls for Political Ideology. Again the adjusted R-square value increases, this time more dramatically, from .04 to .12. Perhaps more interestingly, the Beta for Party Affiliation plummets, from roughly .2 in Model 1 to .05 in Model 2. Moreover, the relationship between Party Affiliation and Marijuana Attitudes in Model 2 is no longer significant. So controlling for Ideology makes the original focal relationship essentially disappear. Meanwhile, the Beta coefficient for Ideology is both relatively strong and significant. (Using the syntax provided, the reader can confirm that the unstandardized coefficient drops by nearly 5 times its standard error.)
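To put the size of this drop in proportional terms, the standardized coefficients from the table can be compared directly. This is a small Python sketch for checking the arithmetic, not part of the SPSS workflow. The variable names are hypothetical, and treating "one third" as a hard cutoff is an assumption made here to illustrate the lab's rough rule of thirds.

```python
# Standardized coefficients (Beta) for Democrat5 from the Example 3 table.
beta_model1 = 0.212  # before controlling for liberal5
beta_model2 = 0.054  # after controlling for liberal5

# Proportion of the original Beta that disappears once Ideology is controlled.
proportional_drop = (beta_model1 - beta_model2) / beta_model1

# Rough "rule of thirds": a drop of a third or more counts as substantial.
# The exact cutoff is an illustrative assumption, not a formal test.
exceeds_one_third = proportional_drop >= 1 / 3

print(f"proportional drop: {proportional_drop:.1%}")
print(f"exceeds one third: {exceeds_one_third}")
```

Roughly three quarters of the original Beta disappears, far more than in the replication examples, where the coefficient barely moved.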
At a theoretical level, since there is no evident causal ordering of Party and Ideology, these results can be interpreted as either explanation or interpretation. Explanation would suggest that Party Affiliation and Marijuana Attitudes are both the result of Political Ideology. Interpretation would suggest that Party Affiliation leads to Political Ideology which in turn leads to Attitudes regarding Marijuana.
Recall that explanation can be depicted as Z–>X, Y, whereas Interpretation (or mediation) is conventionally shown as
X–>Z–>Y.
In this instance, Explanation and Interpretation are both plausible accounts of the obtained results. They cannot be distinguished empirically because neither Party Affiliation nor Ideology is logically prior to the other. Nevertheless, a substantial line of theoretical work in the field of political science suggests that partisanship is temporally prior to Ideology insofar as children learn at a young age to support the political party favored by their parents, long before anything resembling a political ideology develops. This would seem to support interpretation as more appropriate for these results. Moreover, an experimental effort to disentangle partisanship and Ideology also suggests party is causally prior to ideology (see “disentangling-party & ideology”).
INSTRUCTIONS – Part 2: using additional control variables
- Using a data set of interest select a dependent variable, an independent variable, and a control variable that may help you to explain or interpret the original IV-DV relationship.
- Select appropriate measures of association and statistical significance.
- Perform the simple bivariate regression, noting the strength of the original relationship by looking at the standardized as well as the unstandardized coefficients and checking their statistical significance.
- Next add a control variable to the analysis by adding a second /method = enter subcommand to your syntax. Note that this control variable must also be listed among the variables on the first line of the regression command.
- Examine the strength and significance of both the first and second independent variables now in the equation. Pay particular attention to whether the regression coefficient for the first independent variable has substantially changed. If it hasn’t, you have replication. If it has markedly decreased in either magnitude or significance, you may have identified an instance of explanation or interpretation. Be sure also to consider the unstandardized coefficient and its standard error to confirm that any change you observe is greater than what can be attributed to chance.
- Think carefully through the theoretical ordering of your variables to determine whether any decrease in the strength of your original independent variable as a predictor should be viewed as explanation or interpretation.
- Explain what theoretical factors led you to your conclusion.
- Try a second control variable.
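The decision steps above can be summarized as a simplified sketch in Python. It is an illustration under stated assumptions, not a substitute for judgment. The function `classify` and its arguments are hypothetical names. It uses only the rough rule of thirds on the coefficient for the original IV, so it ignores statistical significance and standard errors. It does not cover specification, suppression, or distortion. The position of the control variable ("antecedent" or "intervening") must come from your own theoretical reasoning.

```python
def proportional_drop(before, after):
    """Share of the original coefficient lost after adding the control.

    Assumes `before` is not zero.
    """
    return (before - after) / before


def classify(before, after, control_position=None, threshold=1 / 3):
    """Label a result using the rough rule of thirds (illustrative only).

    before, after    -- coefficient for the original IV in Model 1 and Model 2
    control_position -- "antecedent", "intervening", or None if undetermined
    threshold        -- drop treated as substantial (assumed to be one third)
    """
    if proportional_drop(before, after) < threshold:
        return "replication"
    if control_position == "antecedent":
        return "explanation"
    if control_position == "intervening":
        return "interpretation"
    return "explanation or interpretation"


# Example 1: controlling for gender (Beta .209 -> .230).
print(classify(0.209, 0.230))
# Example 3: controlling for ideology, treated as intervening (Beta .212 -> .054).
print(classify(0.212, 0.054, "intervening"))
# Example 3 with no theoretical ordering supplied.
print(classify(0.212, 0.054))
```

The third call shows why the theoretical step matters: the numbers alone cannot separate explanation from interpretation.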
QUESTIONS FOR REFLECTION
- Try to understand each step of each regression you run in order to decide whether the results indicate replication, interpretation or explanation.
- Remember that just as the artist sculpts with the brain, not only the chisel, the data analyst must compute with the brain, not only the computer.