The data set BrownLOBPassives
contains frequency information about VPs in the Brown and LOB corpora (Brown: AmE, LOB: BrE):
library(dplyr)    # provides %>% and mutate()
library(ggplot2)  # provides ggplot()
Passives <- BrownLOBPassives
Passives <- Passives %>% mutate(relfreq = 100 * passive / n_s)
ggplot(Passives, aes(x = lang, y = relfreq)) +
geom_boxplot(notch = TRUE) +
labs(title = "Relative frequencies of passives in American and British English",
y = "Relative frequency (in %)")
Do US-Americans use passives less frequently than Brits?
We can look at the effect of language variety by plotting a series of boxplots subdivided by genre:
library(lattice)  # provides bwplot()
bwplot(relfreq ~ lang | genre, data=Passives)
bwplot(relfreq ~ genre | lang, data=Passives, scales=list(x=list(rot=30)))
If we want to test the effect of two categorical variables on the frequency of passives at the same time, we can go for a linear model:
LM <- lm(relfreq ~ genre + lang, data=Passives)
summary(LM)
##
## Call:
## lm(formula = relfreq ~ genre + lang, data = Passives)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.107 -6.320 -1.417 5.742 45.332
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.7575 1.2780 18.589 < 2e-16 ***
## genrepress editorial -0.7122 1.9398 -0.367 0.7136
## genreskills / hobbies 5.6460 1.7699 3.190 0.0015 **
## genremiscellaneous 17.6634 1.8787 9.402 < 2e-16 ***
## genrelearned 16.3499 1.4892 10.979 < 2e-16 ***
## genregeneral fiction -14.4026 1.8978 -7.589 1.21e-13 ***
## genrescience fiction -13.7013 3.4531 -3.968 8.11e-05 ***
## genreadventure -17.9596 1.8978 -9.463 < 2e-16 ***
## genreromance -17.0833 1.8978 -9.001 < 2e-16 ***
## langBrE 2.0764 0.8999 2.307 0.0214 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.22 on 612 degrees of freedom
## Multiple R-squared: 0.5941, Adjusted R-squared: 0.5882
## F-statistic: 99.54 on 9 and 612 DF, p-value: < 2.2e-16
The summary shows us the effect of each category on the response variable. We can obtain a more concise overview, with one test per variable rather than per level, using the following function:
anova(LM)
## Analysis of Variance Table
##
## Response: relfreq
## Df Sum Sq Mean Sq F value Pr(>F)
## genre 8 112130 14016.2 111.3142 < 2e-16 ***
## lang 1 670 670.4 5.3238 0.02137 *
## Residuals 612 77061 125.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we can see which variables have a significant effect on the response variable.
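The individual coefficients are easiest to read with R's default treatment contrasts in mind: the intercept is the mean of the reference level (here apparently press reportage in AmE), and each other coefficient is the difference from that reference. A minimal sketch with invented toy data (the data frame `toy` and its values are made up for illustration):

```r
# Invented toy data: two groups with known means
toy <- data.frame(
  y = c(10, 12, 14, 20, 22, 24),
  g = factor(rep(c("A", "B"), each = 3))  # "A" is the reference level
)
m <- lm(y ~ g, data = toy)
coef(m)
# (Intercept) = mean of group A = 12
# gB          = mean(B) - mean(A) = 22 - 12 = 10
```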
Q: Interpret the outputs of summary(LM) and anova(LM). How much of the variation is explained by the model?
plot() can be used to visualize the residuals:
plot(LM)
In a well-behaved model, the residuals follow a normal distribution and are homoscedastic.
Q: Is this the case here?
The function confint() allows us to directly compute confidence intervals for the effects of the variables' levels:
confint(LM)
## 2.5 % 97.5 %
## (Intercept) 21.2476335 26.267282
## genrepress editorial -4.5216121 3.097148
## genreskills / hobbies 2.1701713 9.121838
## genremiscellaneous 13.9739945 21.352905
## genrelearned 13.4252380 19.274524
## genregeneral fiction -18.1296653 -10.675492
## genrescience fiction -20.4826933 -6.919985
## genreadventure -21.6866815 -14.232509
## genreromance -20.8103538 -13.356181
## langBrE 0.3091067 3.843640
Q: Use confint() to work out which categories have a significant effect on the frequency of passives (at the 99% level).
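For an lm fit, confint() computes estimate ± t(0.975, df) · SE. A minimal sketch with invented data, checking the manual computation against confint(); for the 99% intervals asked for above, pass level = 0.99:

```r
# Invented toy data for illustration
toy <- data.frame(x = 1:10,
                  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 14.1, 15.9, 18.2, 19.8))
m   <- lm(y ~ x, data = toy)
est <- coef(m)["x"]
se  <- summary(m)$coefficients["x", "Std. Error"]
# 95% interval by hand: estimate +/- t-quantile * standard error
manual <- est + c(-1, 1) * qt(0.975, df = df.residual(m)) * se
all.equal(unname(manual), unname(confint(m)["x", ]))  # TRUE
confint(m, level = 0.99)                              # the 99% intervals
```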
We can also visualize the partial effects of the explanatory variables:
library(effects)  # provides Effect()
plot(Effect("lang", LM))
plot(Effect(c("lang", "genre"), LM)) # combined effect
plot(Effect(c("lang", "genre"), LM), multiline=TRUE, ci.style="bars")
Linear models also enable us to model interactions between the explanatory variables:
LM.interaction <- lm(relfreq ~ genre * lang, data=Passives)
anova(LM.interaction)
## Analysis of Variance Table
##
## Response: relfreq
## Df Sum Sq Mean Sq F value Pr(>F)
## genre 8 112130 14016.2 112.6504 < 2e-16 ***
## lang 1 670 670.4 5.3877 0.02061 *
## genre:lang 8 1909 238.7 1.9183 0.05482 .
## Residuals 604 75151 124.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(Effect(c("lang", "genre"), LM.interaction), multiline=TRUE, ci.style="bars")
Q: Is the interaction term significant?
Which model is better, LM or LM.interaction? As well as looking at the adjusted coefficient of determination, we can also turn to a general information criterion, e.g. Akaike's Information Criterion (AIC):
AIC(LM, LM.interaction)
## df AIC
## LM 11 4784.831
## LM.interaction 19 4785.224
The AIC measures both the goodness of fit and the simplicity of a model. The lower the value, the better the model.
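Concretely, AIC = 2k − 2 log L, where k is the number of estimated parameters (for lm: the coefficients plus the residual variance). A quick check of this identity with invented data:

```r
set.seed(1)                       # invented toy data for illustration
toy <- data.frame(x = 1:20)
toy$y <- 2 * toy$x + rnorm(20)
m <- lm(y ~ x, data = toy)
k <- length(coef(m)) + 1          # intercept, slope, and residual variance
all.equal(AIC(m), 2 * k - 2 * as.numeric(logLik(m)))  # TRUE
```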
Q: Which model is better, LM or LM.interaction? Do both AIC and adjusted R² arrive at the same result?
Linear models can pose problems when we attempt to use them to explain corpus frequencies: they do not model the binomial sampling distribution of the observed frequencies, and their predicted values are not constrained to the interval [0, 1].
Generalized linear models (GLMs), on the other hand, are better suited for this. We assume that the observed frequencies genuinely follow a binomial sampling distribution:
\[ f_i \sim \text{Bin}(n_i, \pi_i) \] In order to map the probability of success (for passive) \(\pi_i\) for text \(i\) into \([0,1]\), we use a so-called (inverse) link function, the logistic function \[ \pi_i=\dfrac{1}{1+e^{-\theta_i}} \] whose parameter \(\theta_i\) is determined by a linear predictor that contains the explanatory variables: \[ \theta_i=\beta_0 + \beta_1(\text{genre}_i)+\beta_2(\text{language variety}_i) \]
The parameters can now be estimated from the sample via maximum likelihood estimation (MLE). Since all this happens in the vast depths of R, we need only transform the input appropriately.
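In base R, the logistic function and its inverse (the logit) are available as plogis() and qlogis(), which is handy for converting between the \(\theta\) and \(\pi\) scales:

```r
theta <- c(-2, 0, 1.5)
# plogis() implements 1 / (1 + exp(-theta))
all.equal(plogis(theta), 1 / (1 + exp(-theta)))   # TRUE
p <- c(0.1, 0.5, 0.9)
all.equal(qlogis(p), log(p / (1 - p)))            # TRUE: the logit
all.equal(plogis(qlogis(p)), p)                   # TRUE: inverse pair
```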
First, we need a response matrix which contains the observed absolute passive frequencies (successes) and non-passive frequencies:
response.matrix <- cbind(Passives$passive, Passives$n_s - Passives$passive)
The generalized linear model can now be fitted using glm. To do this, we specify the formula of the underlying linear model (response.matrix ~ genre + lang) as well as the error distribution; in our example, that's the binomial distribution. glm automatically chooses the matching (canonical) link function (in our case the logistic function, i.e. the inverse of the logit):
GLM <- glm(response.matrix ~ genre + lang, family="binomial", data=Passives)
Using plot, we can look at model diagnostics:
plot(GLM)
Plot nr. 1: Residuals vs Fitted:
x-axis: fitted (predicted) values; y-axis: residuals (differences between observed and predicted values).
This plot is mainly used to check for non-linearity (in which case the red line would not be straight).
Plot nr. 2: Normal Q-Q:
Quantile-quantile plot: Are the residuals normally distributed? Points should ideally be very close to the line.
Plot nr. 3: Scale-Location:
This plot is similar to the first one – it’s especially suited to check the assumption of homoscedasticity. Ideally, we want a straight red line around which the points are distributed randomly and evenly.
Plot nr. 4: Residuals vs Leverage:
This plot is used to find observations which have a particularly strong influence on the regression line (by comparing how the line would look with and without individual observations).
Of interest are cases in the upper or lower right that are outside of the dotted lines.
Numbers in the plots denote row numbers of conspicuous cases. If a number is seen in several or all of the plots, you should probably look at the observation more carefully.
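The flagged row numbers can also be extracted numerically, e.g. via cooks.distance(); a sketch with invented data containing one obviously influential point:

```r
# Invented toy data: ten points on a line plus one high-leverage outlier
toy <- data.frame(x = c(1:10, 20), y = c(1:10, 5))
m <- lm(y ~ x, data = toy)
which.max(cooks.distance(m))  # observation 11 dominates
```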
Here are the results:
anova(GLM, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: response.matrix
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 621 10375.1
## genre 8 7148.1 613 3227.1 < 2.2e-16 ***
## lang 1 25.8 612 3201.3 3.869e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(GLM)
##
## Call:
## glm(formula = response.matrix ~ genre + lang, family = "binomial",
## data = Passives)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.5829 -1.5141 -0.1767 1.4590 7.9327
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.21499 0.02856 -42.546 < 2e-16 ***
## genrepress editorial -0.03719 0.04254 -0.874 0.382
## genreskills / hobbies 0.28501 0.03774 7.553 4.27e-14 ***
## genremiscellaneous 0.76378 0.04109 18.588 < 2e-16 ***
## genrelearned 0.79348 0.03216 24.669 < 2e-16 ***
## genregeneral fiction -1.12015 0.04703 -23.818 < 2e-16 ***
## genrescience fiction -1.03812 0.08388 -12.376 < 2e-16 ***
## genreadventure -1.50126 0.04988 -30.097 < 2e-16 ***
## genreromance -1.40082 0.04915 -28.499 < 2e-16 ***
## langBrE 0.10406 0.02051 5.073 3.91e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10375.1 on 621 degrees of freedom
## Residual deviance: 3201.3 on 612 degrees of freedom
## AIC: 5985.4
##
## Number of Fisher Scoring iterations: 4
Confidence intervals for the effect size:
confint(GLM)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -1.27118004 -1.15923175
## genrepress editorial -0.12072090 0.04604523
## genreskills / hobbies 0.21105060 0.35898160
## genremiscellaneous 0.68326189 0.84433697
## genrelearned 0.73057914 0.85666712
## genregeneral fiction -1.21281823 -1.02844049
## genrescience fiction -1.20553672 -0.87653641
## genreadventure -1.59970620 -1.40414736
## genreromance -1.49778429 -1.30507676
## langBrE 0.06386251 0.14426712
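Since these estimates are on the logit (log-odds) scale, exponentiating turns them into odds ratios. For instance, the langBrE coefficient from the summary above implies that the odds of a passive are about 11% higher in BrE than in AmE, other predictors held constant:

```r
b_lang <- 0.10406                  # langBrE estimate from summary(GLM)
exp(b_lang)                        # odds ratio BrE vs. AmE, approx. 1.11
ci <- c(0.06386251, 0.14426712)    # langBrE row of confint(GLM)
exp(ci)                            # the interval on the odds-ratio scale
```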
Graphical representation of the effect size:
plot(Effect("genre", GLM), rotx=30, ylim=c(-3,0))
plot(Effect("lang", GLM), rotx=30, ylim=c(-3,0))
Interaction terms:
eff <- Effect(c("genre", "lang"), GLM)
print(eff)
##
## genre*lang effect
## lang
## genre AmE BrE
## press reportage 0.22881901 0.24769657
## press editorial 0.22232344 0.24083247
## skills / hobbies 0.28292756 0.30450645
## miscellaneous 0.38907278 0.41407252
## learned 0.39615591 0.42129692
## general fiction 0.08825391 0.09699339
## science fiction 0.09508157 0.10441974
## adventure 0.06202098 0.06835747
## romance 0.06812797 0.07503848
plot(eff, multiline=TRUE, rotx=30, ci.style="bars")
Q: Compare LM and GLM using the residual sum of squares and graphical diagnostics!
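As a starting point for the comparison: for an lm fit, the residual sum of squares is sum(residuals(model)^2), which coincides with deviance(model); for the GLM, residuals(GLM, type = "response") gives residuals on the probability scale. A sketch of the lm identity with invented data:

```r
set.seed(42)                      # invented toy data for illustration
toy <- data.frame(x = 1:15)
toy$y <- 3 + 0.5 * toy$x + rnorm(15)
m <- lm(y ~ x, data = toy)
rss <- sum(residuals(m)^2)
all.equal(rss, deviance(m))       # TRUE: for lm, deviance == RSS
```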