Linear models

Corpora: Brown and LOB

The data set BrownLOBPassives contains frequency information about VPs in the Brown and LOB corpora (Brown: AmE, LOB: BrE):
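The code below relies on several add-on packages. A sketch of the assumed setup (the package names are an assumption and are not stated in the text; in particular, BrownLOBPassives is assumed to come from the corpora package):

```r
library(corpora)   # assumed source of the BrownLOBPassives data set
library(dplyr)     # %>% and mutate()
library(ggplot2)   # ggplot()
library(lattice)   # bwplot()
library(effects)   # Effect()
```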

Passives <- BrownLOBPassives
Passives <- Passives %>% mutate(relfreq = 100 * passive / n_s)
ggplot(Passives, aes(x = lang, y = relfreq)) +
  geom_boxplot(notch = TRUE) +
  labs(title = "Relative frequencies of passives in American and British English",
       y = "Relative frequency (in %)")

Q: Do Americans use the passive less frequently than the British?

Two explanatory variables

We can look at the effect of language variety by plotting a series of boxplots subdivided by genre:

bwplot(relfreq ~ lang | genre, data=Passives)

bwplot(relfreq ~ genre | lang, data=Passives, scales=list(x=list(rot=30)))

If we want to test the effect of two categorical variables on the frequency of passives at the same time, we can go for a linear model:

LM <- lm(relfreq ~ genre + lang, data=Passives)
summary(LM)
## 
## Call:
## lm(formula = relfreq ~ genre + lang, data = Passives)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.107  -6.320  -1.417   5.742  45.332 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            23.7575     1.2780  18.589  < 2e-16 ***
## genrepress editorial   -0.7122     1.9398  -0.367   0.7136    
## genreskills / hobbies   5.6460     1.7699   3.190   0.0015 ** 
## genremiscellaneous     17.6634     1.8787   9.402  < 2e-16 ***
## genrelearned           16.3499     1.4892  10.979  < 2e-16 ***
## genregeneral fiction  -14.4026     1.8978  -7.589 1.21e-13 ***
## genrescience fiction  -13.7013     3.4531  -3.968 8.11e-05 ***
## genreadventure        -17.9596     1.8978  -9.463  < 2e-16 ***
## genreromance          -17.0833     1.8978  -9.001  < 2e-16 ***
## langBrE                 2.0764     0.8999   2.307   0.0214 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.22 on 612 degrees of freedom
## Multiple R-squared:  0.5941, Adjusted R-squared:  0.5882 
## F-statistic: 99.54 on 9 and 612 DF,  p-value: < 2.2e-16

The summary shows us the effect of each category on the response variable. We can obtain a more concise summary with the following function:

anova(LM)
## Analysis of Variance Table
## 
## Response: relfreq
##            Df Sum Sq Mean Sq  F value  Pr(>F)    
## genre       8 112130 14016.2 111.3142 < 2e-16 ***
## lang        1    670   670.4   5.3238 0.02137 *  
## Residuals 612  77061   125.9                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here we can see which variables have a significant effect on the response variable.

Q: Interpret the outputs of summary(LM) and anova(LM). How much of the variation is explained by the model?

Model diagnostics

plot() can be used to visualize the residuals:

plot(LM)

Ideally, the residuals follow a normal distribution and are homoscedastic.

Q: Is this the case here?

Confidence intervals

The function confint(LM) allows us to directly compute confidence intervals for the effects of the variables’ levels:

confint(LM)
##                             2.5 %     97.5 %
## (Intercept)            21.2476335  26.267282
## genrepress editorial   -4.5216121   3.097148
## genreskills / hobbies   2.1701713   9.121838
## genremiscellaneous     13.9739945  21.352905
## genrelearned           13.4252380  19.274524
## genregeneral fiction  -18.1296653 -10.675492
## genrescience fiction  -20.4826933  -6.919985
## genreadventure        -21.6866815 -14.232509
## genreromance          -20.8103538 -13.356181
## langBrE                 0.3091067   3.843640

Q: Use confint() to work out which categories have a significant effect on the frequency of passives (at the 99% confidence level).
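As a hint, confint() accepts a level argument, so intervals other than the default 95% can be requested directly:

```r
confint(LM, level = 0.99)   # 0.5 % and 99.5 % bounds
```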

Partial effects

We can also visualize the partial effects of the explanatory variables:

plot(Effect("lang", LM))

plot(Effect(c("lang", "genre"), LM)) # combined effect

plot(Effect(c("lang", "genre"), LM), multiline=TRUE, ci.style="bars")

Interactions

Linear models also enable us to model interactions between the explanatory variables:

LM.interaction <- lm(relfreq ~ genre * lang, data=Passives)
anova(LM.interaction)
## Analysis of Variance Table
## 
## Response: relfreq
##             Df Sum Sq Mean Sq  F value  Pr(>F)    
## genre        8 112130 14016.2 112.6504 < 2e-16 ***
## lang         1    670   670.4   5.3877 0.02061 *  
## genre:lang   8   1909   238.7   1.9183 0.05482 .  
## Residuals  604  75151   124.4                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(Effect(c("lang", "genre"), LM.interaction), multiline=TRUE, ci.style="bars")

Q: Is the interaction term significant?

Information criteria

Which model is better, LM or LM.interaction? As well as looking at the adjusted coefficient of determination, we can also turn to a general information criterion, e.g. Akaike's Information Criterion (AIC):

AIC(LM, LM.interaction)
##                df      AIC
## LM             11 4784.831
## LM.interaction 19 4785.224

The AIC measures both the goodness of fit and the simplicity of a model. The lower the value, the better the model.
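AIC is defined as \(2k - 2\log L\), where \(k\) is the number of estimated parameters and \(L\) the maximized likelihood. For LM it can be reproduced by hand:

```r
k <- attr(logLik(LM), "df")          # number of estimated parameters (11 for LM)
2 * k - 2 * as.numeric(logLik(LM))   # same value as AIC(LM)
```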

Q: Which model is better, LM or LM.interaction? Do both AIC and adjusted R2 arrive at the same result?

Generalized Linear Model (GLM)

Linear models can pose problems when we attempt to use them to explain corpus frequencies. For instance, an LM does not model the binomial sampling distribution of the observed frequencies, and its predictions are not constrained to the valid range [0, 1] for proportions.

Generalized Linear Models (GLMs), on the other hand, are better suited to this task. We assume that the observed frequencies genuinely follow a binomial sampling distribution:

\[ f_i \sim \text{Bin}(n_i, \pi_i) \]

To map the linear predictor onto the probability of success (for passive) \(\pi_i \in [0,1]\) for text \(i\), we use a so-called link function

\[ \pi_i=\dfrac{1}{1+e^{-\theta_i}} \]

whose parameter \(\theta_i\) is determined by a linear predictor containing the explanatory variables:

\[ \theta_i=\beta_0 + \beta_1(\text{genre}_i)+\beta_2(\text{language variety}_i) \]

(Unlike in the LM, there is no separate error term \(\epsilon_i\) here: the randomness is captured by the binomial distribution itself.)
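This link function maps any real-valued \(\theta_i\) to a probability in \((0,1)\); in base R it is available as plogis():

```r
theta <- c(-4, 0, 4)
1 / (1 + exp(-theta))   # inverse logit computed by hand
plogis(theta)           # identical: plogis() is the inverse logit in base R
```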

The parameters can now be estimated from the sample via MLE. Since all this happens in the vast depths of R, we need only transform the input appropriately.

First, we need a response matrix which contains the observed absolute passive frequencies (successes) and non-passive frequencies (failures):

response.matrix <- cbind(Passives$passive, Passives$n_s - Passives$passive)

The generalized linear model can now be fitted using glm. To do this, we need to enter the formula of the underlying linear model (response.matrix ~ genre + lang) as well as the error distribution. In our example, that’s the binomial distribution. glm automatically chooses the matching (canonical) link function (in our case it’s the so-called logistic function – the inverse function of the logit):

GLM <- glm(response.matrix ~ genre + lang, family="binomial", data=Passives)
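Equivalently, the success/failure matrix can be constructed directly inside the model formula, which avoids the separate response.matrix object:

```r
# Same model as above, with the response matrix built in the formula
GLM2 <- glm(cbind(passive, n_s - passive) ~ genre + lang,
            family = "binomial", data = Passives)
```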

Using plot, we can look at model diagnostics:

plot(GLM)

Plot nr. 1: Residuals vs Fitted:
x-axis: fitted (predicted) values; y-axis: residuals (differences between observed and predicted values)

You mainly use this plot to check for non-linearity (if the relationship is non-linear, the red line will deviate from a straight course).

Plot nr. 2: Normal Q-Q:
Quantile-quantile plot: Are the residuals normally distributed? Points should ideally be very close to the line.

Plot nr. 3: Scale-Location:
This plot is similar to the first one – it’s especially suited to check the assumption of homoscedasticity. Ideally, we want a straight red line around which the points are distributed randomly and evenly.

Plot nr. 4: Residuals vs Leverage:
This plot is used to find observations which have a particularly strong influence on the regression line (by comparing how the line would look with and without individual observations).

Of interest are cases in the upper or lower right that are outside of the dotted lines.

Numbers in the plots denote row numbers of conspicuous cases. If a number is seen in several or all of the plots, you should probably look at the observation more carefully.

Here are the results:

anova(GLM, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: response.matrix
## 
## Terms added sequentially (first to last)
## 
## 
##       Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                    621    10375.1              
## genre  8   7148.1       613     3227.1 < 2.2e-16 ***
## lang   1     25.8       612     3201.3 3.869e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(GLM)
## 
## Call:
## glm(formula = response.matrix ~ genre + lang, family = "binomial", 
##     data = Passives)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.5829  -1.5141  -0.1767   1.4590   7.9327  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -1.21499    0.02856 -42.546  < 2e-16 ***
## genrepress editorial  -0.03719    0.04254  -0.874    0.382    
## genreskills / hobbies  0.28501    0.03774   7.553 4.27e-14 ***
## genremiscellaneous     0.76378    0.04109  18.588  < 2e-16 ***
## genrelearned           0.79348    0.03216  24.669  < 2e-16 ***
## genregeneral fiction  -1.12015    0.04703 -23.818  < 2e-16 ***
## genrescience fiction  -1.03812    0.08388 -12.376  < 2e-16 ***
## genreadventure        -1.50126    0.04988 -30.097  < 2e-16 ***
## genreromance          -1.40082    0.04915 -28.499  < 2e-16 ***
## langBrE                0.10406    0.02051   5.073 3.91e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10375.1  on 621  degrees of freedom
## Residual deviance:  3201.3  on 612  degrees of freedom
## AIC: 5985.4
## 
## Number of Fisher Scoring iterations: 4
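The GLM coefficients are on the logit scale; applying the inverse link with plogis() converts them back to probabilities. For example, the intercept corresponds to the reference levels (press reportage, AmE):

```r
plogis(-1.21499)             # ≈ 0.2288: predicted passive probability, AmE press reportage
plogis(-1.21499 + 0.10406)   # ≈ 0.2477: same genre in BrE (add the langBrE coefficient)
```

These values match the press reportage row of the genre-by-lang effect table further below.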

Confidence intervals for the effect size:

confint(GLM)
## Waiting for profiling to be done...
##                             2.5 %      97.5 %
## (Intercept)           -1.27118004 -1.15923175
## genrepress editorial  -0.12072090  0.04604523
## genreskills / hobbies  0.21105060  0.35898160
## genremiscellaneous     0.68326189  0.84433697
## genrelearned           0.73057914  0.85666712
## genregeneral fiction  -1.21281823 -1.02844049
## genrescience fiction  -1.20553672 -0.87653641
## genreadventure        -1.59970620 -1.40414736
## genreromance          -1.49778429 -1.30507676
## langBrE                0.06386251  0.14426712

Graphical representation of the effect size:

plot(Effect("genre", GLM), rotx=30, ylim=c(-3,0))

plot(Effect("lang", GLM), rotx=30, ylim=c(-3,0))

Interaction terms:

eff <- Effect(c("genre", "lang"), GLM)
print(eff)
## 
##  genre*lang effect
##                   lang
## genre                     AmE        BrE
##   press reportage  0.22881901 0.24769657
##   press editorial  0.22232344 0.24083247
##   skills / hobbies 0.28292756 0.30450645
##   miscellaneous    0.38907278 0.41407252
##   learned          0.39615591 0.42129692
##   general fiction  0.08825391 0.09699339
##   science fiction  0.09508157 0.10441974
##   adventure        0.06202098 0.06835747
##   romance          0.06812797 0.07503848
plot(eff, multiline=TRUE, rotx=30, ci.style="bars")

Q: Compare LM and GLM using the residual sum of squares and graphical diagnostics!