Try entering some basic calculations!
To work with data, we use objects. Objects are stored in the current environment. Single values, data structures and functions are all objects.
To access an object, it needs a name. This name should start with a letter; after that, you’re free to use whatever combination of letters, numbers and some special characters (like dots or underscores) you like.
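A minimal sketch of such an assignment, using the name my_first_object and the value 42 that appear in the output further down. The arrow <- is R’s assignment operator.

```r
# Store the value 42 in an object called my_first_object
my_first_object <- 42
```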
Execute the line above by placing the cursor somewhere in it and pressing CTRL+Enter (or by clicking “Run” in RStudio).
Notice that it will now appear in the environment.
You can now access the object’s value by entering its name:
## [1] 42
Create a few more objects. You can also assign the value of an existing object to a new object (thereby copying it), do calculations with objects and assign the result of these to objects. (What happens when you multiply an object by itself and assign the result to the same object?)
You can also see the objects in the current environment by executing this function:
## [1] "my_first_object"
To get rid of an object, use rm():
To get rid of all objects in the current environment:
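The three steps described above (listing, removing one object, removing everything) might look like this. The object my_copy is a made-up name for illustration only.

```r
# Illustrative objects (my_copy is a hypothetical name)
my_first_object <- 42
my_copy <- my_first_object

ls()              # names of all objects in the current environment
rm(my_copy)       # remove a single object
rm(list = ls())   # remove every object in the current environment
```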
Functions like these can be recognised by the round brackets.
A function has a name (the characters before the brackets) and may have different arguments (between the brackets), separated by commas. Arguments may be either optional or obligatory.
For example, the following function produces five random numbers between 1 and 10:
## [1] 9 5 10 7 4
To take a sample with replacement (allowing the same number to be drawn multiple times), we can specify the optional argument replace:
## [1] 7 1 1 9 3 4 10 1 10 6 2
To look up what a function does and how it works, you can access the built-in documentation by typing ? followed by the function’s name: ?sample
If you enter a function’s arguments in the exact same order as seen in its documentation, you don’t need to specify the names of its arguments. If you do specify them, however, you are free to enter them in any order you want:
## [1] 64 46 43 38 56
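The sample() calls discussed above could look roughly like this. The sample sizes are read off the outputs shown; the range 1:100 in the last call is an assumption. Since the numbers are random, your results will differ.

```r
# Five random numbers between 1 and 10, drawn without replacement
sample(1:10, 5)

# Eleven draws from ten numbers are only possible with replacement
sample(1:10, 11, replace = TRUE)

# With named arguments, the order no longer matters
# (the range 1:100 is an assumption for illustration)
sample(size = 5, x = 1:100)
```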
R already provides quite a lot of functions, but sooner or later, you’ll need some more …
A package is a collection of functions and/or data sets, usually for a certain range of applications (e.g. plotting, linear mixed-effects models, corpus analysis, …).
When packages are installed, they are stored locally (e.g. on a hard drive). The set of installed packages can be thought of as a library: if you need a certain package in your current session, you can check it out (thus activating it).
To install a package (or several, by providing a vector of package names):
install.packages("name_of_package")
install.packages(c("package1", "package2", "package3"))
By default, dependencies are also installed (= packages which are required for your new package to work properly).
To activate an installed package:
library("name_of_package")
library(name_of_package)
(For whatever reason, quotation marks are optional in this case.)
You can also use RStudio to install, update, activate and deactivate packages.
A much more extensive tutorial (useful even for advanced users): https://www.datacamp.com/community/tutorials/rpackages-guide
We’ll need some packages later, so let’s activate them:
## -- Attaching packages ------------------------------------------------------------------------ tidyverse 1.2.1.9000 --
## v ggplot2 3.2.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts -------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
To check if values are equal, if one is greater than another etc., we need logical operators.
Are a and b equal?
## [1] FALSE
Is a greater than b?
## [1] TRUE
Also:
## [1] TRUE
## [1] FALSE
## [1] FALSE
## [1] TRUE
On its own, the exclamation mark (“not”) negates an expression:
## [1] TRUE
We can use & (AND) and | (OR) to combine conditions:
## [1] TRUE
## [1] TRUE
If we want to know whether exactly one of the two sides is TRUE, we need XOR (exclusive OR):
## [1] TRUE
## [1] FALSE
Side note: There’s also && and ||, which behave somewhat differently – you’ll need these for if statements.
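Here is a compact sketch of the operators from this section. The values of a and b are chosen for illustration; they are not necessarily the ones used to produce the outputs above.

```r
# Two example values (assumed for this sketch)
a <- 5
b <- 3

a == b       # equal?                   FALSE
a > b        # greater than?            TRUE
a != b       # not equal?               TRUE
a <= b       # less than or equal?      FALSE
!(a == b)    # negation                 TRUE

a > b & b > 0       # AND: both sides must be TRUE
a < b | b > 0       # OR: at least one side must be TRUE
xor(a > b, b > 0)   # XOR: exactly one side must be TRUE (FALSE here)

# && and || expect single values, which is what an if statement needs
if (a > b && b > 0) {
  message("both conditions hold")
}
```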
There are lots of types of data in R. Luckily, we won’t need all of them.
These are the most important basic types:
(Use typeof() to determine the basic type of objects.)
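A few typeof() calls on simple values, as a quick illustration. The example values are arbitrary.

```r
typeof(42)        # "double"
typeof(42L)       # "integer" (the L suffix marks an integer literal)
typeof("ideas")   # "character"
typeof(TRUE)      # "logical"
```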
Data structures:
Probably the most important data structure in R, a vector contains elements of the same basic type (for different types, you’ll need lists).
## [1] 3.0 4.6 64.0 42.0
Vectors of characters/strings or logical values are also possible:
## [1] "colourless" "green" "ideas"
## [1] TRUE FALSE TRUE TRUE FALSE
To get the number of elements in a vector, use length():
## [1] 4
We can use square brackets to access elements of a vector:
## [1] 3
## [1] 4.6
## [1] "green"
We can also use vectors of numbers to access several elements:
## [1] 3.0 4.6 64.0
## [1] FALSE TRUE TRUE
## [1] "colourless" "ideas"
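The outputs above are consistent with calls like the following. The vector contents are taken from the outputs; the object names numbers, strings and logicals are assumptions.

```r
# c() combines values into a vector
numbers  <- c(3, 4.6, 64, 42)
strings  <- c("colourless", "green", "ideas")
logicals <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

length(numbers)     # 4

# Single elements by position (R starts counting at 1)
numbers[1]          # 3
numbers[2]          # 4.6
strings[2]          # "green"

# Several elements at once, using a vector of positions
numbers[1:3]        # 3.0 4.6 64.0
logicals[2:4]       # FALSE TRUE TRUE
strings[c(1, 3)]    # "colourless" "ideas"
```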
Vectors can be part of new vectors:
## [1] 3.0 4.6 64.0 42.0 48.0 120.0 5.0 32.0
To sort a vector, use sort():
## [1] 3.0 4.6 5.0 32.0 42.0 48.0 64.0 120.0
## [1] 120.0 64.0 48.0 42.0 32.0 5.0 4.6 3.0
Mathematical operators and many functions are vectorised, which means that when applied to a vector, they work element by element and you get a vector in return:
## [1] 5.0 6.6 66.0 44.0 50.0 122.0 7.0 34.0
## [1] 9.0 13.8 192.0 126.0 144.0 360.0 15.0 96.0
## [1] 1.732051 2.144761 8.000000 6.480741 6.928203 10.954451 2.236068
## [8] 5.656854
## [1] 3 5 64 42 48 120 5 32
## [1] 3.0 4.6 64.0 42.0 48.0 120.0 5.0 32.0
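A sketch of calls that would produce outputs like the ones above. The values come from the outputs; the object names are assumptions. None of these calls changes more_numbers itself, because the results are not assigned.

```r
numbers <- c(3, 4.6, 64, 42)

# An existing vector can be an element of a new one
more_numbers <- c(numbers, 48, 120, 5, 32)

sort(more_numbers)                      # ascending
sort(more_numbers, decreasing = TRUE)   # descending

# Vectorised operations work element by element
more_numbers + 2
more_numbers * 3
sqrt(more_numbers)
round(more_numbers)
```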
R provides some useful methods to search inside of vectors:
Use %in% to check if a vector contains a certain value:
## [1] TRUE
which() returns the position(!) of elements that meet your conditions (by using logical operators, see above):
## [1] 4
## [1] 3 4 5 6
Since this is a vector itself, you can use it to access the elements of the original vector by their position:
## [1] 64 42 48 120
But this might be a little easier:
## [1] 64 42 48 120
You can also combine conditions:
## [1] 42 48
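The searches above could be written as follows. The thresholds 40 and 50 are assumptions that happen to reproduce the outputs shown.

```r
numbers <- c(3, 4.6, 64, 42, 48, 120, 5, 32)

42 %in% numbers                        # TRUE
which(numbers == 42)                   # 4
which(numbers > 40)                    # 3 4 5 6

# Positions can be used as an index ...
numbers[which(numbers > 40)]           # 64 42 48 120
# ... but a logical condition works directly as well
numbers[numbers > 40]                  # 64 42 48 120

# Combined conditions
numbers[numbers > 40 & numbers < 50]   # 42 48
```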
We can import data in a number of ways. R prefers CSV files, but there are packages to read in other file formats (Excel, SPSS, JSON, etc.).
Those with some R experience probably already know read.table(), read.csv(), read.csv2() etc. Alternatively, you can read in a data set as a tibble, which is a little faster. For really large files, the data.table package offers the function fread().
Which function is best suited to read in a specific file depends on the file format and the file’s formatting (field separator, decimal point, etc.). The exception is fread(), which usually figures this out by itself.
For CSV files in European format (semicolon as field separator, comma as decimal point), use read_csv2():
## Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
## Parsed with column specification:
## cols(
## Lemma = col_character(),
## s.Genitiv = col_double(),
## es.Genitiv = col_double()
## )
## # A tibble: 17,512 x 3
## Lemma s.Genitiv es.Genitiv
## <chr> <dbl> <dbl>
## 1 Leben 3761 0
## 2 Blog 2570 0
## 3 Internet 1847 0
## 4 Artikel 1757 0
## 5 Erachten 1666 0
## 6 Monat 1562 6
## 7 Spiel 1479 192
## 8 Wissen 1463 0
## 9 Unternehmen 1260 0
## 10 Film 1241 265
## # ... with 17,502 more rows
To specify data types for certain columns:
## Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
## # A tibble: 17,512 x 3
## Lemma s.Genitiv es.Genitiv
## <chr> <int> <int>
## 1 Leben 3761 0
## 2 Blog 2570 0
## 3 Internet 1847 0
## 4 Artikel 1757 0
## 5 Erachten 1666 0
## 6 Monat 1562 6
## 7 Spiel 1479 192
## 8 Wissen 1463 0
## 9 Unternehmen 1260 0
## 10 Film 1241 265
## # ... with 17,502 more rows
The first argument is a file path. Since the folder “data” is located in my current working directory, I don’t need to specify the full/absolute path.
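Since the original file is not available here, the following sketch writes three rows from the output above to a temporary file and reads them back. The file and object names are made up; the cols() specification is one way to request integer columns, not necessarily the call used for the output above.

```r
library(readr)

# Write a tiny semicolon-separated file so the example is self-contained
demo_path <- tempfile(fileext = ".csv")
writeLines(c("Lemma;s.Genitiv;es.Genitiv",
             "Leben;3761;0",
             "Monat;1562;6",
             "Spiel;1479;192"),
           demo_path)

# Let readr guess the column types (numbers become doubles)
demo <- read_csv2(demo_path)

# Request integer columns explicitly; columns not named here are guessed
demo_int <- read_csv2(demo_path,
                      col_types = cols(s.Genitiv = col_integer(),
                                       es.Genitiv = col_integer()))
demo_int
```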
Alternatively, you can use file.choose() to select a file: read_csv2(file.choose())
Try it out!
There’s also read_csv() for classic CSV files (comma as field separator, dot as decimal point), read_tsv() for files with tabs as field separators, and read_delim(), the parent function where you can specify everything yourself.
RStudio offers some options to read in files a little more comfortably: File -> Import Dataset
Having selected fitting options to import your data and having clicked “Import”, you can see the R command on the console. You can then copy it to your script to speed up the process in the future.
Example: Opening an Excel file:
## # A tibble: 17,512 x 3
## Lemma s.Genitiv es.Genitiv
## <chr> <dbl> <dbl>
## 1 Leben 3761 0
## 2 Blog 2570 0
## 3 Internet 1847 0
## 4 Artikel 1757 0
## 5 Erachten 1666 0
## 6 Monat 1562 6
## 7 Spiel 1479 192
## 8 Wissen 1463 0
## 9 Unternehmen 1260 0
## 10 Film 1241 265
## # ... with 17,502 more rows
If you’ve got your own data with you, now’s the time to try to open it!
#### Accessing parts of a data set
To access a column (usually a statistical variable), enter the data set’s name, followed by a dollar sign and the name of the column. We get a vector of values (wrapping the call in head() keeps R from displaying all of them):
## [1] "Leben" "Blog" "Internet" "Artikel" "Erachten" "Monat"
## [1] 3761 2570 1847 1757 1666 1562
Just as with vectors, you can use square brackets to subset a data set. You just have to provide two values: row and column.
## # A tibble: 1 x 1
## es.Genitiv
## <dbl>
## 1 0
## # A tibble: 1 x 2
## Lemma es.Genitiv
## <chr> <dbl>
## 1 Internet 0
## # A tibble: 1 x 3
## Lemma s.Genitiv es.Genitiv
## <chr> <dbl> <dbl>
## 1 Internet 1847 0
## # A tibble: 17,512 x 1
## es.Genitiv
## <dbl>
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 6
## 7 192
## 8 0
## 9 0
## 10 265
## # ... with 17,502 more rows
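To make the indexing concrete, here is a sketch on a four-row stand-in built from the first rows shown above. The name gen_blogs_demo is made up; the row and column numbers are chosen to match the outputs.

```r
library(tibble)

# Small stand-in for the full data set (first four rows of the output above)
gen_blogs_demo <- tibble(
  Lemma      = c("Leben", "Blog", "Internet", "Artikel"),
  s.Genitiv  = c(3761, 2570, 1847, 1757),
  es.Genitiv = c(0, 0, 0, 0)
)

head(gen_blogs_demo$Lemma)    # one column as a plain vector

gen_blogs_demo[3, 3]          # row 3, column 3
gen_blogs_demo[3, c(1, 3)]    # row 3, columns 1 and 3
gen_blogs_demo[3, ]           # row 3, all columns
gen_blogs_demo[, 3]           # all rows, column 3
```

Leaving the position before or after the comma empty selects all rows or all columns, respectively.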
To select certain columns, select() is also useful:
## # A tibble: 17,512 x 3
## es.Genitiv s.Genitiv Lemma
## <dbl> <dbl> <chr>
## 1 0 3761 Leben
## 2 0 2570 Blog
## 3 0 1847 Internet
## 4 0 1757 Artikel
## 5 0 1666 Erachten
## 6 6 1562 Monat
## 7 192 1479 Spiel
## 8 0 1463 Wissen
## 9 0 1260 Unternehmen
## 10 265 1241 Film
## # ... with 17,502 more rows
## # A tibble: 17,512 x 2
## Lemma s.Genitiv
## <chr> <dbl>
## 1 Leben 3761
## 2 Blog 2570
## 3 Internet 1847
## 4 Artikel 1757
## 5 Erachten 1666
## 6 Monat 1562
## 7 Spiel 1479
## 8 Wissen 1463
## 9 Unternehmen 1260
## 10 Film 1241
## # ... with 17,502 more rows
You can also rename variables:
## # A tibble: 17,512 x 3
## Lemma s_genitive es_genitive
## <chr> <dbl> <dbl>
## 1 Leben 3761 0
## 2 Blog 2570 0
## 3 Internet 1847 0
## 4 Artikel 1757 0
## 5 Erachten 1666 0
## 6 Monat 1562 6
## 7 Spiel 1479 192
## 8 Wissen 1463 0
## 9 Unternehmen 1260 0
## 10 Film 1241 265
## # ... with 17,502 more rows
If you just want to rename a column while keeping all other columns, rename() might be more practical:
## # A tibble: 17,512 x 3
## Lemma s_genitive es.Genitiv
## <chr> <dbl> <dbl>
## 1 Leben 3761 0
## 2 Blog 2570 0
## 3 Internet 1847 0
## 4 Artikel 1757 0
## 5 Erachten 1666 0
## 6 Monat 1562 6
## 7 Spiel 1479 192
## 8 Wissen 1463 0
## 9 Unternehmen 1260 0
## 10 Film 1241 265
## # ... with 17,502 more rows
select() is also useful to change the order of columns:
## # A tibble: 17,512 x 3
## s.Genitiv Lemma es.Genitiv
## <dbl> <chr> <dbl>
## 1 3761 Leben 0
## 2 2570 Blog 0
## 3 1847 Internet 0
## 4 1757 Artikel 0
## 5 1666 Erachten 0
## 6 1562 Monat 6
## 7 1479 Spiel 192
## 8 1463 Wissen 0
## 9 1260 Unternehmen 0
## 10 1241 Film 265
## # ... with 17,502 more rows
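The select() and rename() outputs above are consistent with calls like these, shown on a small stand-in tibble. The name gen_blogs_demo is made up, and everything() is just one way of saying “all remaining columns”.

```r
library(dplyr)

gen_blogs_demo <- tibble::tibble(
  Lemma      = c("Leben", "Monat", "Spiel"),
  s.Genitiv  = c(3761, 1562, 1479),
  es.Genitiv = c(0, 6, 192)
)

# Pick columns (in the order given)
gen_blogs_demo %>% select(es.Genitiv, s.Genitiv, Lemma)
gen_blogs_demo %>% select(Lemma, s.Genitiv)

# Rename while selecting: new_name = old_name
renamed <- gen_blogs_demo %>%
  select(Lemma, s_genitive = s.Genitiv, es_genitive = es.Genitiv)

# Rename one column, keep all others
renamed_one <- gen_blogs_demo %>% rename(s_genitive = s.Genitiv)

# Move one column to the front
reordered <- gen_blogs_demo %>% select(s.Genitiv, everything())
```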
You’ll often want to get parts of a data set not according to their position, but according to certain conditions which must be fulfilled. That’s what filter() is for.
gen_blogs has 17512 rows – let’s just use the lemmas which appear at least five times in any form (arbitrary choice):
## # A tibble: 4,360 x 3
## Lemma s.Genitiv es.Genitiv
## <chr> <dbl> <dbl>
## 1 Leben 3761 0
## 2 Blog 2570 0
## 3 Internet 1847 0
## 4 Artikel 1757 0
## 5 Erachten 1666 0
## 6 Monat 1562 6
## 7 Spiel 1479 192
## 8 Wissen 1463 0
## 9 Unternehmen 1260 0
## 10 Film 1241 265
## # ... with 4,350 more rows
If several conditions have to be fulfilled, they can be separated by commas:
## # A tibble: 16 x 3
## Lemma s.Genitiv es.Genitiv
## <chr> <dbl> <dbl>
## 1 Spiel 1479 192
## 2 Film 1241 265
## 3 Projekt 1215 725
## 4 Beitrag 757 281
## 5 Licht 406 128
## 6 Begriff 376 171
## 7 Vortrag 307 135
## 8 Gerät 242 366
## 9 Netzwerk 181 104
## 10 Bundestag 164 367
## 11 Produkt 136 189
## 12 Werk 133 348
## 13 Vertrag 126 189
## 14 Verlag 125 124
## 15 Widerstand 122 106
## 16 Protest 107 110
Logical AND works the same way:
## # A tibble: 16 x 3
## Lemma s.Genitiv es.Genitiv
## <chr> <dbl> <dbl>
## 1 Spiel 1479 192
## 2 Film 1241 265
## 3 Projekt 1215 725
## 4 Beitrag 757 281
## 5 Licht 406 128
## 6 Begriff 376 171
## 7 Vortrag 307 135
## 8 Gerät 242 366
## 9 Netzwerk 181 104
## 10 Bundestag 164 367
## 11 Produkt 136 189
## 12 Werk 133 348
## 13 Vertrag 126 189
## 14 Verlag 125 124
## 15 Widerstand 122 106
## 16 Protest 107 110
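A sketch of the filter() calls on a five-row stand-in taken from the outputs above. The threshold of 100 in the two-condition examples is an assumption; it is consistent with the 16 rows shown, but the original call is not part of this text.

```r
library(dplyr)

gen_blogs_demo <- tibble::tibble(
  Lemma      = c("Leben", "Monat", "Spiel", "Film", "Projekt"),
  s.Genitiv  = c(3761, 1562, 1479, 1241, 1215),
  es.Genitiv = c(0, 6, 192, 265, 725)
)

# Lemmas that occur at least five times in any form
frequent <- gen_blogs_demo %>% filter(s.Genitiv + es.Genitiv >= 5)

# Several conditions, separated by commas ...
both_comma <- gen_blogs_demo %>% filter(s.Genitiv > 100, es.Genitiv > 100)

# ... give the same rows as a logical AND
both_and <- gen_blogs_demo %>% filter(s.Genitiv > 100 & es.Genitiv > 100)
```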
Try to …
s.Genitiv is exactly 100
Besides “Äußer”, there are some other lemmas in the data set which shouldn’t be in there.
Let’s throw them out by using %in%:
gen_blogs <- gen_blogs %>% filter(!(Lemma %in%
c("Äußer", "Inner", "Wichtiger",
"Schlimmer", "Besser", "Neu")))
More cleanup, using string functions and regular expressions:
Words ending in -nis have been improperly lemmatised (-niss):
## [1] "Bündniss" "Ereigniss"
## [3] "Verhältniss" "Ergebniss"
## [5] "Verständniss" "Aktionsbündniss"
## [7] "Gedächtniss" "Selbstverständniss"
## [9] "Verzeichniss" "Wahlergebniss"
## [11] "Gefängniss" "Bedürfniss"
## [13] "Arbeitsverhältniss" "Bekenntniss"
## [15] "Geheimniss" "Wahlgeheimniss"
## [17] "Beschäftigungsverhältniss" "Geständniss"
## [19] "Missverständniss" "Bankgeheimniss"
## [21] "Kapitalverhältniss" "Erlebniss"
## [23] "Unverständniss" "Briefgeheimniss"
## [25] "Presseerzeugniss" "Fernmeldegeheimniss"
## [27] "Vertragsverhältniss" "Einverständniss"
## [29] "Gleichniss" "Inhaltsverzeichniss"
## [31] "Mietverhältniss" "Arbeitsgedächtniss"
## [33] "Begräbniss" "Jahrhundertereigniss"
## [35] "Textverständniss" "Untersuchungsergebniss"
## [37] "Verhängniss" "Ärgerniss"
There are also very few lemmas with non-alphanumerical characters at the end:
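One possible way to do both cleanup steps with stringr, shown on a short example vector. The first two strings come from the list above; "Film-" is a made-up example of a lemma with a trailing non-alphanumeric character. The regular expressions are a sketch, not necessarily the ones used for the original data.

```r
library(stringr)

lemmas <- c("Ergebniss", "Ereigniss", "Blog", "Film-")  # "Film-" is hypothetical

# Which lemmas end in -niss? ($ anchors the pattern at the end of the string)
str_detect(lemmas, "niss$")

# Replace the faulty ending
lemmas <- str_replace(lemmas, "niss$", "nis")

# Remove any run of non-alphanumeric characters at the end
lemmas <- str_remove(lemmas, "[^[:alnum:]]+$")
lemmas
```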
If you want to add a column to an existing data.frame, tibble or data.table, the vector needs to have the same length as the other columns.
There are quite a few ways to do this. In my opinion, the easiest one is this:
## # A tibble: 4,354 x 4
## Lemma s.Genitiv es.Genitiv Length
## <chr> <dbl> <dbl> <int>
## 1 Leben 3761 0 5
## 2 Blog 2570 0 4
## 3 Internet 1847 0 8
## 4 Artikel 1757 0 7
## 5 Erachten 1666 0 8
## 6 Monat 1562 6 5
## 7 Spiel 1479 192 5
## 8 Wissen 1463 0 6
## 9 Unternehmen 1260 0 11
## 10 Film 1241 265 4
## # ... with 4,344 more rows
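A Length column like the one above can be added by assigning a vector to a new column name with the dollar sign. This sketch assumes that Length is the number of characters in the lemma, which matches the values shown; gen_blogs_demo is a made-up stand-in.

```r
gen_blogs_demo <- tibble::tibble(
  Lemma      = c("Leben", "Blog", "Internet"),
  s.Genitiv  = c(3761, 2570, 1847),
  es.Genitiv = c(0, 0, 0)
)

# nchar() returns one integer per lemma, so the lengths match
gen_blogs_demo$Length <- nchar(gen_blogs_demo$Lemma)
gen_blogs_demo
```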
mutate() can be used to add several columns at once, to change existing columns, and to do calculations with columns:
gen_blogs <- gen_blogs %>% mutate(Total = s.Genitiv + es.Genitiv,
Frac_es = round(es.Genitiv / Total, 2))
gen_blogs
## # A tibble: 4,354 x 6
## Lemma s.Genitiv es.Genitiv Length Total Frac_es
## <chr> <dbl> <dbl> <int> <dbl> <dbl>
## 1 Leben 3761 0 5 3761 0
## 2 Blog 2570 0 4 2570 0
## 3 Internet 1847 0 8 1847 0
## 4 Artikel 1757 0 7 1757 0
## 5 Erachten 1666 0 8 1666 0
## 6 Monat 1562 6 5 1568 0
## 7 Spiel 1479 192 5 1671 0.11
## 8 Wissen 1463 0 6 1463 0
## 9 Unternehmen 1260 0 11 1260 0
## 10 Film 1241 265 4 1506 0.18
## # ... with 4,344 more rows
Optional step: add a new column with the number of syllables. This requires the packages sylly and sylly.de:
# install.packages("sylly")
# install.packages("sylly.de", repos = "https://undocumeantit.github.io/repos/l10n")
## Loading required package: sylly
## Hyphenation (language: de)
Use arrange() to change the order of rows:
## # A tibble: 4,354 x 7
## Lemma s.Genitiv es.Genitiv Length Total Frac_es Syllables
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Jahr 6 4378 4 4384 1 1
## 2 Tag 43 3401 3 3444 0.99 1
## 3 Land 7 2659 4 2666 1 1
## 4 Buch 0 1669 4 1669 1 1
## 5 Staat 48 1585 5 1633 0.97 1
## 6 Wort 31 1062 4 1093 0.97 1
## 7 Text 43 970 4 1013 0.96 1
## 8 Kind 3 896 4 899 1 1
## 9 Volk 29 879 4 908 0.97 1
## 10 Projekt 1215 725 7 1940 0.37 2
## # ... with 4,344 more rows
Use desc() to sort in descending order.
You can also sort by several columns:
## # A tibble: 4,354 x 7
## Lemma s.Genitiv es.Genitiv Length Total Frac_es Syllables
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 DJ 18 0 2 18 0 1
## 2 Ei 0 7 2 7 1 1
## 3 Öl 45 5 2 50 0.1 1
## 4 Abo 22 0 3 22 0 1
## 5 Abt 0 18 3 18 1 1
## 6 Akt 0 9 3 9 1 1
## 7 All 32 0 3 32 0 1
## 8 Amt 15 186 3 201 0.93 1
## 9 Arm 4 8 3 12 0.67 1
## 10 Bad 0 20 3 20 1 1
## # ... with 4,344 more rows
## # A tibble: 4,354 x 7
## Lemma s.Genitiv es.Genitiv Length Total Frac_es Syllables
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Jugendmedienschutz-~ 11 11 32 22 0.5 9
## 2 Bundesverfassungsge~ 9 0 31 9 0 9
## 3 Mammographie-Screen~ 5 0 31 5 0 7
## 4 Jugendmedienschutzs~ 2 13 31 15 0.87 9
## 5 Urheberrechtswahrne~ 0 8 31 8 1 9
## 6 Bundesverteidigungs~ 7 0 30 7 0 11
## 7 Beschäftigtendatens~ 0 5 30 5 1 9
## 8 Bundesgesundheitsmi~ 22 0 28 22 0 10
## 9 Bundeswirtschaftsmi~ 18 0 28 18 0 9
## 10 Verbraucherschutzmi~ 11 0 28 11 0 9
## # ... with 4,344 more rows
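The three sort orders above can be sketched as follows, on four rows taken from the first output. The sorting columns are inferred from the outputs (es.Genitiv descending; Length, then Lemma; Length descending, then s.Genitiv descending) and should be read as an educated guess.

```r
library(dplyr)

gen_blogs_demo <- tibble::tibble(
  Lemma      = c("Jahr", "Tag", "Staat", "Projekt"),
  s.Genitiv  = c(6, 43, 48, 1215),
  es.Genitiv = c(4378, 3401, 1585, 725),
  Length     = c(4L, 3L, 5L, 7L)
)

# Highest es.Genitiv counts first
by_es <- gen_blogs_demo %>% arrange(desc(es.Genitiv))

# Shortest lemmas first; ties are broken alphabetically
by_length <- gen_blogs_demo %>% arrange(Length, Lemma)

# Longest lemmas first; ties are broken by s.Genitiv, highest first
by_length_desc <- gen_blogs_demo %>% arrange(desc(Length), desc(s.Genitiv))
```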
group_by() creates a grouped tibble.
summarise() is then used for arbitrary operations (sums, means, standard deviations, …) which are performed by group:
gen_blogs %>% group_by(Length) %>%
  summarise(Lemma_count = n(), s_genitives = sum(s.Genitiv),
            es_genitives = sum(es.Genitiv))
## # A tibble: 30 x 4
## Length Lemma_count s_genitives es_genitives
## <int> <int> <dbl> <dbl>
## 1 2 3 63 12
## 2 3 51 1223 5356
## 3 4 243 8883 21801
## 4 5 281 19506 6635
## 5 6 448 18188 3588
## 6 7 401 18062 4715
## 7 8 426 15099 2195
## 8 9 428 10238 2927
## 9 10 366 6078 1972
## 10 11 340 6441 2477
## # ... with 20 more rows
gen_blogs %>% group_by(Syllables) %>%
summarise(Lemma_count = n(), s_genitives = sum(s.Genitiv),
es_genitives = sum(es.Genitiv))
## # A tibble: 11 x 4
## Syllables Lemma_count s_genitives es_genitives
## <dbl> <int> <dbl> <dbl>
## 1 1 426 13989 34642
## 2 2 1412 56535 11310
## 3 3 1143 28994 6973
## 4 4 719 11610 2481
## 5 5 382 4210 1102
## 6 6 137 1102 642
## 7 7 89 1309 347
## 8 8 21 172 12
## 9 9 20 226 58
## 10 10 3 22 62
## 11 11 2 19 0
Does the lemma end in s, ß, z or x?
gen_blogs$Ends_in_s <- factor(ifelse(str_sub(gen_blogs$Lemma, start = -1) %in% c("s", "ß", "z", "x"), "yes", "no"))
gen_blogs
## # A tibble: 4,354 x 8
## Lemma s.Genitiv es.Genitiv Length Total Frac_es Syllables Ends_in_s
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>
## 1 Leben 3761 0 5 3761 0 2 no
## 2 Blog 2570 0 4 2570 0 1 no
## 3 Internet 1847 0 8 1847 0 3 no
## 4 Artikel 1757 0 7 1757 0 3 no
## 5 Erachten 1666 0 8 1666 0 3 no
## 6 Monat 1562 6 5 1568 0 2 no
## 7 Spiel 1479 192 5 1671 0.11 1 no
## 8 Wissen 1463 0 6 1463 0 2 no
## 9 Unternehm~ 1260 0 11 1260 0 4 no
## 10 Film 1241 265 4 1506 0.18 1 no
## # ... with 4,344 more rows
## # A tibble: 2 x 3
## Ends_in_s s es
## <fct> <dbl> <dbl>
## 1 no 118188 46660
## 2 yes 0 10969
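The last table looks like a summary grouped by Ends_in_s. A sketch of how it could be computed, on a small stand-in: the first three rows come from the outputs above, while the row for "Haus" and its counts are invented so that both groups are present.

```r
library(dplyr)
library(stringr)

gen_blogs_demo <- tibble::tibble(
  Lemma      = c("Leben", "Spiel", "Film", "Haus"),  # "Haus" row is invented
  s.Genitiv  = c(3761, 1479, 1241, 0),
  es.Genitiv = c(0, 192, 265, 10)
)

# Same construction as above: look at the last character of each lemma
gen_blogs_demo$Ends_in_s <- factor(
  ifelse(str_sub(gen_blogs_demo$Lemma, start = -1) %in% c("s", "ß", "z", "x"),
         "yes", "no")
)

# Sum both genitive variants per group
by_ending <- gen_blogs_demo %>%
  group_by(Ends_in_s) %>%
  summarise(s = sum(s.Genitiv), es = sum(es.Genitiv))
by_ending
```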