R/term-extract.R
extract-term-ngrams.Rd
These functions transform a text source into a dataframe of individual terms
and tokens, each with an occurrence date. The terms/tokens can be extracted as
ngrams of a specified length. terms_by_date is a wrapper around the
functions for specific types of ngrams.
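As an illustration of what "ngrams of a specified length" means here, the following minimal sketch tokenizes a short text into unigrams and bigrams with tidytext's `unnest_tokens`. It does not call the functions documented on this page; the data frame `docs` and its columns `text` and `published` are hypothetical and exist only for this example.

```r
library(tidytext)

# Hypothetical input: one document with a text and a date column
docs <- data.frame(
  text = "text mining with tidy data",
  published = as.Date("2020-01-15"),
  stringsAsFactors = FALSE
)

# Unigrams: one row per single word
unigrams <- unnest_tokens(docs, token, text, token = "words")

# Bigrams: one row per sequence of two consecutive words
bigrams <- unnest_tokens(docs, token, text, token = "ngrams", n = 2)

unigrams$token
bigrams$token
```

Each resulting row keeps the date of the document it came from, which is the shape a per-date term table needs.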
terms_by_date(textData, textColumn, dateColumn, removeNumbers = TRUE, wordStemming = TRUE, customStopwords = NULL, tokenType = "unigram")

unigrams_by_date(textData, textColumn, dateColumn, removeNumbers = TRUE, wordStemming = TRUE, customStopwords = NULL)

bigrams_by_date(textData, textColumn, dateColumn, removeNumbers = TRUE, wordStemming = TRUE, customStopwords = NULL)
| Argument | Description |
|---|---|
| textData | a dataframe containing the text to be processed |
| textColumn | a character string specifying the column name in |
| dateColumn | a character string specifying the column name in |
| removeNumbers | a Boolean indicating whether numbers should be removed from the result; default is TRUE. |
| wordStemming | a Boolean indicating whether words in the text should be reduced to the word stem; default is TRUE. |
| customStopwords | a character vector specifying additional stopwords that should be removed from the result |
| tokenType | the length of the consecutive token sequence extracted, currently only |
A dataframe with three columns listing every individual term
occurrence in the provided text source. Here occur is the
publication date associated with an original token, which has been
processed/reduced to term. If no stemming has been applied, the term
and token in the result are identical.
Text input (textColumn) is split with a word tokenizer, default
stopwords (see tidytext) are removed, and
tokens are further processed and filtered according to the function's
options. A term is the character sequence obtained after all NLP
processing options this function offers have been applied. The most important
of these is stemming, for which the Porter stemmer from the
SnowballC package is used.
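The steps described above can be approximated for unigrams as follows. This is a sketch under stated assumptions, not the package's actual implementation: the input data frame `docs`, its column names, the `custom_stopwords` vector, and the order of the filtering steps are all hypothetical. Only `unnest_tokens` and `stop_words` (tidytext) and `wordStem` (SnowballC) are taken from the packages the text names.

```r
library(dplyr)
library(tidytext)
library(SnowballC)

# Hypothetical input: one row per document, with a text and a date column
docs <- data.frame(
  text = c("Running models on 3 clusters", "The model runs faster"),
  published = as.Date(c("2020-01-15", "2020-02-01")),
  stringsAsFactors = FALSE
)

# Hypothetical additional stopwords (analogous to customStopwords)
custom_stopwords <- c("faster")

terms <- docs %>%
  # Split the text into lowercase word tokens, one row per token
  unnest_tokens(token, text, token = "words") %>%
  # Remove tidytext's default stopwords
  anti_join(stop_words, by = c("token" = "word")) %>%
  # Remove the additional stopwords (assumed to be matched before stemming)
  filter(!token %in% custom_stopwords) %>%
  # Drop tokens containing digits (analogous to removeNumbers = TRUE)
  filter(!grepl("[0-9]", token)) %>%
  # Reduce each token to its stem with the Porter stemmer
  mutate(term = wordStem(token, language = "porter")) %>%
  # Keep the three columns described in the return value
  select(occur = published, token, term)

terms
```

Skipping the `mutate` step and copying `token` into `term` would correspond to the case where no stemming is applied, so that term and token are identical.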