Split a text source into tokens and terms by date of occurrence

These functions transform a text source into a dataframe of individual terms and tokens, each with an occurrence date. The terms/tokens can be extracted as ngrams of a specified length. terms_by_date is a wrapper around the functions for specific types of ngrams (unigrams_by_date and bigrams_by_date).

terms_by_date(textData, textColumn, dateColumn, removeNumbers = TRUE,
  wordStemming = TRUE, customStopwords = NULL, tokenType = "unigram")

unigrams_by_date(textData, textColumn, dateColumn, removeNumbers = TRUE,
  wordStemming = TRUE, customStopwords = NULL)

bigrams_by_date(textData, textColumn, dateColumn, removeNumbers = TRUE,
  wordStemming = TRUE, customStopwords = NULL)

Arguments

textData	a dataframe containing the text to be processed
textColumn	a character string specifying the column name in `textData` containing the text to be processed
dateColumn	a character string specifying the column name in `textData` containing the publication date of the text in `textColumn`
removeNumbers	a Boolean indicating whether numbers should be removed from the result; default is TRUE.
wordStemming	a Boolean indicating whether words in the text should be reduced to the word stem; default is TRUE.
customStopwords	a character vector specifying additional stopwords that should be removed from the result; default is NULL.
tokenType	the length of the consecutive token sequence to extract; currently only `bigram` (two-word sequences) and `unigram` (single words) are supported; default is `unigram`.
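
To illustrate the difference between the two token types, the following is a minimal sketch using tidytext's `unnest_tokens` directly. It is not taken from this package's source; the data frame `example` and its column names are hypothetical, and it is only an assumption that the functions documented here tokenize in a comparable way.

```r
library(tidytext)

# Hypothetical one-row input, for illustration only
example <- data.frame(
  text = "the quick brown fox",
  stringsAsFactors = FALSE
)

# Unigrams: one row per single word
unigrams <- unnest_tokens(example, token, text, token = "words")

# Bigrams: one row per sequence of two consecutive words
bigrams <- unnest_tokens(example, token, text, token = "ngrams", n = 2)

print(unigrams$token)
print(bigrams$token)
```

A text of four words yields four unigrams but only three bigrams, so term counts from the two token types are not directly comparable.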

Value

A dataframe with three columns listing every individual term occurrence in the provided text source. occur is the publication date associated with the original token; the term is that token after processing, i.e. reduced to its stem. If no stemming has been applied, the term and token in the result are identical.

Details

Text input (textColumn) is split with a word tokenizer, default stopwords (see tidytext) are removed, and the tokens are further processed and filtered according to the function's options. A term is the character sequence obtained after all NLP processing options offered by this function have been applied. The most important of these is stemming, for which the Porter stemmer from the SnowballC package is used.
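
The processing steps described above can be approximated as follows. This is a sketch under stated assumptions, not the package's actual implementation: the function name `sketch_unigrams`, the sample data `docs`, the output column names `token` and `term`, and the regular expression used to detect numbers are all hypothetical.

```r
library(tidytext)
library(SnowballC)

# Hypothetical input; the column names are illustrative only
docs <- data.frame(
  text = c("Markets were rising in 2019", "The market rises again"),
  published = as.Date(c("2019-03-01", "2019-03-02")),
  stringsAsFactors = FALSE
)

sketch_unigrams <- function(textData, textColumn, dateColumn,
                            removeNumbers = TRUE, wordStemming = TRUE,
                            customStopwords = NULL) {
  # Keep only the date and the text, under fixed column names
  input <- data.frame(
    occur = textData[[dateColumn]],
    text = textData[[textColumn]],
    stringsAsFactors = FALSE
  )

  # Split the text into lower-cased single words, one row per token
  tokens <- unnest_tokens(input, token, text, token = "words")

  # Remove tidytext's default stopwords plus any custom ones
  stopwords <- c(tidytext::stop_words$word, customStopwords)
  tokens <- tokens[!tokens$token %in% stopwords, , drop = FALSE]

  # Optionally drop purely numeric tokens (assumed definition of "number")
  if (removeNumbers) {
    is_number <- grepl("^[0-9]+([.,][0-9]+)*$", tokens$token)
    tokens <- tokens[!is_number, , drop = FALSE]
  }

  # Optionally reduce each token to its Porter stem
  tokens$term <- if (wordStemming) {
    wordStem(tokens$token, language = "porter")
  } else {
    tokens$token
  }
  tokens
}

stemmed <- sketch_unigrams(docs, "text", "published")
unstemmed <- sketch_unigrams(docs, "text", "published", wordStemming = FALSE)
print(stemmed)
```

With `wordStemming = FALSE`, the `term` and `token` columns of the sketch are identical, mirroring the behaviour described under Value.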