R/term-extract.R
extract-term-ngrams.Rd
These functions transform a text source into a dataframe of individual terms
and tokens with an occurrence date. These terms/tokens can be extracted as
ngrams of specified length. terms_by_date
is a wrapper around the
function for specific types of ngrams.
terms_by_date(textData, textColumn, dateColumn, removeNumbers = TRUE, wordStemming = TRUE, customStopwords = NULL, tokenType = "unigram")
unigrams_by_date(textData, textColumn, dateColumn, removeNumbers = TRUE, wordStemming = TRUE, customStopwords = NULL)
bigrams_by_date(textData, textColumn, dateColumn, removeNumbers = TRUE, wordStemming = TRUE, customStopwords = NULL)
textData | a dataframe containing the text to be processed |
---|---|
textColumn | a character string specifying the column name in |
dateColumn | a character string specifying the column name in |
removeNumbers | a Boolean indicating whether numbers should be removed from the result; default is TRUE. |
wordStemming | a Boolean indicating whether words in the text should be reduced to the word stem; default is TRUE. |
customStopwords | a character vector specifying additional stopwords that should be removed from the result |
tokenType | the length of the consecutive token sequence extracted, currently only |
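
To illustrate what the length of the consecutive token sequence means, here is a minimal sketch that tokenizes the same sentence into unigrams and bigrams with `tidytext::unnest_tokens`. It assumes the tidytext package is installed; the data frame `docs` and its column name are made up for illustration, and the sketch is not necessarily how these functions tokenize internally.

```r
library(tidytext)

# Hypothetical one-row input; the column name "text" is illustrative only
docs <- data.frame(text = "term extraction by date", stringsAsFactors = FALSE)

# Unigrams: one row per single word
unigrams <- unnest_tokens(docs, output = token, input = text, token = "words")

# Bigrams: one row per sequence of two consecutive words
bigrams <- unnest_tokens(docs, output = token, input = text,
                         token = "ngrams", n = 2)

print(unigrams$token)
print(bigrams$token)
```

A sentence of four words yields four unigrams but only three bigrams, because each bigram spans two adjacent words.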
A dataframe with three columns listing all individual term occurrences in the provided text source, where occur is the publication date associated with an original token, which has been processed/reduced to term; if no stemming has been applied, the term and token in the result are identical.
Text input (textColumn) is split with a word tokenizer, default stopwords (see tidytext) are removed, and tokens are further processed and filtered according to the function's options. A term is the character sequence obtained after all NLP processing options this function offers have been applied, most importantly stemming. Stemming uses the Porter stemmer from the SnowballC package.
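
The processing steps described above can be approximated with a short pipeline. This is a sketch under stated assumptions, not the package's actual implementation: it assumes dplyr, tidytext and SnowballC are installed, uses a made-up input data frame `docs` with illustrative column names, and guesses at the order of the filtering steps.

```r
library(dplyr)
library(tidytext)
library(SnowballC)

# Hypothetical input; column names "text" and "published" are illustrative only
docs <- data.frame(
  text = c("The cats were running 3 races", "Dogs ran quickly"),
  published = as.Date(c("2020-01-01", "2020-01-02")),
  stringsAsFactors = FALSE
)

terms <- docs %>%
  # split the text column into one lowercased word per row
  unnest_tokens(output = token, input = text, token = "words") %>%
  # remove the default stopwords shipped with tidytext
  anti_join(stop_words, by = c("token" = "word")) %>%
  # drop tokens containing digits (analogous to removeNumbers = TRUE)
  filter(!grepl("[0-9]", token)) %>%
  # reduce each token to its Porter stem (analogous to wordStemming = TRUE)
  mutate(term = wordStem(token, language = "porter")) %>%
  # arrange the three result columns described in the value section
  select(occur = published, token, term)

print(terms)
```

Each row pairs the date of the source document with the original token and its stemmed term, so a token such as "cats" appears with the term "cat".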