terms_dfm takes a text source in which each text is associated with a unique
document identifier and creates a document-feature matrix that can be used
as input for an stm topic modeller.
terms_dfm(textData, textColumn, documentIdColumn, removeStopwords = FALSE, removeNumbers = FALSE, wordStemming = FALSE, customStopwords = NULL)
| Argument | Description |
|---|---|
| textData | a data frame containing the text to be processed, with each row representing a distinct document |
| textColumn | the column name in textData that holds the text of each document |
| documentIdColumn | the column name in textData that holds the unique document identifier |
| removeStopwords | a Boolean indicating whether standard stopwords (see tidytext) should be removed from the result; default is FALSE |
| removeNumbers | a Boolean indicating whether numbers should be removed from the result; default is FALSE |
| wordStemming | a Boolean indicating whether words in the text should be reduced to the word stem; default is FALSE. If TRUE, the Porter stemmer is used |
| customStopwords | a character vector specifying additional stopwords that should be removed from the result; default is NULL |
A document-feature matrix of type quanteda::dfm (similar to a
document-term matrix). Each document is identified by its value
in the documentIdColumn of the text source (i.e. textData), and each
feature or term is a character sequence obtained after tokenization
and after all selected NLP processing options have been applied to the
text associated with that document.
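To make the shape of the return value concrete, the following sketch builds a small quanteda::dfm directly with quanteda rather than through terms_dfm. The toy data frame and its column names (id, text) are hypothetical and chosen only for illustration.

```r
library(quanteda)

# Hypothetical toy input: one row per document
textData <- data.frame(
  id   = c("doc1", "doc2"),
  text = c("Topic models need counts", "Counts of terms per document"),
  stringsAsFactors = FALSE
)

# Build a corpus keyed by the identifier column, tokenize, and count features
corp <- corpus(textData, docid_field = "id", text_field = "text")
m    <- dfm(tokens(corp))   # dfm() lowercases tokens by default

docnames(m)    # document identifiers taken from the id column
featnames(m)   # the features (terms) that label the matrix columns
dim(m)         # documents x features
```

Rows correspond to documents and columns to features, which is the layout the description above refers to.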
Text input (textColumn) is split with a word tokenizer and
tokens are further processed and filtered according to the function's
options. Since the result is primarily intended as input for a topic
modeller, stopwords (see tidytext) are
not removed by default.
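The processing steps described here can be approximated as follows. This is a minimal sketch under stated assumptions, not the function's actual implementation: sketch_dfm is a hypothetical name, the use of tidytext::unnest_tokens, tidytext::stop_words, SnowballC::wordStem and tidytext::cast_dfm is an assumption about one reasonable way to realise each option, and the number filter is a simple regular expression chosen for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical re-creation of the described pipeline (not the package's code)
sketch_dfm <- function(textData, textColumn, documentIdColumn,
                       removeStopwords = FALSE, removeNumbers = FALSE,
                       wordStemming = FALSE, customStopwords = NULL) {
  # One row per (document, token); unnest_tokens lowercases and strips punctuation
  toks <- textData %>%
    select(doc = all_of(documentIdColumn), txt = all_of(textColumn)) %>%
    unnest_tokens(term, txt)

  if (removeStopwords) {
    toks <- filter(toks, !term %in% tidytext::stop_words$word)
  }
  if (!is.null(customStopwords)) {
    toks <- filter(toks, !term %in% customStopwords)
  }
  if (removeNumbers) {
    # Assumption: a "number" is a token made only of digits and separators
    toks <- filter(toks, !grepl("^[0-9]+([.,][0-9]+)*$", term))
  }
  if (wordStemming) {
    toks <- mutate(toks, term = SnowballC::wordStem(term, language = "porter"))
  }

  # Count terms per document and cast the counts to a quanteda dfm
  toks %>%
    count(doc, term) %>%
    cast_dfm(doc, term, n)
}

# Hypothetical toy data
textData <- data.frame(
  id   = c("a", "b"),
  text = c("The cats are running in 2 gardens", "A cat ran"),
  stringsAsFactors = FALSE
)

m <- sketch_dfm(textData, "text", "id",
                removeStopwords = TRUE, removeNumbers = TRUE,
                wordStemming = TRUE)
quanteda::featnames(m)
```

With all options left at their defaults, the sketch keeps stopwords and numbers, matching the default behaviour described above.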