terms_dfm takes a text source with text objects associated with unique document identifiers and creates a document-feature-matrix, which can be used as input for an stm topic modeller.
terms_dfm(textData, textColumn, documentIdColumn, removeStopwords = FALSE, removeNumbers = FALSE, wordStemming = FALSE, customStopwords = NULL)
textData | a dataframe containing the text to be processed, with each row representing a distinct document |
---|---|
textColumn | the column name in |
documentIdColumn | the column name in |
removeStopwords | a Boolean indicating whether standard stopwords (see
|
removeNumbers | a Boolean indicating whether numbers should be removed from the result; default is FALSE. |
wordStemming | a Boolean indicating whether words in the text should be reduced to the word stem; default is FALSE. If TRUE, the Porter stemmer is used. |
customStopwords | a character vector specifying additional stopwords that should be removed from the result |
a document-feature-matrix of type quanteda::dfm (similar to a document-term-matrix), where a document is identified by the value in the documentIdColumn specified in the text source (i.e. textData), and a feature or term is a character sequence obtained after tokenization and all other NLP processing options have been applied to the text associated with a document.
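To make the shape of the return value concrete, the following minimal sketch builds a small quanteda::dfm directly with quanteda and inspects it. It does not call terms_dfm itself. The data frame, its column names (id, text) and the variable names are assumptions chosen for illustration. The final step uses quanteda::convert to produce the list format that stm expects, and its behaviour may vary between quanteda versions.

```r
library(quanteda)

# Hypothetical two-document text source; column names are illustrative only.
textData <- data.frame(
  id   = c("doc1", "doc2"),
  text = c("Topic models need counts", "Counts of terms per document"),
  stringsAsFactors = FALSE
)

# Build a corpus keyed by the document identifier column, then tokenize.
corp <- corpus(textData, docid_field = "id", text_field = "text")
m <- dfm(tokens(corp))

docnames(m)   # document identifiers taken from the id column
featnames(m)  # features: the lower-cased tokens found in the text

# Convert the dfm to the list structure used as input for stm.
stm_input <- convert(m, to = "stm")
names(stm_input)
```

Rows of the matrix correspond to documents and columns to features, so each cell holds the count of one feature in one document.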
Text input (textColumn) is split with a word tokenizer and tokens are further processed and filtered according to the function's options. Since the result is primarily intended as input for a topic modeller, stopwords (see tidytext) are not removed by default.
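The processing described above can be approximated with quanteda alone. The sketch below is not the package's implementation. The function name sketch_terms_dfm is hypothetical, the order of the steps is an assumption, and quanteda's English stopword list is swapped in for the tidytext list to keep the example to a single dependency.

```r
library(quanteda)

# Hypothetical re-creation of the documented options; not the actual terms_dfm source.
sketch_terms_dfm <- function(textData, textColumn, documentIdColumn,
                             removeStopwords = FALSE, removeNumbers = FALSE,
                             wordStemming = FALSE, customStopwords = NULL) {
  # One corpus document per row, named by the identifier column.
  corp <- corpus(textData, docid_field = documentIdColumn, text_field = textColumn)

  # Word tokenization; numbers are dropped only when requested.
  toks <- tokens(corp, remove_numbers = removeNumbers)

  # Assumption: quanteda's English list stands in for the tidytext stopwords.
  if (removeStopwords) {
    toks <- tokens_remove(toks, pattern = stopwords("en"))
  }
  if (!is.null(customStopwords)) {
    toks <- tokens_remove(toks, pattern = customStopwords)
  }

  # Stemming (Porter-family stemmer via SnowballC) when requested.
  if (wordStemming) {
    toks <- tokens_wordstem(toks)
  }

  # dfm() lower-cases features by default.
  dfm(toks)
}

# Usage with an illustrative data frame.
d <- data.frame(
  id  = c("a", "b"),
  txt = c("The cats are running 42 times", "Dogs run"),
  stringsAsFactors = FALSE
)
m <- sketch_terms_dfm(d, "txt", "id",
                      removeStopwords = TRUE, removeNumbers = TRUE,
                      wordStemming = TRUE)
featnames(m)
```

Removing stopwords before stemming keeps the stopword list matching against unmodified word forms, which is why the sketch applies the filters in that order.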