Create a document-feature-matrix from a text source

terms_dfm takes a data frame of texts, each associated with a unique document identifier, and creates a document-feature-matrix, which can be used as input for an stm topic modeller.

terms_dfm(textData, textColumn, documentIdColumn,
  removeStopwords = FALSE, removeNumbers = FALSE,
  wordStemming = FALSE, customStopwords = NULL)

Arguments

textData	a dataframe containing the text to be processed, with each row representing a distinct document
textColumn	the column name in `textData` containing the text to be processed
documentIdColumn	the column name in `textData` specifying a unique identifier for the document with the content given in `textColumn`
removeStopwords	a Boolean indicating whether standard stopwords (see `tidytext`) should be removed from the result; default is FALSE.
removeNumbers	a Boolean indicating whether numbers should be removed from the result; default is FALSE.
wordStemming	a Boolean indicating whether words in the text should be reduced to the word stem; default is FALSE. If TRUE, the Porter stemmer from the `SnowballC` package is applied.
customStopwords	a character vector specifying additional stopwords that should be removed from the result; default is NULL.

Value

a document-feature-matrix of type quanteda::dfm (similar to a document-term-matrix). Each document is identified by its value in the documentIdColumn of the text source (i.e. textData). Each feature or term is a character sequence obtained after tokenization and all other selected NLP processing options have been applied to the text associated with a document.
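
As a rough illustration of what a quanteda::dfm looks like and how it can be handed on to stm, the sketch below builds a tiny dfm directly with quanteda (not with terms_dfm) and converts it with quanteda::convert. The two example texts and the names `texts`, `example_dfm` and `stm_input` are made up for this illustration.

```r
library(quanteda)

# Hypothetical mini corpus: names of the vector become the document identifiers
texts <- c(d1 = "cats chase mice", d2 = "mice fear cats and dogs")

# Tokenize and build a document-feature matrix
example_dfm <- dfm(tokens(texts))

# Rows are documents, columns are features (terms)
docnames(example_dfm)
featnames(example_dfm)

# Convert to the list structure used by the stm package
# (elements include `documents` and `vocab`)
stm_input <- convert(example_dfm, to = "stm")
names(stm_input)
```

The converted list can then be supplied to a topic model in the stm package; model fitting is left out here because it needs a realistically sized corpus.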

Details

Text input (textColumn) is split with a word tokenizer and tokens are further processed and filtered according to the function's options. Since the result is primarily intended as input for a topic modeller, stopwords (see tidytext) are not removed by default.
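
The processing steps described above can be sketched roughly as follows. This is not the function's actual implementation, only an approximation under the assumption that tokenization and stopword handling follow tidytext conventions; the data frame `docs`, its column names, and `custom_stopwords` are hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical input: one row per document
docs <- data.frame(
  doc_id = c("d1", "d2"),
  text   = c("Cats are chasing 2 mice.", "The mouse chased the cats."),
  stringsAsFactors = FALSE
)
custom_stopwords <- c("mice")  # stands in for customStopwords

tokens <- docs %>%
  # split text into word tokens (lowercased by default)
  unnest_tokens(word, text) %>%
  # drop standard stopwords (as with removeStopwords = TRUE)
  anti_join(stop_words, by = "word") %>%
  # drop additional user-supplied stopwords
  filter(!word %in% custom_stopwords) %>%
  # drop purely numeric tokens (as with removeNumbers = TRUE)
  filter(!grepl("^[0-9]+$", word)) %>%
  # Porter stemming via SnowballC (as with wordStemming = TRUE)
  mutate(word = SnowballC::wordStem(word, language = "porter"))

# Count terms per document and cast the counts to a quanteda dfm
result_dfm <- tokens %>%
  count(doc_id, word) %>%
  cast_dfm(doc_id, word, n)
```

With all options left at their defaults, only the tokenization, counting and casting steps would apply.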