Select top terms by count, trend metric and/or name pattern

select_top_terms selects a specified number of top terms based on properties of their term frequencies, such as total count, trend or volatility. This method is typically used to select term frequency time series for plotting and exploratory analysis. See the argument descriptions for the available selection options.

select_top_terms(termFrequencies, topN = 25,
  selectBy = "most_frequent", selectTerms = NULL)

Arguments

termFrequencies	a dataframe of `term` frequencies as returned by `term_frequencies()`
topN	the number of returned top terms meeting the selection criteria in `selectBy`
selectBy	the selection approach, which determines the metric by which `term`s are sorted to select the `topN` terms. Currently, the following options are supported:
most_frequent: the default; select terms based on the total number of occurrences
trending_up: select terms with the largest upward trend; internally this is measured by the slope of a simple linear regression fit to a `term`'s frequency series
trending_down: select terms with the largest downward trend; internally this is measured by the slope of a simple linear regression fit to a `term`'s frequency series
trending: select terms with the largest upward or downward trend; internally this is measured by the absolute value of the slope of a simple linear regression fit to a `term`'s frequency series
most_volatile: select terms with the largest change throughout the covered time period; internally this is measured by the residual standard deviation of the linear model fit to a `term`'s frequency series
selectTerms	a character vector of patterns against which terms are matched for selection. Regular expression syntax can be used, e.g. if `c("^mod", "an", "el$", "^outbreak$")` is supplied for `selectTerms`, all terms that start with 'mod', contain 'an', end with 'el', or are exactly 'outbreak' are matched. The arguments `selectBy` and `selectTerms` can be combined.
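
The matching rule described for `selectTerms` (a term is selected if it matches any one of the patterns) can be illustrated with base R alone. This is a minimal sketch, not the package's internal code: the `terms` vector is made up for illustration, and combining the patterns with `|` is an assumption about how an "either/or" match could be implemented.

```r
# Hypothetical list of terms, invented for this illustration
terms <- c("model", "modern", "plan", "level", "outbreak", "outbreaks", "virus")

# The example patterns from the selectTerms description
patterns <- c("^mod", "an", "el$", "^outbreak$")

# Join the patterns with "|" so that matching any one of them is sufficient
combined <- paste(patterns, collapse = "|")

# Keep the terms that match at least one pattern
matched <- terms[grepl(combined, terms)]
print(matched)
```

Here 'outbreaks' and 'virus' are dropped: 'outbreaks' fails the anchored pattern `^outbreak$` and matches none of the others.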

Value

a dataframe specifying trend metrics employed for selecting top terms, where:

term: a unique term
n_term_total: the total number of a term's occurrences in the dataset
slope: the slope coefficient of a linear model fit to this term's time frequency series
volatility: the residual standard deviation of a linear model fit to this term's time frequency series
trend: a categorisation of the term frequency trend
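
The `n_term_total`, `slope` and `volatility` columns can be reproduced for a single term with base R. The sketch below is an approximation under stated assumptions, not the package's implementation: the `series` data frame and its column names `t` and `n` are hypothetical, and `sigma()` is used on the assumption that the residual standard deviation is the one reported by the fitted linear model.

```r
# Hypothetical frequency series for one term over six time steps
series <- data.frame(t = 1:6, n = c(2, 4, 5, 9, 10, 12))

# Simple linear regression of frequency on time
fit <- lm(n ~ t, data = series)

# Total number of occurrences of the term
n_term_total <- sum(series$n)

# Slope coefficient: positive values indicate an upward trend
slope <- unname(coef(fit)["t"])

# Residual standard deviation of the fit, used here as the volatility measure
volatility <- sigma(fit)

print(c(n_term_total = n_term_total, slope = slope, volatility = volatility))
```

Sorting terms by `slope`, by `abs(slope)` or by `volatility` would then correspond to the `trending_up`, `trending` and `most_volatile` options respectively.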