topic_frequencies summarizes the shares of topics per timebin, based on
the provided topic shares by document and date.
topic_frequencies(topicsByDocDate, timeBinUnit = "week", minGamma = 0.01, minTopicTimeBins = 0.5)
| topicsByDocDate | a dataframe as returned by
|
|---|---|
| timeBinUnit | a character string specifying the time period that
should be used as a bin unit when computing topic share frequencies. Valid
values are |
| minGamma | the minimum share of a topic per document to be considered
when summarizing topic frequencies; topics with smaller shares in an
individual document are ignored when computing topic frequencies. (In
an |
| minTopicTimeBins | a double in the range |
a dataframe with topic frequencies by chosen timebin, where:
a topic ID as provided as an input in
topicsByDocDate
the floor date of a timebin; if
timeBinUnit is set to week, this date is always a
Monday
the median of likelihoods of the topic with
topic_id in timebin
the mean of
likelihoods of the topic with topic_id in timebin
the share of topic with topic_id relative to all
topic shares recorded and included in a given timebin.
NOTE: strictly speaking these are the likelihoods that a document
is generated from a topic, which we here interpret as the share of a
document attributed to a topic.
the total number of
documents in a dataset in which a topic with topic_id occurs with a
likelihood of at least minGamma
the exact date of
the first occurrence of a topic with topic_id across the whole time
range covered by timebins
the exact date of the
latest occurrence of a topic with topic_id across the whole time
range covered by timebins; note that this date can be later than
the maximum timebin, as timebin specifies the floor date of a
time unit
the number of unique timebins in which
a topic with topic_id occurs with a likelihood of at least
minGamma
An stm topic model provides for each document the likelihood
(gamma) that it is generated from a specific topic; here we interpret
this likelihood as the share of the document attributed to that topic and
then summarize these shares per timebin to obtain the share of a topic across
all documents over time.
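The summarization described above can be sketched in base R. This is an illustration only, not the package's implementation: the input layout and the column names `date`, `gamma` and `share` are assumptions, and the week floor is computed by hand rather than with whatever helper the package uses.

```r
# Hypothetical input: one row per document/topic pair with its gamma.
# Column names are assumptions made for this sketch.
docs <- data.frame(
  topic_id = c(1, 2, 1, 2, 1),
  date     = as.Date(c("2021-03-02", "2021-03-02", "2021-03-04",
                       "2021-03-04", "2021-03-10")),
  gamma    = c(0.7, 0.3, 0.5, 0.5, 0.9)
)

# Floor each date to the Monday of its week (POSIXlt wday: 0 = Sunday).
days_since_monday <- (as.POSIXlt(docs$date)$wday + 6) %% 7
docs$timebin <- docs$date - days_since_monday

# Median and mean gamma per topic and timebin.
med <- aggregate(gamma ~ topic_id + timebin, data = docs, FUN = median)
avg <- aggregate(gamma ~ topic_id + timebin, data = docs, FUN = mean)

# Share of a topic relative to all topic shares recorded in the same timebin.
sums <- aggregate(gamma ~ topic_id + timebin, data = docs, FUN = sum)
bin_totals <- ave(sums$gamma, format(sums$timebin), FUN = sum)
sums$share <- sums$gamma / bin_totals
```

In this toy data the two documents from the week starting on Monday 2021-03-01 give topic 1 a summed gamma of 1.2 out of 2.0 in that timebin.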
Only topics whose share (likelihood) in a document is above the threshold
specified by minGamma are included. A suitable threshold might take into
account the number of topics and the average document size. An additional
filtering option is provided with minTopicTimeBins.
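A minimal sketch of the minGamma filter and the resulting per-topic counts, in base R. The data frame layout, the `doc_id` column, and the use of `>=` for the comparison are assumptions for illustration; the package may handle the boundary case differently.

```r
# Hypothetical document/topic shares; column names are assumptions.
docs <- data.frame(
  doc_id   = c("a", "a", "b", "b", "c", "c"),
  topic_id = c(1, 2, 1, 2, 1, 2),
  timebin  = as.Date(c("2021-03-01", "2021-03-01", "2021-03-01",
                       "2021-03-01", "2021-03-08", "2021-03-08")),
  gamma    = c(0.995, 0.005, 0.6, 0.4, 0.2, 0.8)
)

minGamma <- 0.01

# Drop document/topic pairs whose share is below the threshold.
kept <- docs[docs$gamma >= minGamma, ]

# Number of documents and of unique timebins in which each topic still occurs.
n_docs <- tapply(kept$doc_id, kept$topic_id,
                 function(d) length(unique(d)))
n_bins <- tapply(format(kept$timebin), kept$topic_id,
                 function(b) length(unique(b)))
```

Here topic 2 is dropped from document `a` because its share of 0.005 is below the threshold, so it is counted in two documents while topic 1 is counted in three.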
Timebins in which a given topic does not occur are added with an explicit value of zero, except for empty timebins before the first or after the last occurrence of that topic.
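The zero-filling rule can be illustrated with a short base R sketch. The helper `fill_topic` and the column name `share` are hypothetical and weekly bins are assumed; this shows the rule, not the package's own code.

```r
# Hypothetical per-topic summary with gaps; column names are assumptions.
obs <- data.frame(
  topic_id = c(1, 1, 2),
  timebin  = as.Date(c("2021-03-01", "2021-03-22", "2021-03-08")),
  share    = c(0.6, 0.4, 1.0)
)

# For one topic, build all weekly bins between its first and last
# occurrence and set the share of unobserved bins to zero.
fill_topic <- function(d) {
  bins <- seq(min(d$timebin), max(d$timebin), by = "week")
  out <- data.frame(topic_id = d$topic_id[1], timebin = bins, share = 0)
  out$share[match(d$timebin, bins)] <- d$share
  out
}

filled <- do.call(rbind, lapply(split(obs, obs$topic_id), fill_topic))
rownames(filled) <- NULL
```

Topic 1 gains two zero rows for the weeks between its first and last occurrence, while topic 2 keeps its single row because no bins are added outside a topic's own range.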