Introduction to dfrtopics

Andrew Goldstone

2016-07-23

This package seeks to provide some help creating and exploring topic models using MALLET from R. It builds on the mallet package. Parts of this package are specialized for working with the metadata and pre-aggregated text data supplied by JSTOR’s Data for Research service; the topic-modeling parts are independent of this, however.

This vignette will explain how to use the functions here to:

  1. Load and prepare text data for modeling
  2. Train a topic model using MALLET
  3. Save and load the results of a model
  4. Explore modeling results

This package works best in conjunction with dplyr and ggplot2, though the latter is not a formal requirement of this package since only the handful of visualization functions need it. I load these, as well as the very useful utility packages stringr and lubridate, here:

options(java.parameters="-Xmx2g")   # optional, but more memory for Java helps
library("dfrtopics")
library("dplyr")
library("ggplot2")
library("lubridate")
library("stringr")

Loading and preparing text data

DfR data has two components: document metadata and word counts. (Usually the “words” are indeed single words, but they could be bigrams or trigrams.) Other document data for “non-consumptive use” comes in similar forms. Such data cannot be “read” but it can be modeled and analyzed, and topic models are a good tool for such an analysis.

I am going to walk through an example using a more-or-less arbitrarily chosen data set: I have downloaded all items classified as full-length articles appearing in PMLA or Modern Philology between 1905 and 1915. Let us construct a model of this corpus. To follow along with this vignette, you will have to obtain this sample data, since JSTOR’s terms of use do not allow me to redistribute it (though these journal issues are themselves in the public domain). It can be freely downloaded by signing up for an account on Data for Research, and then using this link to make a Dataset Request for wordcounts and metadata for these items in CSV format. So, although this is the location of these data on my system:

data_dir <- file.path(path.package("dfrtopics"), "test-data",
                      "pmla-modphil1905-1915")

if you are following along, set data_dir to whatever directory you have unzipped the JSTOR data to.

First we load metadata: it won’t be used in “vanilla” LDA modeling, but it is useful to have at this stage in case we want to filter the corpus.

metadata_file <- file.path(data_dir, "citations.tsv")
meta <- read_dfr_metadata(metadata_file)

read_dfr_metadata also accepts a vector of filenames, if you are working on multiple DfR downloads with their separate metadata files.

The word counts can be loaded into memory all at once with read_wordcounts, which takes a vector of file names.

counts <- read_wordcounts(list.files(file.path(data_dir, "wordcounts"),
                                     full.names=TRUE))

This will take some time for large numbers of files, and is of course limited by the amount of available memory on your machine. It displays a progress bar as it runs. For the 662 documents here, the resulting data frame has 1270423 rows.
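For orientation, the long format that the rest of this section manipulates can be mimicked with a toy data frame. This is only a sketch: the column names id, word, and weight are the ones used in the code that follows, but the two documents and their counts are invented for illustration.

```r
# Toy stand-in for the result of read_wordcounts(): one row per
# (document, word) pair. The ids and counts are invented.
toy_counts <- data.frame(
    id = c("doc1", "doc1", "doc1", "doc2", "doc2"),
    word = c("the", "poem", "stanza", "the", "ballad"),
    weight = c(12L, 3L, 1L, 7L, 2L),
    stringsAsFactors = FALSE
)

# Document lengths are the sums of weights within each id
doc_lengths <- tapply(toy_counts$weight, toy_counts$id, sum)
doc_lengths
```

Every filtering step below amounts to dropping rows from a data frame of this shape.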

Tailoring the corpus

The counts are not quite ready to be passed on to MALLET. It’s important to be able to modify the representation of the documents we pass on to modeling. Here are a few things we might want to do:

  1. Filter documents. Here we use the metadata. Let us say we decide to ignore the year 1905 and start with 1906 instead.

    counts <- semi_join(counts,
            meta %>%
                select(id, pubdate) %>%
                filter(year(pubdate) != 1905),
        by="id")

    Now we have worked our way down to 607 documents.

  2. Filter documents by length. LDA sometimes performs poorly if the documents are not of roughly uniform length. JSTOR also sometimes classifies very short items as “articles.” This is the stage at which to remove them.

    Let us say we wish to keep only documents more than 300 words long:

    counts <- counts %>%
        group_by(id) %>%
        filter(sum(weight) > 300)

    Only a couple of documents are removed this way.

  3. Filter stopwords. MALLET can also do stopword filtering (pass the filename of a stoplist to make_instances), but sometimes we want to assess, for example, how many tokens we are throwing out with a given stoplist. I have included a copy of MALLET’s default English stoplist in this package.

    Here’s how we might tabulate how many words stoplisting will remove from each document:

    stoplist_file <- file.path(path.package("dfrtopics"), "stoplist",
                               "stoplist.txt")
    stoplist <- readLines(stoplist_file)
    counts %>%
        group_by(id) %>%
        summarize(total=sum(weight),
                  stopped=sum(weight[word %in% stoplist]))
    ## # A tibble: 605 x 3
    ##                 id total stopped
    ##              <chr> <int>   <int>
    ## 1  10.2307/3693731 10073    3028
    ## 2  10.2307/3693732  6855    3798
    ## 3  10.2307/3693733  4791    2554
    ## 4  10.2307/3693734  5029    2694
    ## 5  10.2307/3693735  7360    3765
    ## 6  10.2307/3693736  2080     918
    ## 7  10.2307/3693737  1097     547
    ## 8  10.2307/3693738  7248    4213
    ## 9  10.2307/3693739   739     399
    ## 10  10.2307/432387  5692    3132
    ## # ... with 595 more rows

    As always, Zipf’s law is rather remarkable. In any case, we can remove stopwords now with a simple filter or (equivalently) wordcounts_remove_stopwords.

    counts <- counts %>% wordcounts_remove_stopwords(stoplist)

  4. Filter infrequent words. OCR’d text in particular is littered with hapax legomena. The long tail of one-off features means a lot of noise for the modeling process, and you’ll likely want to get rid of these.

    For example, to eliminate all but roughly the 20,000 most frequent features:[1]

    counts <- counts %>%
        wordcounts_remove_rare(20000)

    You should probably do this after stopword removal if you want the rank threshold to correspond (more or less) to the number of features you retain.

    You could also eliminate features with low total frequency:

    counts <- counts %>%
        group_by(word) %>%
        filter(sum(weight) > 3)

    This is a no-op in this case: such features were all ranked below 20000.

  5. DfR wordcounts are already case-folded (all lowercase), but you may wish to transform the features further, by, for example, stripping accents, normalizing orthography, or stemming. Applying these transformations to counts is straightforward with dplyr, and no functions are provided for this. The stringi package has powerful casefolding and Unicode normalization features which I often use in this context.[2]
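As a sketch of the kind of transformation the last item has in mind, the following base-R example normalizes a few early-modern spellings and then re-sums the counts, since two formerly distinct features can collapse into one. The lookup table and the toy counts are invented for illustration; the same steps could equally be written with dplyr's mutate, group_by, and summarize.

```r
# Toy counts in the same (id, word, weight) layout; values are invented
toy_counts <- data.frame(
    id = c("doc1", "doc1", "doc1", "doc2", "doc2"),
    word = c("loue", "love", "haue", "loue", "noble"),
    weight = c(4L, 1L, 2L, 3L, 5L),
    stringsAsFactors = FALSE
)

# Hypothetical spelling map: names are variants, values are normal forms
spelling <- c(loue = "love", haue = "have")

# Replace any word found in the map; leave the others alone
hit <- toy_counts$word %in% names(spelling)
toy_counts$word[hit] <- unname(spelling[toy_counts$word[hit]])

# Re-aggregate so that each (id, word) pair appears only once
normalized <- aggregate(weight ~ id + word, data = toy_counts, FUN = sum)
normalized
```

The re-aggregation step is the part that is easy to forget: without it, doc1 would carry two separate rows for love.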

Preparing the MALLET input data format

MALLET cannot accept our counts data frame from R as is. Instead, it requires a data frame with one row per document, which it will then tokenize once again. MALLET can also remove stop words and case-fold (make everything lowercase). We have already done stop word removal. And in the case of data from DfR, the tokenization and case-folding has been done for us. To deal with this case and create the MALLET-ready input, which is called an InstanceList, we use:

ilist <- wordcounts_instances(counts)

For more control over this process, it is also possible to do this in two stages. The conversion from counts back to (disordered, stopword-removed) texts is done by the function wordcounts_texts. Any data frame with one document per row can be turned into an InstanceList with make_instances, which can also make use of MALLET’s tokenization and stopword-removal features if you wish. wordcounts_instances just calls these two functions in succession, but its default parameters are adjusted for cases like DfR data. Even if you are operating on full texts, you may well wish to tokenize and casefold using different methods than MALLET offers, and then use wordcounts_instances. If not, you can pass full texts directly to make_instances.
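To make the idea of converting counts back to disordered texts concrete, here is a minimal base-R sketch. It is not the package's implementation of wordcounts_texts; the helper counts_to_text and the toy data are hypothetical, and the output column names are chosen only for illustration.

```r
# Toy counts; values are invented
toy_counts <- data.frame(
    id = c("doc1", "doc1", "doc2"),
    word = c("poem", "stanza", "ballad"),
    weight = c(2L, 1L, 3L),
    stringsAsFactors = FALSE
)

# Hypothetical helper: repeat each word `weight` times and join with spaces
counts_to_text <- function(words, weights) {
    paste(rep(words, times = weights), collapse = " ")
}

# One row per document, with the reconstituted "text" in a single string
by_doc <- split(toy_counts, toy_counts$id)
toy_texts <- data.frame(
    id = names(by_doc),
    text = unname(vapply(by_doc,
                         function(d) counts_to_text(d$word, d$weight),
                         character(1))),
    stringsAsFactors = FALSE
)
toy_texts
```

Word order is lost, which does not matter for a bag-of-words model like LDA.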

An InstanceList can be saved to disk with the write_instances function; I usually run the corpus-construction step separately from modeling, since I typically make many models of the same corpus. (And this lets us get all the intermediate forms of the corpus text out of memory.)

Training a topic model

Now we launch the LDA algorithm with:

m <- train_model(ilist, n_topics=40,
                 n_iters=300,
                 seed=1066,       # "reproducibility"
                 metadata=meta    # optional but handy later
                 # many more parameters...
                 )

ilist here can also be the name of a file created by write_instances (or by command-line MALLET). Though I have supplied defaults for the many parameters for the modeling process, I have no idea of the kinds of corpora for which those defaults are sensible. It’s important to adjust all the parameters (through experimentation if no principled method is to hand). See ?train_model.

Note, in particular, that if we want to get exactly the same model more than once, we should set MALLET’s random number seed with the seed parameter here.[3]

The result, here stored in m, is an object designed to gather together all the relevant model information. It has S3 class mallet_model. Though it is really just a list, the package provides accessor functions, and you can treat m as though it were opaque.

The metadata supplied here as a parameter to train_model is not used in modeling, and is in fact an optional parameter. However, it is convenient to store metadata alongside the model for further analysis. If passed in here, it is accessible as metadata(m). (You can also do this any time with metadata(m) <- meta.)

Saving and loading the results

Though this 605-document corpus needs only minutes to model, it often takes hours or more to produce a topic model of even a moderately-sized corpus. You are likely to want to save the results. It is most convenient, I have found, to save both the richest possible MALLET outputs and user-friendlier transformations: many analyses need only the estimated document-topic and topic-word matrices, for example. For this reason, the default write_mallet_model function takes the results of train_model and outputs a directory of files.

write_mallet_model(m, "modeling_results")

Consult ?write_mallet_model for the list of outputs. By default, and out of an overabundance of caution, this function saves quite a few files, including two big, wasteful representations of the Gibbs sampling state: MALLET’s own and a simplified CSV. These files run to gigabytes on models of even medium-sized corpora.

The resulting set of files can be used to reconstruct the model object with a single call:

m <- load_mallet_model_directory("modeling_results",
    metadata_file=metadata_file)

(To specify individual filenames, use load_mallet_model.) This approach obscures, however, a series of choices I have made about which model outputs you are likely to want to load at a time. First of all, the loading function cannot reload the MALLET model object into memory. This is a limitation of the R-MALLET bridge: the RTopicModel class has no serialization method. (The ParallelTopicModel object does have one, but normally you won’t use it.) Second of all, the loading function assumes you do not normally want to load the final Gibbs sampling state into memory. That can be done separately (see “The sampling state” below). Third of all, even the topic-word matrix is normally so large that it can pose problems to R. By default it is not loaded (rather, a sparser file listing just “top” words within each topic is loaded). But simply pass

m <- load_mallet_model_directory("modeling_results",
    load_topic_words=TRUE,
    metadata_file=metadata_file)

to get the full topic-word matrix if you wish to work with it. summary will indicate which components are present in memory.

summary(m)
## A topic model created by MALLET
## 
## Number of topics: 40
## Number of documents: 605
## Number of word types: 20592
## 
## Locally present:
## 
## MALLET model object:     no
## MALLET instances:        no
## doc-topic matrix:       yes
## top words data frame:   yes
## topic-word matrix:      yes
## vocabulary:             yes
## document ids:           yes
## hyperparameters:        yes
## sampling state:          no

Even if a component is not locally present, the package functions will infer it from other available components whenever that is possible. This somewhat cumbersome design is meant to help with the sometimes formidable task of keeping within memory limits.[4]

The components of the model are accessed as follows:

component                   accessor
document-topic matrix       doc_topics(m)
topic-word matrix           topic_words(m)
vector of word types        vocabulary(m)
vector of document ID’s     doc_ids(m)
metadata                    metadata(m)
Java model object           RTopicModel(m)
Gibbs sampling state        sampling_state(m)
estimated hyperparameters   hyperparameters(m)
modeling parameters         modeling_parameters(m)

If you have run MALLET another way but would like to use any of the data-manipulation and exploration functions here, the package can create a mallet_model object from MALLET’s sampling-state output (and its InstanceList file input). The function is load_from_mallet_state. See its documentation for more details. (Finally, if you were one of the brave few who experimented with earlier versions of this package, which produced slightly different outputs, consult ?load_mallet_model_legacy.)

Exploring model results

A good sanity check on a model is to examine the list of the words most frequently assigned to each topic. This is easily obtained from the topic-word matrix, but this is such a common operation that we have a shortcut.

top_words(m, n=10) # n is the number of words to return for each topic
## # A tibble: 400 x 3
##    topic     word weight
##    <int>    <chr>  <int>
## 1      1      two   3602
## 2      1 evidence   1779
## 3      1 original   1472
## 4      1     fact   1452
## 5      1    lines   1410
## 6      1     case   1350
## 7      1    found   1221
## 8      1     line   1086
## 9      1    given   1029
## 10     1 question    968
## # ... with 390 more rows

This data frame is in fact separately saved to disk and stored, even if the full topic-word matrix is not available. It is in essence a sparse representation of the topic-word matrix.[5]
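To show what this shortcut saves, here is a minimal base-R sketch of deriving a top-words data frame from a topic-word matrix of counts. The matrix and vocabulary are invented, and toy_top_words is a hypothetical helper, not the package's top_words, which also supports other word-scoring schemes.

```r
# Toy topic-word matrix of assignment counts: 2 topics by 5 word types
vocab <- c("poem", "stanza", "king", "court", "letter")
tw <- matrix(c(30, 12,  1,  0, 2,
                2,  0, 25, 18, 9),
             nrow = 2, byrow = TRUE,
             dimnames = list(NULL, vocab))

# For each topic (row), keep the n heaviest words, in long format
toy_top_words <- function(tw, n) {
    rows <- lapply(seq_len(nrow(tw)), function(k) {
        j <- order(tw[k, ], decreasing = TRUE)[seq_len(n)]
        data.frame(topic = k,
                   word = colnames(tw)[j],
                   weight = unname(tw[k, j]),
                   stringsAsFactors = FALSE)
    })
    do.call(rbind, rows)
}

top2 <- toy_top_words(tw, n = 2)
top2
```

Keeping only n rows per topic is what makes the result a sparse stand-in for the full matrix.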

As even this data frame is too long to read if you have more than a few topics, a conveniently human-readable summary can be generated from

topic_labels(m, n=8)
##  [1] "1 two evidence original fact lines case found line"               
##  [2] "2 qu plus rousseau cette aux avait faire paris"                   
##  [3] "3 love man god world life death light heart"                      
##  [4] "4 edition pope english translation published notes hagedorn work" 
##  [5] "5 qe italian fo danois oit cor fu dist"                           
##  [6] "6 goethe schlegel schiller friedrich german first hebbel wilhelm" 
##  [7] "7 forms words pl form sg found french person"                     
##  [8] "8 king sir life letter england year church court"                 
##  [9] "9 ms paris cit france guillaume rimes roman poem"                 
## [10] "10 play shakespeare plays kh heywood printed chapman authorship"  
## [11] "11 ballad ballads story wife popular tale fabliau husband"        
## [12] "12 spenser poem poet stanza poems sonnets pastoral love"          
## [13] "13 pe pat cain god ms hem english ant"                            
## [14] "14 mhg ohg oe goth mlg lat mdu meaning"                           
## [15] "15 mas spanish dixo cervantes pues vn libro latin"                
## [16] "16 beowulf poem poet pearl ff story passage seems"                
## [17] "17 vnd saga nit daz sy man zi dann"                               
## [18] "18 latin songs song mediaeval poetry lyric century german"        
## [19] "19 language modern literature association american new study life"
## [20] "20 ff celtic grail story king irish fairy vss"                    
## [21] "21 grant quant iou tant dist fu estoit quil"                      
## [22] "22 first time made part seems says far later"                     
## [23] "23 italian italy french art fiction paris works work"             
## [24] "24 haue loue hath fol ms made noble good"                         
## [25] "25 action report reports drama scene dramatic act balzac"         
## [26] "26 stage play shakespeare plays scene elizabethan act hamlet"     
## [27] "27 nur man hat noch war kunst haben fiir"                         
## [28] "28 ms fol text quem alleluia sunt young versus"                   
## [29] "29 chaucer troilus medea story ff legend age boccaccio"           
## [30] "30 literary form poetry fact sense criticism new literature"      
## [31] "31 poe poetry painting lessing art poet nature idyl"              
## [32] "32 form genitive english subjunctive use latin old past"          
## [33] "33 story king two hero love death knight version"                 
## [34] "34 author piers plowman text work passage mss manly"              
## [35] "35 chaucer tale gower sins prologue wife tupper lines"            
## [36] "36 lines poems two stanza first poem form stanzas"                
## [37] "37 fox poem green fables fable version wolf flower"               
## [38] "38 english century found book romance version latin work"         
## [39] "39 dial wr dutch ll make small walk schr"                         
## [40] "40 ff play plays planctus christ scene poem liturgical"

By the same token, it is often instructive to consider documents that are most fully captured by a given topic. These are found with

dd <- top_docs(m, n=3)
head(dd)
##   topic doc    weight
## 1     1 564 0.5597620
## 2     1 279 0.5521426
## 3     1 251 0.5281459
## 4     2 585 0.9962404
## 5     2 189 0.9896540
## 6     2 371 0.9889636

The doc column here is simply the row-index of the document. To see what documents these are, we can make use of the associated metadata.[6] Here is how we would derive the three “top” documents for topic 35, labeled above as “35 chaucer tale gower sins prologue wife tupper lines”:

ids <- doc_ids(m)[dd$doc[dd$topic == 35]]
metadata(m) %>%
    filter(id %in% ids) %>%
    cite_articles()
## [1] "John Linvingston Lowes, \"Chaucer and the \"Miroir de Mariage\" (Continued),\" *Modern Philology* 8, no. 2 (October 1910): 165-186."
## [2] "Frederick Tupper, \"Chaucer and the Seven Deadly Sins,\" *PMLA* 29, no. 1 (January 1914): 93-128."                                  
## [3] "John Livingston Lowes, \"Chaucer and the Seven Deadly Sins,\" *PMLA* 30, no. 2 (January 1915): 237-371."

These titles suggest that the top words of the topic have not misled us: this is a “Chaucer” topic. (The typo in “John Linvingston Lowes” is in the original data.)
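The ranking behind a list of top documents can be sketched in a few lines of base R. This assumes the weights are each document's proportion of words assigned to the topic, which the values between 0 and 1 in the output above suggest; consult ?top_docs for the weighting actually used. The matrix is invented and toy_top_docs is a hypothetical helper.

```r
# Toy document-topic matrix of counts: 4 documents by 2 topics
dt <- matrix(c(90, 10,
               20, 80,
               50, 50,
                5, 95),
             ncol = 2, byrow = TRUE)

# Convert each row to proportions so long and short documents are comparable
dt_prop <- dt / rowSums(dt)

# Row indices of the n documents with the largest share of topic k
toy_top_docs <- function(dt_prop, k, n) {
    order(dt_prop[, k], decreasing = TRUE)[seq_len(n)]
}

best <- toy_top_docs(dt_prop, k = 2, n = 2)
best
```

The returned row indices play the same role as the doc column above: they index into the document ids and the metadata.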

Topics, time, metadata

Though the LDA algorithm run by MALLET here makes no use of the time metadata, it is often instructive to see how the modeled topics are spread over time in a corpus of JSTOR articles. For convenience, this operation is condensed into the topic_series function:

srs <- topic_series(m, breaks="years")
head(srs)
##   topic    pubdate     weight
## 1     1 1906-01-01 0.05454418
## 2     1 1907-01-01 0.02907561
## 3     1 1908-01-01 0.05912942
## 4     1 1909-01-01 0.06755607
## 5     1 1910-01-01 0.04966935
## 6     1 1911-01-01 0.07378674

This is a “long” data frame suitable for plotting, which we turn to shortly. But it is important to underline that topic_series is a special case of the more general operation of combining modeled topic scores for groups of documents. That is, one of the main uses of a topic model is to consider estimated topics as dependent variables, and metadata as independent variables.[7]

To make this more general operation a little easier, I have supplied generalized aggregator functions sum_row_groups and sum_col_groups which take a matrix and a grouping factor. As a simple example, suppose we wanted to tabulate the way topics are split up between the two journals in our corpus:

journal <- factor(metadata(m)$journaltitle)
doc_topics(m) %>%
    sum_row_groups(journal) %>%
    normalize_cols()
##                       [,1]      [,2]     [,3]      [,4]      [,5]
## Modern Philology 0.6062663 0.5224128 0.465474 0.5571148 0.4765696
## PMLA             0.3937337 0.4775872 0.534526 0.4428852 0.5234304
##                       [,6]      [,7]      [,8]      [,9]     [,10]
## Modern Philology 0.4574049 0.5817597 0.5206216 0.6226078 0.5715734
## PMLA             0.5425951 0.4182403 0.4793784 0.3773922 0.4284266
##                      [,11]     [,12]     [,13]       [,14]     [,15]
## Modern Philology 0.1864713 0.3652563 0.1991306 0.997871287 0.8480551
## PMLA             0.8135287 0.6347437 0.8008694 0.002128713 0.1519449
##                      [,16]     [,17]     [,18]     [,19]     [,20]
## Modern Philology 0.4926399 0.8917364 0.8362099 0.1580163 0.6748687
## PMLA             0.5073601 0.1082636 0.1637901 0.8419837 0.3251313
##                      [,21]     [,22]    [,23]     [,24]     [,25]
## Modern Philology 0.6724859 0.5022084 0.433678 0.2779863 0.7921445
## PMLA             0.3275141 0.4977916 0.566322 0.7220137 0.2078555
##                      [,26]     [,27]     [,28]     [,29]     [,30]
## Modern Philology 0.6667635 0.5779151 0.3618173 0.2877688 0.3889264
## PMLA             0.3332365 0.4220849 0.6381827 0.7122312 0.6110736
##                      [,31]     [,32]     [,33]     [,34]     [,35]
## Modern Philology 0.1685901 0.5311219 0.4909804 0.6989145 0.3994684
## PMLA             0.8314099 0.4688781 0.5090196 0.3010855 0.6005316
##                      [,36]     [,37]     [,38]     [,39]     [,40]
## Modern Philology 0.5580017 0.6089927 0.4441859 0.8709987 0.5632313
## PMLA             0.4419983 0.3910073 0.5558141 0.1290013 0.4367687

Here we might notice certain topics that skew to one or the other of the two journals in our corpus—for example, 11 ballad ballads story wife popular tale fabliau husband.

By the same token, if one wanted to construct super-groupings of topics, one could use sum_col_groups to aggregate them together.[8]
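The arithmetic behind these grouping operations can be sketched with base R alone. The example below is not the package's implementation: it uses rowsum and sweep on an invented matrix, and the super-group labels are hypothetical.

```r
# Toy document-topic matrix: 4 documents by 3 topics
dt <- matrix(c(8, 2, 0,
               6, 0, 4,
               1, 9, 0,
               3, 3, 4),
             ncol = 3, byrow = TRUE)
journal <- factor(c("MP", "MP", "PMLA", "PMLA"))

# Sum rows (documents) within each journal: result is journals by topics
by_journal <- rowsum(dt, journal)

# Divide each column by its total so that each topic's shares sum to 1
by_journal_share <- sweep(by_journal, 2, colSums(by_journal), "/")

# Sum columns (topics) within hypothetical super-groups of topics
supergroup <- factor(c("literary", "literary", "philological"))
by_supergroup <- t(rowsum(t(dt), supergroup))
by_supergroup
```

Grouping columns is just grouping rows of the transpose, which is why one base function serves for both directions here.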

Visualization

The complexity of a model is often easier to grasp visually than numerically. Instead of providing a comprehensive set of possible visualizations, the package tries to simplify the process of generating the sorts of data frames that can be easily plotted, especially with ggplot2. In addition to the grouping operations I have just mentioned, I have also supplied a simple function, gather_matrix, for turning a matrix into a “tidy” data frame.

Nonetheless, I have supplied a few functions that use ggplot2 to give some overviews of aspects of the model. They operate in pipelines with the functions for generating data frames. Rather than supply many parameters for tuning these visualizations, I find it makes more sense to make generating plot-ready data frames easy, and then leave the viz-fiddling to your expertise. Please use the source code of the package’s ready-made plotting functions as a starting point. None of them do anything elaborate.

To visualize (the heaviest part of) a topic-word distribution:

top_words(m, n=10) %>%
    plot_top_words(topic=3)

To place the topics in a two-dimensional space:[9]

topic_scaled_2d(m, n_words=2000) %>%
    plot_topic_scaled(labels=topic_labels(m, n=3))

Rather pleasingly, some of the spatial organization of this plot appears to be interpretable: purely “philological” topics are mostly closer together, and mutually more distant from more “literary-historical” topics.
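For readers curious how topics can be laid out in a plane at all, here is a minimal sketch of one common approach: compute a dissimilarity between each pair of topic-word distributions, then apply classical multidimensional scaling. The choice of the square root of the Jensen-Shannon divergence (which is a proper metric) is an assumption made for this example; see ?topic_scaled_2d for what the package actually computes. The matrix is invented and js_div is a hypothetical helper.

```r
# Toy topic-word matrix of counts: 3 topics by 4 word types
tw <- matrix(c(10, 5, 1, 1,
                9, 6, 1, 2,
                1, 1, 8, 9),
             nrow = 3, byrow = TRUE)

# Convert each topic to a probability distribution over words
p <- tw / rowSums(tw)

# Jensen-Shannon divergence between two distributions (0 * log 0 taken as 0)
js_div <- function(a, b) {
    kl <- function(x, y) sum(ifelse(x > 0, x * log(x / y), 0))
    mid <- (a + b) / 2
    (kl(a, mid) + kl(b, mid)) / 2
}

# Pairwise distances between topics: square root of the JS divergence
K <- nrow(p)
d <- matrix(0, K, K)
for (i in seq_len(K)) {
    for (j in seq_len(K)) {
        d[i, j] <- sqrt(js_div(p[i, ], p[j, ]))
    }
}

# Classical multidimensional scaling down to two coordinates per topic
coords <- cmdscale(as.dist(d), k = 2)
coords
```

In this toy case the first two topics share most of their vocabulary, so they land close together, with the third topic far from both.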

The time series mentioned above (if our time metadata is meaningful) can be visualized in a faceted plot:

theme_update(strip.text=element_text(size=7),  # optional graphics tweaking
             axis.text=element_text(size=7))
topic_series(m) %>%
    plot_series(labels=topic_labels(m, 2))

In this case the plot will not so much reveal trends as it will indicate which particular years have items highly concentrated in one topic or another. The identical vertical scales may appear an annoyance, but this is in fact a useful way of seeing that one topic captures a great deal of the corpus and is probably not particularly meaningful: 22 first time made part seems says far later.[10] We can of course filter topic_series(m) to drop this topic from our display.

The topic_report function generates a folder full of these plots for all the topics in the model.

topic_report(m, "plots")

For more detailed browsing, the package can export a model visualization using my dfr-browser. This is a JavaScript-based web-browser application which can be used to explore the model in an interactive way. The function call

dfr_browser(m)

will create the necessary files in a temporary folder and then open the dfr-browser in your web browser. To export to a non-temporary location for later viewing or further customization, pass a folder name: dfr_browser(m, "browser"). To export a browser with the data stored in a series of separate files that can be loaded asynchronously, use dfr_browser(m, "browser", internalize=F). Then, in the shell, run

cd browser
bin/server

and visit http://localhost:8888 in your web browser. This last option is best if the visualization is meant for the web.[11]

A more elaborate visualization: a word’s topic assignments

For a final, somewhat more complicated exploration, let’s visualize the allocation of a single word among various topics over time. This functionality is actually provided in the package by plot_word_topic_series, but this function is implemented on top of functions with more general uses, so walking through the implementation will help clarify what the more general functions in the package can do, in particular in conjunction with the Gibbs sampling state from the topic model.

Let us return to the model we constructed above. Let’s consider a word which appears prominently in multiple topics:

w <- "poem"

Having noted that the word poem is prominent in multiple topics, we can ask whether the model allocates it among topics uniformly over time. We can’t answer this question using the document-topic matrix or the topic-word matrix, so we turn to the Gibbs sampling state. This is not present in memory by default, so we load it into the model with

m <- load_sampling_state(m,
    simplified_state_file=file.path("modeling_results", "state.csv"))
## Loading modeling_results/state.csv to a big.matrix...
## Done.

The package uses bigmemory to handle this object, as we discover if we access it:

sampling_state(m)
## An object of class "big.matrix"
## Slot "address":
## <pointer: 0x111c64530>
dim(sampling_state(m))
## [1] 693350      4

What we now want is to examine the topic-document matrix conditional on the word poem. This is easy to do with the mwhich function from bigmemory, but as a convenience this package provides a function for this particular application (as well as one for the term-document matrix conditioned on a topic, tdm_topic):

topic_docs <- topic_docs_word(m, w)
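What this conditioning amounts to can be sketched on a toy version of the simplified sampling state. The four column names used here (doc, type, topic, count) are an assumption made for illustration, as are the values; the real state is a big.matrix rather than a data frame, and the package function does the equivalent selection for you.

```r
# Toy simplified sampling state. Each row says that `count` tokens of word
# type `type` in document `doc` were assigned to `topic`. Column names are
# assumed for this sketch.
state <- data.frame(
    doc   = c(1, 1, 2, 2, 3, 3),
    type  = c(1, 2, 1, 1, 1, 2),
    topic = c(1, 2, 1, 2, 2, 2),
    count = c(3, 1, 2, 4, 1, 5)
)
vocab <- c("poem", "stanza")
w <- "poem"

# Keep only the rows for the chosen word, then cross-tabulate topics by docs
rows_w <- state[state$type == match(w, vocab), ]
topic_docs_w <- xtabs(count ~ topic + doc, data = rows_w)
topic_docs_w
```

The result has one row per topic and one column per document, with entries counting only the occurrences of the chosen word.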

The next step is to aggregate counts from documents in the same year. To do this we need a factor indicating which documents belong to the same year:

doc_years <- metadata(m)$pubdate %>%
    cut.Date(breaks="years")

Now we can aggregate columns of our matrix:

series <- sum_col_groups(topic_docs, doc_years)

series is a matrix in which rows are topics, columns are years, and the entries correspond to the total occurrences of poem within a topic in a year. These sums, however, are tricky to compare to one another, since the total number of words in the corpus varies from year to year. We should divide through by these yearly totals. They are most easily found from the topic-document matrix (the transpose of the result of doc_topics) with two sets of sums, first over the documents in each year and then over topics:

total_series <- t(doc_topics(m)) %>%
    sum_col_groups(doc_years) %>%
    colSums()

Now we want to divide each column of series by the corresponding element of total_series. This is a simple matrix multiplication, but because I always forget whether to multiply on the right or the left, I have supplied a function with a clearer name:

series <- series %>%
    rescale_cols(1 / total_series)
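For the record, the multiplication in question is on the right, by a diagonal matrix. The following base-R sketch checks this on invented numbers and shows that sweep gives the same result without building the diagonal matrix; it illustrates the arithmetic only and is not the package's rescale_cols.

```r
# Toy "series" matrix: 2 topics by 3 years of counts for one word
series <- matrix(c(4, 6, 2,
                   1, 3, 8),
                 nrow = 2, byrow = TRUE)
# Hypothetical total number of words in the corpus in each year
total_series <- c(100, 200, 400)

# Multiplying on the right by a diagonal matrix scales the columns...
by_matmul <- series %*% diag(1 / total_series)
# ...and sweep() does the same thing without building the diagonal matrix
by_sweep <- sweep(series, 2, 1 / total_series, "*")

all.equal(by_matmul, by_sweep)
```

Multiplying on the left by a diagonal matrix would instead scale the rows.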

Finally, the matrix series is not yet in “tidy” form for plotting: we have one row for each topic, whereas we need one row for each topic in each year. To unroll series into a long data frame, use gather_matrix:

series_frame <- series %>%
    gather_matrix(col_names=c("topic", "year", "weight"))
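The unrolling itself is simple enough to sketch in base R. This is an illustration on an invented matrix, not the package's gather_matrix; note that the year labels come out as character strings, which is why a conversion back to dates is needed before plotting.

```r
# Toy topics-by-years matrix with labeled dimensions
series <- matrix(c(0.04, 0.01, 0.03, 0.015),
                 nrow = 2,
                 dimnames = list(c("1", "2"),
                                 c("1906-01-01", "1907-01-01")))

# One row per (topic, year) cell; as.vector() reads the matrix column by
# column, so the topic label varies fastest
long <- data.frame(
    topic = rep(rownames(series), times = ncol(series)),
    year = rep(colnames(series), each = nrow(series)),
    weight = as.vector(series),
    stringsAsFactors = FALSE
)
long
```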

A good graphical representation of these proportions over time is a stacked area plot. ggplot makes this easy. But we don’t really want all topics with even one random allocation of poem on the plot. Let’s just pick the top topics overall for the word.

series_frame <- semi_join(series_frame,
    words_top_topics(m, 4) %>%
        filter(word == w),
    by="topic")

For one further refinement, we’ll add topic labels as well:

series_frame %>%
    mutate(topic=factor(topic_labels(m, 3)[topic])) %>% 
    mutate(year=as.Date(year)) %>%  # restore data type (sigh)
    ggplot(aes(year, weight, group=topic, fill=topic)) +
        geom_area() +
        labs(x="year",
             y="fraction of corpus",
             title=str_c('allocation of "', w, '" among topics'))

From this plot we can see the way poem moves among topics assigned to different poets and poems: in this sense the model “understands” the referential multiplicity of poem in this corpus.

Other package features

Not discussed here are a few parts of the package which help to make the bridge from R to some of MALLET’s other features for topic models. There is a series of functions for handling InstanceLists, and in particular for converting such a list into a term-document matrix (instance_Matrix, with a capital M because the result is a sparse Matrix object). I provide read_diagnostics and write_diagnostics methods to access MALLET’s own set of model diagnostics.

I have included some functions for MALLET’s “topic inference” functionality, where we use an already-trained model to infer topics for new or held-out documents. The core function is infer_topics, which returns a model object m whose document-topic matrix is available as doc_topics(m).

The package also contains an experimental implementation of a posterior predictive check of the model fit which may help to diagnose the quality of individual topics and of the overall model. The check is described in the help files for imi_check and mi_check. I make no guarantee that the implementation is correct (and would welcome diagnoses or verifications).

Even more experimental is a function for finding topics that are similar across models: align_topics (see the help file and references to other functions there). In order to make it possible to align models from other topic modeling packages, I supply some simple glue: the foreign_model function will “wrap” a model from the topicmodels or stm packages in an object that can be used with this package’s functions. Again I urge caution in using these functions, since I have not yet carefully validated them.


  1. “Roughly” because of ties.

  2. A 2016 paper suggests that stemming does not improve the performance of topic models: Schofield and Mimno, “Comparing Apples to Apple: The Effects of Stemmers on Topic Models.”

  3. There are two senses in which you might want your modeling to be reproducible, however: your exact outputs should be reproducible, and any substantive features of interest should probably be independent of the [pseudo]-randomness of the modeling process.

  4. A more sophisticated solution would be to allow pieces to be stored on disk and load them as needed. I find R’s functional style makes this quite hard to arrange without an exhausting proliferation of parentheses.

  5. By default this matrix contains integer counts of topic-word assignments, not probabilities of words in topics. For the purpose of finding top words this does not matter. See the help for top_words, however, for notes on different word-scoring schemes.

  6. Recall that we had metadata for more documents than we modeled, because we discarded some documents in the corpus-creation step. However, when metadata is provided for a model, the package selects and reorders rows so that rows of metadata(m) correspond to rows of doc_topics(m) and entries in doc_ids(m).

  7. A more formal way to do this, however, requires more elaborate modeling. See, in particular, the stm package for the Structural Topic Model.

  8. These are simple arithmetical operations, of course, and you may ask why we do not stick to the dplyr idiom all the way through. But converting a full document-topic or topic-word matrix to a data frame—as dplyr would require—can be a cumbersome operation. It makes more sense to stay with the matrices until the final aggregates have been created.

  9. This requires the full topic-word matrix be loaded, though you can speed up the calculation by changing the value of the n_words parameter.

  10. For algorithmic diagnostics of topic quality, try examining MALLET’s topic diagnostics. In this case, the revealing quantity is the topic’s distance from the overall corpus (by K-L divergence).

    d <- read_diagnostics(file.path("modeling_results", "diagnostics.xml"))
    which.min(d$topics$corpus_dist)
    ## [1] 22
    # in terms of standard deviations from the mean distance:
    sort(scale(d$topics$corpus_dist))[1:3]
    ## [1] -2.237928 -1.165971 -1.084532

    Topic 22 is much closer to the corpus (by K-L divergence) than the other topics. (The MALLET “coherence” measure is not useful in this case.)

  11. If you already have a copy of the dfr-browser JavaScript/HTML/CSS, you can also export only data files, using the export_browser_data function. Note that dfr_browser sets internalize=T by default, whereas export_browser_data sets internalize=F.