
dfr-browser

Take a MALLET to disciplinary history

Explore the demo: a model of PMLA
Download this project as a .zip file Download this project as a tar.gz file

Thanks to the easy availability of counts of words occurring in JSTOR articles, humanists have grown interested in seeing what they can learn about those articles in the aggregate. How can we analyze this digital archive to learn about large-scale patterns in scholarly fields over the last century or more? Recently, one algorithmic approach to investigating patterns in a big set of documents has received quite a bit of attention in the humanities and social sciences: topic modeling with Latent Dirichlet Allocation (see this overview article from the sociology journal Poetics, this explanation for “English majors,” the original paper about the algorithm, or the essay I co-wrote with Ted Underwood, The Quiet Transformations of Literary Studies: What Thirteen Thousand Scholars Could Tell Us).

So: take data from DfR—or another source of texts—wrangle it a little, feed it to MALLET, and poof! the algorithm yields a topic model. For example, a model of all the articles from the century-long run of PMLA. So what is that?

Try the demo →

I embarked on this little visualization side project to help me understand such models a little more clearly. The purpose of my DfR browser is to let users see the ways that a topic model attempts to classify words and documents through patterns of co-occurring words. My wish, as a literary scholar, was to interpret the classification of documents—and time-slices of corpora—into topics. But I am wary of the complexity of the overlapping, fuzzy classifications. This visualizer tries to put all the moving pieces of a model into the visual field: topics, documents, words, and (key to interpretation, yet not factored into a simple LDA model) document metadata.

What this browser shows

This browser, which I am making public in an alpha release, seeks to reflect the multifaceted nature of a topic model (it is a model of patterns in both words and documents) by offering you multiple views of the model. Navigate among the views by using the navigation links at the top of the screen.

The overview

In the overview, topics are represented by their most frequent words. There are four different versions of the overview. In any of them, click a topic to go to the view for that topic.

The grid subview

In the grid view, the topics are arranged in arbitrary order in a grid. Topics are represented by what I like to call Little Circles With Words in Them. The type size reflects a word's weight in a topic. Originally I wanted the circles to be word clouds, but the otherwise impressive d3-cloud is really not meant to squeeze few words into a very small space (it's much better at a bunch of words in a larger space). So I have opted instead for a simple column of words, with sizes decreasing from the center. Known issue: No effort is made to respect the horizontal bounds of the circle; the number of words is fixed by a parameter, which must be hand-tuned.

The scaled subview

In the scaled view, topics are placed close to one another if they are similar (in the sense of: having similar distributions of frequent words). To see overlapping topics more clearly, use the mouse to pan or zoom the view. The measure of topic similarity for the scaled view has to be precalculated. In the demo I use, following David Mimno, multi-dimensional scaling—i.e., principal coordinates analysis—to reduce the matrix of Jensen-Shannon divergences between topics, considered as distributions over words, into two-dimensional coordinates. The resulting spatial representation does not always produce the most intuitive juxtapositions, but it is sometimes suggestive. I have not done more fine-tuning for this alpha release.
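As a rough illustration of the quantity being scaled, here is a minimal Python sketch of the Jensen-Shannon divergence between topics treated as distributions over words. The function names and the toy three-word vocabulary are invented for this example, and the base-2 logarithm is an assumption made so that values fall between 0 and 1. The sketch stops at the divergence matrix; it is not the code that produced the demo's precalculated coordinates.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: the mean KL divergence of p and q from their midpoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return (kl_divergence(p, m) + kl_divergence(q, m)) / 2


def divergence_matrix(topics):
    """Symmetric matrix of pairwise JS divergences between topic-word distributions."""
    return [[js_divergence(p, q) for q in topics] for p in topics]


# Toy topics over a hypothetical three-word vocabulary (each row sums to 1).
toy_topics = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.0, 0.1, 0.9],
]
matrix = divergence_matrix(toy_topics)
```

A matrix like this can then be handed to any multidimensional scaling routine to obtain two coordinates per topic; in the toy data the first two topics would land near each other and far from the third.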

The list subview

In the list view, the topics are listed in table form. To help compare topics with one another, the table can be sorted in several ways. By default topics come in an arbitrary order. Click the "top words" header to see the list sorted alphabetically by most frequent words. On the far right is the proportion of all the words in the corpus assigned to the given topic, visualized as a length (the blue bar). Sorting by corpus proportion can be deceptive, since the highest-proportion topics are often the least interesting parts of the model: agglomerations of very common words without a clear thematic content.

The column to the left of the topic key words shows a miniature graph of the topic time-series (visible more fully in the topic view; see below). This gives a rough sense of the distribution of the topic over document years. Click the column header to sort by the year in which the topic attains its maximum proportion. Note that the y-axes of these miniature bar charts are not all on the same scale.

The stacked subview

One wants to have a sense of the shifting composition of the corpus in terms of topics over time. Seeing this as a whole is a challenging visualization problem, and I lack the graphical sense to solve it well. However, d3 offers some options which I am experimenting with. In the stacked or conditional overview, all the time series for the topics are stacked on top of one another in a streamgraph (see the d3 documentation). The height of each topic's stream gives the proportion of the topic in a given time interval. The topics are reordered by a heuristic for making the visualization less jagged. Colors are used to distinguish topics from one another, but in this alpha release I have given up on trying to come up with a unique color for each topic. The program attempts to stick topic labels in reasonable places, with uneven success. Nonetheless, this view does allow a chance to pick out islands of prominence or topics with unusual spreading over the time space. To take exploration a little further, it is possible to pan and zoom the view (hold down shift while dragging).

By default the stacked subview stacks up, for each topic, the time series of the percentage of each year's words accounted for by that topic. If the corpus has many more words in some years than others, this might be deceptive. So I have left in the option of shifting the view to display raw counts of words assigned to topics per time unit. In the demo here, one immediately notices the oddity of 1952, when PMLA published some extraordinarily long essays.
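The difference between the two displays can be sketched in a few lines of Python. The input is a made-up list of (year, topic, word count) triples and the function name is hypothetical; this illustrates the arithmetic only and is not the browser's own code.

```python
from collections import defaultdict


def yearly_series(assignments):
    """Return (raw, share): per-year word counts by topic, and the same
    counts as fractions of each year's total."""
    raw = defaultdict(lambda: defaultdict(int))
    for year, topic, count in assignments:
        raw[year][topic] += count
    share = {}
    for year, by_topic in raw.items():
        total = sum(by_topic.values())
        share[year] = {topic: n / total for topic, n in by_topic.items()}
    # Convert the nested defaultdicts to plain dicts before returning.
    return {year: dict(by_topic) for year, by_topic in raw.items()}, share


# Invented numbers: the second year has four times as many words as the first.
toy = [
    (1951, "A", 300), (1951, "B", 700),
    (1952, "A", 1200), (1952, "B", 2800),
]
raw, share = yearly_series(toy)
```

In the toy data each topic's share is the same in both years while the raw counts quadruple, which is the kind of difference that only the raw-count display makes visible.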

The topic view

This view (see an example from the demo, novel story narrative) gives a fuller sense of the make-up of a topic. The left column shows the topic as a distribution over words.

The choice of how to represent the relation of topics and documents is harder. I list the documents where the topic is at its largest proportion (the blue bars visualize this proportion), since these are typically in some sense more “characteristic” of the topic. But pay attention to the size of the proportions as well as the absolute weights, and remember that the document with largest proportion of topic k may nonetheless have an even larger proportion of another topic k'.
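The caveat about topic k and topic k' can be made concrete with a small Python sketch. The document names, topic labels, and proportions are all invented, and the helper functions are hypothetical rather than taken from the browser.

```python
def top_documents(doc_topics, topic, n=3):
    """Rank documents by the proportion of their words assigned to `topic`."""
    ranked = sorted(
        doc_topics.items(),
        key=lambda item: item[1].get(topic, 0.0),
        reverse=True,
    )
    return [(doc, props.get(topic, 0.0)) for doc, props in ranked[:n]]


def dominant_topic(props):
    """The topic with the largest proportion within a single document."""
    return max(props, key=props.get)


# Invented proportions for three hypothetical documents.
toy_docs = {
    "doc1": {"k": 0.40, "j": 0.55, "other": 0.05},
    "doc2": {"k": 0.10, "j": 0.20, "other": 0.70},
    "doc3": {"k": 0.05, "j": 0.90, "other": 0.05},
}
best_doc, best_share = top_documents(toy_docs, "k", n=1)[0]
```

Here doc1 heads the list for topic k with a proportion of 0.40, yet its own dominant topic is j, so being "top" for a topic does not mean the topic dominates the document.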

On the upper right is the tantalizing time-series, showing the changing proportion of words in the corpus assigned to the topic. I have chosen bars to emphasize that the model does not assume a smooth evolution in topics, and neither should we; this does make it harder to see a time trend, however. To focus on prominent documents in the topic for a given time interval, click a bar on the chart: thus, compare the overall top documents for novel story narrative with the topic's top documents in 1980. (In fact, the model assumes that knowing an article's date of publication gives no information about topics: all models are wrong, as the saying partly goes.)

The time series in the demo groups documents in one-year intervals. This interval can be changed, or another metadata variable, rather than publication date, can be substituted in the graphs of topics’ conditional distribution. For details on the configuration of the visualization, which is managed with a JSON file, see the section on conditioning on metadata in the repository homepage.

The document view

The document view (see an example from the demo, Stallybrass’s “Against Thinking”) simply represents the estimated proportions of the various topics in a given document. It also offers a link which should normally lead to the document on JSTOR itself.

Known issue: Though at present links of the form jstor.org/stable/DOI/ appear to work, I don't think this is guaranteed. Unfortunately, though DfR supplies DOIs, the dx.doi.org links for those DOIs do not resolve correctly.

The word view

The word view’s main purpose is to remind us that the topic model tends to divide occurrences of each word among multiple topics (an example from the demo: world). It displays the top words in each of the topics in which a word is highly ranked; the bars indicate the relative weight within the topic of the words, so that you can see how the focal word compares with the other key words in each topic. Click a word to focus the visualization on that word. Use the text box at the upper right to look up another word in the model.
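A minimal Python sketch of the lookup behind such a view follows. It assumes each topic is available as a mapping from its key words to weights, and it reads "relative weight" as a word's weight divided by the weight of the topic's top word; both the data layout and that reading are assumptions made for illustration, and the numbers are invented.

```python
def topics_for_word(topic_words, word):
    """For each topic whose key words include `word`, return the word's rank
    and its weight relative to that topic's top word."""
    found = {}
    for topic, weighted in topic_words.items():
        if not weighted:
            continue
        ranked = sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)
        top_weight = ranked[0][1]
        for rank, (candidate, weight) in enumerate(ranked, start=1):
            if candidate == word:
                found[topic] = (rank, weight / top_weight)
    return found


# Invented key-word weights for three hypothetical topics.
toy_topic_words = {
    "t1": {"world": 50, "war": 100, "history": 25},
    "t2": {"world": 80, "nature": 40},
    "t3": {"novel": 90, "story": 60},
}
result = topics_for_word(toy_topic_words, "world")
```

In this toy model "world" is split between two topics: it is the second-ranked word in t1, at half the weight of the top word, and the top word in t2.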

Again, space limitations mean the browser cannot make use of the full information the model provides about which words in each document have been assigned to which topics.

The word index view

This is a list of the vocabulary of prominent words in any topic (demo). Each word links back to the word view. It is not identical with the vocabulary of the topic model, since any word which is not among the top “key words” supplied to the browser will not be listed. In future I may investigate the feasibility of using the whole topic-word matrix and not just the “key words”: this matrix could be manageable for relatively restricted vocabularies.

The bibliography view

This view focuses on documents (demo). It lists citations for all the documents included in the model. Because this is typically a lot of documents, they are sorted under headings (letters of the alphabet, decades, years, or journals), and you can jump around the headings using the floating block of links on the left side of the page. If you’d like to search on the page, use the browser's own Find function. The menu gives a few sorting options, of which year/author and year/journal contents are probably most useful.

Known issues: the sorting seems to be glitchy for metadata about nineteenth-century articles; the generated citations are imperfectly formatted. Also, this view can take a couple of seconds to load.

Settings

This very minimal dialog box lets you adjust the number of words and documents listed with each topic. There is also a setting to reveal hidden topics (if any).

Permalinks

Thanks to this post by Elijah Meeks, I learned how to give each view a permalink, which you can copy from your browser’s URL bar. That link takes you directly to the topic, word, document, etc. view you are looking at. For example, the demo’s view of the topic arbitrarily numbered 38 (novel story narrative) is at agoldst.github.io/dfr-browser/demo/#/topic/38.

Technical note: The part of the URL after the #/ describes the chosen view into the model. For the implementation, see the refresh function in the controller source, dfb.js. This function is set as the hashchange handler.
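To show the idea of reading a view out of the URL, here is a small Python illustration. The real handler is JavaScript in dfb.js; the function below is a hypothetical stand-in that only splits the part after #/ into a view name and its parameters.

```python
def parse_view(url):
    """Split the fragment after '#/' into a view name and a list of parameters."""
    _, separator, fragment = url.partition("#/")
    parts = [part for part in fragment.split("/") if part]
    if not separator or not parts:
        # No fragment at all, or an empty one: no particular view is requested.
        return None, []
    return parts[0], parts[1:]


view, params = parse_view("agoldst.github.io/dfr-browser/demo/#/topic/38")
```

For the permalink quoted above this yields the view name "topic" and the single parameter "38"; a URL with no fragment yields no view, and what to show in that case is left to the caller.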

About the demo

The working example explores a topic model of articles from the journal PMLA. More specifically, what it shows is the results of allowing MALLET's Latent Dirichlet Allocation algorithm to categorize those articles into 64 "topics" or patterns of co-occurring words. The resulting model simultaneously gives some information about words that occur together in this corpus, about documents that are related by their shared patterns of language use, and about (approximately) the thematic make-up of the documents. The visualization also indicates trends in the proportions of a topic in all the documents with a given publication year.

This model was constructed with the help of my dfrtopics R package, which gives an interface for topic-modeling JSTOR (or similar) data with MALLET and exploring the results; for a tutorial in using the package, see my introduction to dfrtopics. There are of course many other ways to make topic models, with MALLET or other software. The data for the demonstration model consists of all PMLA items from JSTOR's Data for Research service, restricted to items categorized as “full-length articles” with more than 2000 words (this leaves 5605 articles out of the 9200 items from the years 1889–2007). All but the ten thousand most frequent words are removed, as is a fairly large set of stop words. The model has 64 topics; after experimenting with more and fewer topics, I found that this number seemed to produce a reasonable, though far from perfect, broad thematic classification.
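The filtering steps just described can be sketched in Python. This is not the dfrtopics code: the function name, the representation of documents as word lists, and the order of operations (length filter first, then stop words dropped before ranking by frequency) are all assumptions made for the sake of a small runnable example with invented data.

```python
from collections import Counter


def prune_corpus(docs, min_length, vocab_size, stop_words):
    """Keep documents with more than min_length words, then keep only the
    vocab_size most frequent words that are not stop words."""
    kept = {doc_id: words for doc_id, words in docs.items() if len(words) > min_length}
    counts = Counter(
        word for words in kept.values() for word in words if word not in stop_words
    )
    vocabulary = {word for word, _ in counts.most_common(vocab_size)}
    return {
        doc_id: [word for word in words if word in vocabulary]
        for doc_id, words in kept.items()
    }


# Invented miniature corpus; the demo's thresholds were 2000 words and 10,000 word types.
toy_corpus = {
    "a": ["the", "novel", "novel", "story", "the"],
    "b": ["the", "poem"],
    "c": ["the", "novel", "poem", "verse", "poem", "the"],
}
pruned = prune_corpus(toy_corpus, min_length=2, vocab_size=2, stop_words={"the"})
```

With these toy thresholds the short document "b" is dropped, and only "novel" and "poem" survive as vocabulary.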

The data files used in the demo can be downloaded from this site if you wish to look at how they are formatted: info.json, meta.csv.zip, tw.json, dt.json.zip, topic_scaled.csv.

For more on interpreting topic models of literary scholarship, see my NLH essay with Ted Underwood (preprint); dfr-browser is used for our accompanying interactive site.

Browsing your own models

To build your own browser using this code, grab the source on github, drop the necessary data files in the data subdirectory, and launch a local webserver. The github repository homepage includes more details on how to create those data files and run dfr-browser, as well as some pointers on how to tune the visualization parameters and a couple of notes on adapting the source code to other uses. My dfrtopics R package offers a convenient way to create dfr-browsers of topic models, as well as some more flexible ways to explore modeling results.

dfr-browser in use

I hope others might use or transform this software for their own purposes. If you do produce a site with dfr-browser that you'd like to share, please send me a note, and I'll add a link to it here.

I have already mentioned the site I made with Ted Underwood, Quiet Transformations: A Topic Model of Literary Studies Journals.

This browser was also adapted for my work with Signs on their fortieth anniversary special, Signs@40, in An Interactive Topic Model of Signs (made in collaboration with Susana Galán, C. Laura Lovin, Andrew Mazzaschi, and Lindsey Whitmore).

Jonathan Goodwin has adapted dfr-browser to produce a topic-browser of fiction in HathiTrust from 1920–1922 as well as a browser of topics in Modernism/modernity.

License

This software is free to use, copy, modify, and distribute under the MIT license (which is to say, please credit me if you do use it). Though it is woefully under-documented, and no doubt bug-ridden and amateurish—which is why this is still an alpha release—perhaps it may be useful to others with similar interests. Please clone the repository and go to town.

As this is an alpha release, I have not tested this browser beyond my own system, where it appears to work in Firefox, Chrome, and Safari. I have relied on d3 and bootstrap, so in principle this should work in recent web browsers on most systems, with the likely exception of Internet Explorer. Known issue: on touch devices, little depends on hover states, but my design is not fully responsive, and I have found problems on small screens. Further development for the iMallet will have to await the work of some other, braver person.

The polished options

I looked into other topic-model visualization projects and was impressed by what I found. But I needed a setup which was (1) entirely static, since I didn’t have access to a host for dynamic code and (2) tuned to the questions that looking at models of scholarly journals has raised for me. And I found working with d3.js to be such a pleasure that I wanted to complete a project in it myself.

Hence this alpha release of something tuned to my own interests and my work on JSTOR’s Data for Research data. If you are looking for a more polished general-use model-browsing project, here are some of the options in a burgeoning field:

LDAvis, by Carson Sievert, uses the Shiny R server to support statistically-sophisticated visualizations of both individual topics and the interrelations among topics.

The Networked Corpus is a beautiful way to visualize a topic model of a full-text corpus.

David Mimno’s jsLDA computes the topic model in the browser as well as visualizing it in interesting ways.

Jonathan Goodwin’s journal topic-model browsers are very elegant: see e.g. this one, of literary theory journals.

Termite, implemented by Jason Chuang and Ashley Jin (paper by Chuang, Manning, and Heer), shows topics and words in visualizations geared toward the crucial task of assessing the quality of a model.

stmBrowser, by Michael Freeman et al., generates interactive visualizations of models produced by the Structural Topic Model of Roberts, Stuart, and Tingley.

Allison Chaney’s Topic Model Visualization Engine is a robust static-site generator.

The Topic Modeling Tool provides a GUI to make MALLET use easier.

Version history

Recent versions of dfr-browser can be downloaded from the repository releases page. The download links at the top of this page are for the most up-to-date code, which may be between released versions. I make no promises that any version will work without problems, but I welcome bug reports.

June 8, 2016. v0.8a. Any metadata variable may be used for conditioning, not just year of publication.

April 7, 2016. v0.7. Further factoring of bibliography and metadata code, to make it easier to adapt this to non-JSTOR sources.

April 5, 2016. v0.6.1. Somewhat more fluidity for varying window/screen sizes, plus topic hand-labels. (Not separately released.)

(September 24, 2015. No code changes, but this page updated.)

September 23, 2014. v0.5.1. Some refactoring of bibliography code.

June 30, 2014. v0.5. A redesigned bibliography view. Time-zone-related bug in dates corrected. Working JSTOR article links.

June 3, 2014. v0.4.2. Bug fixes and a few extra configuration settings (model_view.cols and default_view).

May 28, 2014. v0.4.1. Bi-threading.

April 15, 2014. Second alpha release, with additional data views.

October 29, 2013. First alpha release on github pages.