Topic Galaxy Viewer

The Topic Galaxy Viewer continues some of our previous work on visualizing topic models. Rather than visualizing documents through a topic model, as a previous visualization attempted, it visualizes the sheer volume of data embedded in the topic model itself.

The galaxy is a two-dimensional representation of the n-dimensional probability vector space that the topic model represents. This reduction is computed by combining a pairwise topic-distance calculation with the d3 force-directed physics engine.
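To make the reduction step concrete, here is a minimal sketch of a force-style layout in Python. It is not the viewer's code and does not use d3: the function name `force_layout`, its parameters, and the update rule (nudging each pair of points toward a target distance) are assumptions chosen for illustration, loosely in the spirit of a force-directed link constraint.

```python
import math


def force_layout(target, iterations=1000, step=0.1):
    """Place n points in 2D so pairwise distances approach target[i][j].

    Hypothetical stand-in for a force-directed engine, not the d3 API.
    """
    n = len(target)
    # Start on a unit circle so no two points share a position.
    pos = [
        [math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
        for i in range(n)
    ]
    for _ in range(iterations):
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                dist = math.hypot(dx, dy) or 1e-9
                # Positive when the pair is too far apart, negative when too close.
                error = (dist - target[i][j]) / dist
                shift_x = 0.5 * step * error * dx
                shift_y = 0.5 * step * error * dy
                # Move both points by the same amount in opposite directions.
                pos[i][0] += shift_x
                pos[i][1] += shift_y
                pos[j][0] -= shift_x
                pos[j][1] -= shift_y
    return pos


# Three topics that should all sit one unit apart.
target = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
positions = force_layout(target)
```

When the target distances cannot all be satisfied in two dimensions, a layout like this settles on a compromise, so the resulting picture can only approximate the original distances.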

The galaxy visualizes almost everything buried in the standard output of the Mallet topic modeling engine. The distance between the centers of two bubbles roughly represents how similar those topics are to each other. The size of each bubble represents the prevalence of that topic in the entire corpus. The color of each bubble represents the internal entropy of the topic itself. Darker bubbles are more chaotic and sit closer to the galactic, and mathematical, center of the galaxy, which is represented by the single white bubble. Clicking on a bubble produces a list of the Mallet-assigned keys for that topic, along with a small bar depicting the relative importance of each key to that specific topic. Each topic is assigned a name that defaults to the first key in that topic, but the user can change it.
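As a rough illustration of the two quantities behind bubble color and size, the sketch below computes an entropy and a prevalence in Python. It assumes that "internal entropy" means the Shannon entropy of a topic's word distribution and that prevalence is a topic's share of all tokens in the corpus. Both are assumptions, and the function names and input shapes are hypothetical rather than taken from the viewer or from Mallet.

```python
import math


def topic_entropy(word_weights):
    """Shannon entropy (in bits) of one topic's word weights.

    word_weights maps each word to a non-negative weight (assumed input shape).
    """
    total = sum(word_weights.values())
    entropy = 0.0
    for weight in word_weights.values():
        if weight > 0:
            p = weight / total
            entropy -= p * math.log2(p)
    return entropy


def topic_prevalence(topic_token_counts):
    """Share of corpus tokens assigned to each topic (assumed definition)."""
    total = sum(topic_token_counts)
    return [count / total for count in topic_token_counts]


flat = topic_entropy({"a": 1, "b": 1, "c": 1, "d": 1})  # evenly spread weights
peaked = topic_entropy({"a": 5})                         # all weight on one word
shares = topic_prevalence([30, 10])
```

Under this reading, a topic whose weight is spread evenly across many words has high entropy (a more chaotic, darker bubble), while a topic dominated by a few words has low entropy.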

The visualization is also designed to work with a collection of plugins that can provide additional information about a topic. Pinning a topic moves it to the expandable bottom bar, where it can be edited and further visualized. We built two plugins to complement the main galaxy viewer. The first is a document viewer that displays the topic makeup of individual documents. The second is a simple token visualizer that uses the metadata assigned to each document to graph the raw counts of each instance of a topic over time. Other plugins, such as word clouds, have been suggested but not implemented.
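The counting behind a token visualizer of this kind could look roughly like the Python sketch below. The document fields `year` and `tokens`, and the function name, are hypothetical. The sketch assumes each document carries a date in its metadata and that an instance of a topic is counted as an occurrence of one of that topic's keys.

```python
from collections import Counter


def keyword_counts_by_year(documents, keywords):
    """Count occurrences of a topic's keys per year (assumed metadata field)."""
    keyword_set = set(keywords)
    counts = Counter()
    for doc in documents:
        hits = sum(1 for token in doc["tokens"] if token in keyword_set)
        counts[doc["year"]] += hits
    # Sort by year so the result can be graphed left to right.
    return dict(sorted(counts.items()))


documents = [
    {"year": 1991, "tokens": ["land"]},
    {"year": 1990, "tokens": ["whale", "ship", "sea"]},
    {"year": 1990, "tokens": ["whale"]},
]
series = keyword_counts_by_year(documents, ["whale", "sea"])
```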

Unfortunately, the galaxy viewer does have some downsides. The formula used to compute the distance between two topics requires a double sum, which scales poorly with the dimension of the topic. Without aggressively pruning low-frequency words from each topic, reducing most topics from as many as twenty thousand words to as few as two thousand, the algorithm can take days to complete. For this reason the viewer in its current form cannot accept user-uploaded data, even though the design and algorithm are intended to accept arbitrary Mallet output. For the same reason, the token viewer itself suffers. In its current form it only represents the keywords of a topic, which frequently account for less than ten percent of the topic itself. Additionally, the distance between each pair of bubbles is itself an approximation, because we are trying to squeeze several thousand dimensions into a two-dimensional picture. Misrepresentations are to be expected.
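To show why pruning matters, the sketch below prunes a topic to its heaviest words and estimates the size of a double sum. The distance formula itself is not given here, so this assumes only that the double sum ranges over every pair of words kept in a topic, which would make the work per topic pair grow with the square of the word count. The function names and the renormalization step are hypothetical.

```python
def prune_topic(word_weights, keep=2000):
    """Keep the `keep` heaviest words and renormalize their weights to sum to 1."""
    top = sorted(word_weights.items(), key=lambda item: item[1], reverse=True)[:keep]
    total = sum(weight for _, weight in top)
    return {word: weight / total for word, weight in top}


def double_sum_terms(word_count):
    """Number of terms in a double sum over word_count words (assumed form)."""
    return word_count * word_count


pruned = prune_topic({"a": 6, "b": 3, "c": 1}, keep=2)
speedup = double_sum_terms(20000) // double_sum_terms(2000)
```

Under that assumption, cutting a topic from twenty thousand words to two thousand reduces the terms in each double sum from 400,000,000 to 4,000,000, a hundredfold saving for every pair of topics compared.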

Project Lead: Geoffrey Rockwell

Design: John Montague

Programming: Ryan Chartier