| ScatterPlot creates a scatter plot of terms, spaced according to their variation from one another. Once you arrive at ScatterPlot, insert or upload your content and let the tool perform its analysis. You can hover over the plotted points and click on them for more information. |
| Documentation | Attributes | User Supplied Tags |
|---|---|---|
| Documentation: http://hermeneuti.ca/voyeur/tools/ScatterPlot; Author(s): Stéfan Sinclair and Geoffrey Rockwell | | 2010s, Canadian (language) |
Overview
The ScatterPlot text analysis tool is part of the web-based Voyant toolset developed by Stéfan Sinclair and Geoffrey Rockwell. ScatterPlot uses multivariate analysis to create a graph of terms, spaced by their variation from one another. The default view is a visualization of the top 50 terms, represented by blue circles, and all of the documents within the corpus, represented by orange diamonds. A larger circle indicates a term with higher frequency. Mouse over any blue circle to see the raw and relative frequencies of that term. Raw frequency is a simple count of the term's occurrences in the corpus; relative frequency is the raw frequency divided by the total number of words in the corpus.
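The two frequency measures can be illustrated with a minimal sketch. The function name `term_frequencies` and the simple `\w+` tokenizer are assumptions made for this example; Voyant's own tokenization may differ.

```python
import re
from collections import Counter

def term_frequencies(text):
    """Map each term to (raw count, relative frequency) within one text."""
    # Assumed tokenizer: lowercase runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    total = len(tokens)
    counts = Counter(tokens)
    # Relative frequency = raw count / total number of words.
    return {term: (count, count / total) for term, count in counts.items()}

freqs = term_frequencies("The cat sat on the mat because the mat was warm")
raw, relative = freqs["the"]
print(raw, round(relative, 3))
```

Here "the" occurs 3 times among 11 words, so its relative frequency is 3/11.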
Multivariate Analysis
ScatterPlot is a tool for detecting patterns among a large number of variables. Multivariate analysis is used to derive a Cartesian (X,Y) point for each term based on its frequency across the documents in a corpus. If there are 12 documents in my corpus, for example, then the calculation will be based on 12 frequency counts for each word. If we have 12 words across 12 documents, we end up with a 12 x 12 matrix of data elements. These values are plugged into a statistical formula to find a single pair of coordinates for each term.
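As a rough illustration of how a term-by-document matrix can be reduced to one (X,Y) pair per term, here is a small PCA sketch in plain Python. It is not Voyant's implementation: the helper names, the whitespace tokenizer, the use of raw counts, and the power-iteration eigen solver are all assumptions chosen to keep the example self-contained.

```python
import math
import random

def term_document_matrix(docs, terms):
    """One row per term, one column per document; cells are raw counts."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed tokenizer
    return [[tokens.count(term) for tokens in tokenized] for term in terms]

def dominant_eigenvector(matrix, iterations=1000, seed=0):
    """Power iteration on a symmetric PSD matrix: (eigenvalue, unit vector)."""
    rng = random.Random(seed)
    size = len(matrix)
    vec = [rng.random() + 0.1 for _ in range(size)]
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(size))
                   for i in range(size)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:  # no variance left to explain
            return 0.0, [0.0] * size
        vec, eigenvalue = [x / norm for x in product], norm
    return eigenvalue, vec

def pca_2d(rows):
    """Project each row (term) onto the first two principal components."""
    n_rows, n_cols = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n_rows for j in range(n_cols)]
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in rows]
    cov = [[sum(r[a] * r[b] for r in centered) / (n_rows - 1)
            for b in range(n_cols)] for a in range(n_cols)]
    axes = []
    for _ in range(2):
        value, vec = dominant_eigenvector(cov)
        axes.append(vec)
        # Deflate: remove the component just found so the next pass
        # converges on the following one.
        cov = [[cov[a][b] - value * vec[a] * vec[b] for b in range(n_cols)]
               for a in range(n_cols)]
    return [tuple(sum(r[j] * axis[j] for j in range(n_cols)) for axis in axes)
            for r in centered]

# Hypothetical four-document corpus.
docs = [
    "whale sea ship whale captain sea",
    "ship captain sea storm ship",
    "garden rose garden tea rose rose",
    "tea garden rose letter tea",
]
terms = ["whale", "sea", "ship", "captain", "garden", "rose", "tea"]
matrix = term_document_matrix(docs, terms)   # 7 terms x 4 documents
coords = pca_2d(matrix)
for term, (x, y) in zip(terms, coords):
    print(f"{term:8s} {x:+.3f} {y:+.3f}")
```

Each term starts with one count per document (four numbers here, twelve in the example above) and ends with a single pair of coordinates.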
Analysis Techniques
Two techniques are offered for multivariate analysis: Principal Component Analysis (PCA) and Correspondence Analysis (CA). When PCA is chosen from the Analysis drop-down menu, we see only the terms on the chart; when CA is selected, we see both terms and documents and the relationships between them. Terms appear in closer proximity to documents in which they are more frequent. You can either accept the most frequent terms that are offered by default (applying the TAPoR stop word list to eliminate common articles, conjunctions, and other less significant words) or enter your own set of terms for analysis.
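To show why Correspondence Analysis can place terms and documents on the same chart, the sketch below follows the textbook CA recipe: convert counts to proportions, compute standardized residuals from the independence model, and take the leading singular vectors. The function name, the invented count table, and the power-iteration solver are assumptions for illustration only, not a description of Voyant's code.

```python
import math
import random

def correspondence_analysis(counts, dims=2, iterations=1000, seed=0):
    """Return (row_coords, col_coords) in principal coordinates.

    counts: rows are terms, columns are documents; every row and
    column sum is assumed to be greater than zero.
    """
    n_rows, n_cols = len(counts), len(counts[0])
    total = sum(map(sum, counts))
    p = [[cell / total for cell in row] for row in counts]
    r = [sum(row) for row in p]                                       # row masses
    c = [sum(p[i][j] for i in range(n_rows)) for j in range(n_cols)]  # column masses
    # Standardized residuals: departure from independence, scaled by masses.
    s = [[(p[i][j] - r[i] * c[j]) / math.sqrt(r[i] * c[j])
          for j in range(n_cols)] for i in range(n_rows)]
    # Right singular vectors of s are eigenvectors of s^T s.
    sts = [[sum(s[i][a] * s[i][b] for i in range(n_rows))
            for b in range(n_cols)] for a in range(n_cols)]
    rng = random.Random(seed)
    row_coords = [[] for _ in range(n_rows)]
    col_coords = [[] for _ in range(n_cols)]
    for _axis in range(dims):
        vec = [rng.random() + 0.1 for _ in range(n_cols)]
        eigenvalue = 0.0
        for _step in range(iterations):  # power iteration
            prod = [sum(sts[a][b] * vec[b] for b in range(n_cols))
                    for a in range(n_cols)]
            norm = math.sqrt(sum(x * x for x in prod))
            if norm == 0.0:
                vec, eigenvalue = [0.0] * n_cols, 0.0
                break
            vec, eigenvalue = [x / norm for x in prod], norm
        sigma = math.sqrt(eigenvalue)  # singular value of s
        for i in range(n_rows):        # F = D_r^(-1/2) S v
            sv = sum(s[i][j] * vec[j] for j in range(n_cols))
            row_coords[i].append(sv / math.sqrt(r[i]))
        for j in range(n_cols):        # G = D_c^(-1/2) v sigma
            col_coords[j].append(vec[j] * sigma / math.sqrt(c[j]))
        # Deflate so the next pass finds the following axis.
        sts = [[sts[a][b] - eigenvalue * vec[a] * vec[b]
                for b in range(n_cols)] for a in range(n_cols)]
    return row_coords, col_coords

# Invented counts: 5 terms (rows) across 3 documents (columns).
terms = ["whale", "ship", "rose", "garden", "tea"]
counts = [
    [8, 1, 0],
    [6, 2, 1],
    [0, 1, 7],
    [1, 2, 6],
    [2, 6, 2],
]
term_xy, doc_xy = correspondence_analysis(counts)
for term, (x, y) in zip(terms, term_xy):
    print(f"term {term:7s} {x:+.3f} {y:+.3f}")
for index, (x, y) in enumerate(doc_xy):
    print(f"doc  {index:<7d} {x:+.3f} {y:+.3f}")
```

Because rows and columns are treated symmetrically, the same decomposition yields coordinates for both terms and documents, which PCA on the term rows alone does not provide.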
Large Scale Patterns
You can choose to visualize between 10 and 100 terms simultaneously by using the Terms drop-down menu. Unlike some of the other tools in the Voyant set, ScatterPlot is most useful when evaluating a large number of terms simultaneously. It can reveal large-scale data patterns that would otherwise be difficult to detect. Scatter plots can reveal unusual features in data sets, such as clusters, gaps, and outliers. The Clusters option provides an automated clustering of similar terms. Each cluster is drawn in a different colour, with the term closest to the middle of each cluster represented as a triangle. This is a particularly useful visual guide in large data sets where it can be difficult to differentiate points that are tightly packed together.
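The clustering algorithm behind the Clusters option is not named here, so the following sketch uses plain k-means as a stand-in to illustrate the idea: group nearby points, then pick the point nearest each cluster's centre (the one that would be drawn as a triangle). The function names and sample coordinates are hypothetical.

```python
import math

def kmeans(points, k, iterations=100):
    """Plain k-means on 2D points; returns (labels, centroids)."""
    # Deterministic start: k points spread evenly through the input order.
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(axis) / len(members)
                                     for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # leave an empty cluster in place
        if updated == centroids:  # converged
            break
        centroids = updated
    return labels, centroids

def cluster_representatives(points, labels, centroids):
    """Index of the point closest to each centroid (None for empty clusters)."""
    reps = []
    for c, centroid in enumerate(centroids):
        members = [i for i, label in enumerate(labels) if label == c]
        reps.append(min(members, key=lambda i: math.dist(points[i], centroid),
                        default=None))
    return reps

# Hypothetical term coordinates forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
reps = cluster_representatives(points, labels, centroids)
print(labels)  # cluster index for each point
print(reps)    # index of the most central point in each cluster
```

With these six points the first three fall into one cluster and the last three into the other, and the most central points are (0, 0) and (10, 10).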
Reading Graphs
In the following Principal Component Analysis graph one can quickly identify the term http (red arrow) as an outlier. The other terms are largely clustered around a line with a strong positive slope (as shown by the blue arrow). Strength refers to the degree of "scatter" in the plot. If the dots are widely spread, the relationship between variables is weak. If the dots are concentrated around a line, the relationship is strong.
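One conventional way to put a number on this notion of strength is the Pearson correlation coefficient, which approaches 1 or -1 as points concentrate around a line and approaches 0 as they scatter. The sketch below uses made-up coordinates; it is offered as an illustration of the concept, not as something ScatterPlot itself reports.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariance = sum(a * b for a, b in zip(dx, dy))
    spread = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return covariance / spread

# Points lying exactly on a line: a strong relationship.
tight = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
# The same x values with shuffled heights: a weaker relationship.
loose = pearson_r([1, 2, 3, 4], [2, 1, 4, 3])
print(round(tight, 3), round(loose, 3))  # 1.0 0.6
```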

If we switch to a Correspondence Analysis of the same data set, we can see that the terms http and humanist are more strongly correlated with documents that appear later in the corpus, and that the term x-humanist correlates with earlier documents. The word humanist is a frequently occurring term across the whole corpus, but its usage spikes in later documents:

Conclusions
ScatterPlot is a very interesting tool, but the learning curve may be steep for the less mathematically inclined. I had difficulty interpreting the meaning of the x and y axes; clearer labels would be extremely useful for the novice user. The tool leverages formal mathematical models for multivariate analysis, but one can derive useful information without a full grasp of the underlying formulas. This tool is a good starting point for pinpointing patterns and aberrations that may merit further investigation. ScatterPlot will yield the most interesting results when used alongside other tools, such as word trends and concordances, that offer a complementary view of the data. Stéfan Sinclair has created a Voyant view called the scatter skin for just such a purpose. For a more in-depth description of the Correspondence Analysis method, see his blog entry.

February 18, 2012 10:47 AM