Answers to Lab 4 Questions

Question 1: Including stop words in the word cloud clogs the visualization with words that are common across most writing in English. Words that aren’t necessarily “important,” or that carry no particular meaning, crowd out more distinctive words. Some of the most common words that aren’t stop words still appear, but they are much smaller. As a result, we get less of a sense of the actual content of the articles.
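To make the effect concrete, here is a minimal sketch of what a stop list does to a ranking of top terms. This is not how Voyant works internally; the tiny `STOP_WORDS` set, the `top_terms` helper, and the sample sentence are all made up for illustration.

```python
from collections import Counter
import re

# A tiny, hypothetical stop list for illustration (not Voyant's own list).
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "are"}

def top_terms(text, n=3, stop_words=frozenset()):
    """Return the n most frequent tokens, skipping any in stop_words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [t for t in tokens if t not in stop_words]
    return Counter(kept).most_common(n)

# Made-up sample text standing in for a news article.
sample = ("The humanities and the sciences are in the news. "
          "The future of the humanities is a topic of debate in the humanities.")

with_stops = top_terms(sample)                             # stop words kept
without_stops = top_terms(sample, stop_words=STOP_WORDS)   # stop words removed

print(with_stops)
print(without_stops)
```

With the stop words left in, “the” tops the list; once they are filtered out, “humanities” moves to the top, which is roughly what happens to the word cloud.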

Question 2: Raw frequency doesn’t take into account the length of articles, and a longer article simply has more room for a word to be repeated. The rate at which a word is used is more telling than just noting that the word “students” is used more in a 2,000-word article than in a 200-word article. Relative frequency displays not just the total number of occurrences, but where words are being used consistently in one context. We can find clusters of words rather than terms distributed thinly across documents. It can show us where words are important, not just that they appear across an entire corpus.
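The difference between the two measures can be sketched in a few lines. The example below assumes two made-up articles of 200 and 2,000 words and reports relative frequency per 1,000 words; that scale, the `raw_and_relative` helper, and the filler text are my own choices for illustration and may not match how Voyant scales its numbers.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def raw_and_relative(text, term):
    """Return (raw count, occurrences per 1,000 words) for a term."""
    tokens = tokenize(text)
    raw = Counter(tokens)[term]
    relative = raw * 1000 / len(tokens) if tokens else 0.0
    return raw, relative

# Hypothetical documents: the same word in articles of very different lengths.
short_article = "students " * 10 + "filler " * 190    # 200 words
long_article = "students " * 20 + "filler " * 1980    # 2,000 words

short_raw, short_rel = raw_and_relative(short_article, "students")
long_raw, long_rel = raw_and_relative(long_article, "students")

print("short:", short_raw, short_rel)
print("long: ", long_raw, long_rel)
```

The long article has the higher raw count, but the short article uses “students” at a higher rate, so the two measures rank the documents in opposite orders.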

Question 3: The document is an article of complaint, an open letter about cuts to an honors program at the Morrissey College of Arts and Sciences, abbreviated MCAS. All of the appearances of “mcas” are located in this single document. Most of them accompany the names of future graduates of the college, in a list that forms part of the open letter’s petition. The context of these appearances of “mcas” tells us that, for the questions we may want to answer about how the humanities are written about in news articles, this term is not as significant as the statistics might suggest. While these metrics might not always show us answers to our questions, they can help us eliminate what is not relevant in context and lead us to new research questions. (For instance, Rachel suggested in class that this might lead us to investigate the genre of complaint letters in the humanities.)

Question 4: Some of the terms that are more associated with the sciences than the humanities are perhaps obvious. “Engineering” is more common in the science corpus, as is “technology.” These are terms that are more commonly associated with the scientific disciplines. “New” is also more common in the science corpus, which seems to reflect the idea that the sciences are perhaps more innovative than the humanities. (This idea should perhaps come under a bit more scrutiny.) The humanities corpus reflects some discipline-specific vocabulary as well: “history,” “arts,” and “English” are all terms that are more related to the humanities content. The terms also seem to reflect a more institution-focused theme in the humanities content. “Majors,” “class,” “academic,” and “school” are all more common across the humanities corpus. Interestingly, or perhaps infuriatingly, terms like “study” and “research” are more common across the science corpus. This would seem to suggest that institutional concerns are more common in the humanities articles (perhaps related to the trend of funding cuts from the previous question), while issues of actual work, research, and study in the sciences are at the forefront of that corpus.

Question 5: I played around with a couple of different tools, including Bubbles, TextualArc, and finally WordTree. I found Bubbles and TextualArc, both intended to visualize keywords and their relationships to the text, to be engaging, as they are both animated tools. However, I was put off because following their visualizations was not intuitive to me. The WordTree tool was very clear and created a simple diagram that showed a keyword and its associations to other words. It is essentially a breakdown of the Contexts tool, but it shows multiple sentence structures. What I had hoped to see was the most common words associated with the keyword (I chose “humanities”) and the ways sentences were structured around that word. However, the Help page for the WordTree tool includes a disclaimer that “the branches shown are not necessarily based on frequency.” The available information is interesting, and could potentially be fruitful in a more linguistic analysis of the corpus. However, because it is not specified how the WordTree terms are selected, the tool might not give a full picture of how the terms are being used. [Image: the WordTree I looked at]

A Reflection on Exploratory Data Analysis

One of the things I have been enjoying about the Data-Sitters Club is their exploratory approach to their project. It’s the approach I try to take with most of my work, both in DH and literary studies, but I did not really have a label for it until now. In “DSC #6: Voyant’s Big Day,” the question that Katherine Bowers asks is “what can Voyant tell us about the BSC slang?” This question really lets the tool and the data do the talking. Of course, this comes with challenges because when you let the tool do the talking, the tool can create its own biases. However, rather than hinging her research question on her own memory or reading of the slang in the Baby-Sitters Club books, Bowers allows the results of this analysis to lead her to each next question. Taking a similar approach with Voyant for this lab allowed me to both get to know the tool better and ask more specific questions of the data based on the tools we used for each tab. I keep coming back to the trial and error theme. For each new tool and task, I have tried and failed multiple times before getting to a satisfying end goal. I have been required to reframe my goals and ask new questions for each step of the lab. Not taking results at face value and continuing to question and try again is really essential to the exploratory work we have been doing.

As mentioned above, letting the tools lead the way can be extraordinarily beneficial for an exploratory approach, but it also feels dangerously close to “the numbers speak for themselves,” which Chapter 6 of Data Feminism demonstrated is not at all true. 1 Oftentimes, the numbers actually say more about their context than about the realities they are meant to represent. I learned a lesson about making assumptions in data analysis in my exploration of the WordTree tool. Before I read a more detailed description of the tool, I assumed that the visualization would naturally include the most common associated words in the tree for each keyword. This was not the case, and the algorithm or system by which the associated words are chosen is not given. That assumption could have seriously threatened my analysis had I been doing significant work with this particular tool. While not as high stakes as the examples discussed by D’Ignazio and Klein in their book, this was a clear example of both the need to gather context and how the presentation of the data skews how it is read.

  1. 6. The Numbers Don’t Speak for Themselves. (2020). In Data Feminism. Retrieved from https://data-feminism.mitpress.mit.edu/pub/czq9dfs5