Dataset Analysis

Link to The Reading Early Medicine Description

I selected a dataset from the Reading Early Medicine (REM) project to analyze for this lab. The project has gathered records from the English Short Title Catalogue (ESTC) of texts concerning health, healing, and medicine printed before 1700. It has made the records available in a searchable database and as downloadable CSV files. I chose the ‘Remedies, Multiple’ dataset as my primary dataset for this lab because it contains the largest number of texts, but the datasets for other genres and topics are formatted in the same way, so my analysis can apply to the project as a whole.

One of the strengths of this dataset is simply that it exists. Gathering all of these texts in one central database, when finding them would otherwise take hours of searching by different keywords and authors, is incredibly helpful for researchers in a variety of disciplines. The thoroughness of the search and dataset-building process is another of its great strengths. The project’s website describes a search process that used a variety of keywords truncated to allow for multiple spellings and forms, such as “medic*.” Some of these keywords might be intuitive even for a non-scholarly researcher, but many of them require period-specific and genre-specific knowledge. The inclusion of works by classical authors like Hippocrates and Galen that were reprinted in the early modern period is also helpful, as reprinted ancient works may not share many of the keywords in the original search terms. The project directors were incredibly thorough as they built the database. I am reminded of the Data-Sitters Club article arguing that members of DH projects with background knowledge are just as important to a project’s success as those with technical skills.1 Knowledge of early printed texts and the language of medicine is crucial to building a resource like this one. The website lists a separate IT team that built the infrastructure of the site and database, while the topic-specific work has been left to those with more topical knowledge.

The structure of the multiple datasets available for download lets users select the specific genre or topic relevant to them. One minor oversight is that the entire corpus of texts is not available as a single download; only the datasets organized by particular genre or topic are. It is difficult to get a sense of the entire corpus this way, since it would require downloading the CSV for each genre. Apart from this technical oversight, there is a lot of value in the genre/topic model under which the dataset operates. Both genre and topic categories are defined on the website, for researchers who may not be familiar with genres from the early modern period and to clarify the topic labels imposed by the project directors. Genres were fairly well defined within the early modern period, but one genre might cover multiple topics. Researchers seeking a sense of the genre traditions in early medical texts can fairly easily seek out certain genres and cross-reference the topics that appear in them. The datasets are presented in a way that is easily used with visualization tools like Voyant, which would, in a preliminary search, give a broad view of a certain genre or topic. This could point researchers to specific texts for close reading and deeper research on their own using Early English Books Online (EEBO) or in-person archival work. The dataset I am focusing on covers the topic “remedies, multiple,” and this topic appears alongside other topics in books categorized by genre. For instance, The Plagues Approved Physitian is listed under the “plague tract” genre, but it contains the topics “remedies, multiple,” “disease, single,” and “plague.” Listings like this one could help answer a research question about how many plague tracts contained remedies for the plague versus solely information about the plague.
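To make the last two points concrete, here is a minimal Python sketch of how a researcher might stitch several genre or topic downloads into one working corpus and then count how many plague tracts also carry the “remedies, multiple” topic. The column names (`estc_id`, `genre`, `topics`), the semicolon separator between topics, and the sample rows are all assumptions made for illustration; the actual REM CSV headers and formatting may differ and would need to be checked against a downloaded file.

```python
import csv
import io

# Hypothetical column names and separator; check the real REM CSV headers.
ID_COL = "estc_id"
GENRE_COL = "genre"
TOPICS_COL = "topics"
TOPIC_SEP = ";"


def merge_records(csv_files):
    """Combine rows from several open CSV files, keeping one row per record ID."""
    merged = {}
    for handle in csv_files:
        for row in csv.DictReader(handle):
            # A text can appear in more than one download; keep the first copy.
            merged.setdefault(row[ID_COL], row)
    return list(merged.values())


def split_topics(row):
    """Return the set of lowercased topic labels attached to one record."""
    parts = row[TOPICS_COL].split(TOPIC_SEP)
    return {part.strip().lower() for part in parts if part.strip()}


def count_genre_with_topic(rows, genre, topic):
    """Return (records in the genre, records in the genre that carry the topic)."""
    in_genre = [r for r in rows if r[GENRE_COL].strip().lower() == genre.lower()]
    with_topic = [r for r in in_genre if topic.lower() in split_topics(r)]
    return len(in_genre), len(with_topic)


# Invented sample data standing in for two downloaded files.
remedies_file = io.StringIO('''estc_id,title,genre,topics
X1,Sample tract A,plague tract,"remedies, multiple; disease, single; plague"
X2,Sample book B,recipe book,"remedies, multiple"
''')
plague_file = io.StringIO('''estc_id,title,genre,topics
X1,Sample tract A,plague tract,"remedies, multiple; disease, single; plague"
X3,Sample tract C,plague tract,"plague"
''')

rows = merge_records([remedies_file, plague_file])
total, with_remedies = count_genre_with_topic(rows, "plague tract", "remedies, multiple")
print(f"{with_remedies} of {total} plague tracts include multiple remedies")
```

With real downloads, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="", encoding="utf-8")`. Deduplicating on a record identifier matters here because the same text can be listed under several topics and would otherwise be counted more than once.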

The database is also searchable by printer and stationer street names, which gives researchers even more opportunity to track down specific data relevant to their topic of interest. However, while the database contains this field, the CSVs do not. Including this information in the downloadable datasets would increase what researchers can accomplish with them. It is possible to browse the location labels and printer names on the website itself, but having that data in hand would allow researchers to use it for visualizations.

While most of the data has an excellent editorial rationale on the website, I have questions about a few decisions of inclusion and exclusion. The date range is the most pressing of these. Mary Fissell’s original date range ran from 1640 to 1800, the period covered in her monograph. However, when the concept became a database, Leong and Fissell changed the range to 1480-1700 to encompass ‘early medicine.’ The early modern period has long had fuzzy boundaries that occasionally bleed into the Restoration and the long eighteenth century. The tradition of early medicine extends into eighteenth-century texts, so I would appreciate more documentation about this decision. More texts could easily be added, as the ESTC’s range extends to 1800.

I am excited to see the next stages of this project, as the website lists the next description, author’s occupation, as currently under construction. Integrating more information about the authors into the database would be immensely helpful when considering questions of authority in early medical writings. This is particularly relevant for books authored by women, for which there is already a separate CSV available. The author occupation category can guide researchers towards further close readings of paratexts to determine how authors of early medical texts figured their authority in relation to their occupation. Syllabi incorporating the REM project are also ‘coming soon,’ and there are already multiple suggested teaching activities that would integrate the database into the university classroom. While the existence of this information in a searchable database is in itself an immensely helpful resource, the downloadable CSVs, pedagogical materials, and fairly extensive rationale make the project useful across multiple contexts.

  1. DSC #3: The Truth About Digital Humanities Collaborations (and Textual Variants!)