Early Modern Care Dataset and Codebook

Technical Documentation

Early Modern Care is a dataset of 1022 data points, each presented as a row of a Google Sheet. Each data point is the bibliographic record of an Early Modern print text (from roughly 1500-1700). The texts chosen for this dataset are catalogued in the English Short Title Catalogue (ESTC), housed by the British Library (BL), and relate to our definition of Early Modern Care.

Our definition of care and this dataset are inspired by a previous digital humanities project that we created, Early Modern Maternity and Caretaking (EMMC). Over five months, from August to December 2021, we curated by hand (through bibliography mining and other research methods) a list of 20 early print texts on the subject of caretaking written by women. This preliminary, hand-curated dataset is housed under the “Print Text CSV” tab of EMMC. In the “STC” column of this preliminary CSV, we provided links to each print edition’s catalogue record in the ESTC. To fully understand the boundaries of our definition of care for the final project dataset for ENG 612, we explored the keywords that BL librarians assigned to each print text on EMMC and chose 10 keywords to use as the basis for our definition of care.

We envision Early Modern Care for our scholarly research as positive acts of caring for the whole person grounded in the body and materiality. In light of this definition, we chose the following keywords to identify texts for the Early Modern Care dataset. Our keywords are: Cookery; Home Economics; Canning and Preserving; Etiquette; Gynecology; Obstetrics; Midwifery; Midwives; Parent and Child; Childcare. Each of these ten keywords in the ESTC calls up fewer than 200 texts. We chose to focus on these keywords because they were manageable to scrape from the ESTC by one person in one sitting.

We scraped the records using the Zotero Research Assistant add-on extension for the Chrome web browser. The extension obtains all of the bibliographic information for each text, including all the keywords BL librarians attribute to it. These BL keywords are important information for us to retain, as we hope to turn Early Modern Care into a functional and searchable database. They are a starting point for that future phase of the project.

Once we scraped all the texts under a keyword, for instance the keyword “Gynecology,” the information was stored in the Zotero Research Assistant desktop application. We scraped ten texts at a time, as they appear in the ESTC database. To trace how we arrived at the keyword “Gynecology-Early works to 1800,” we looked at Jane Sharp’s 1671 edition of The Midwives Book, which is catalogued in the Print Text CSV of EMMC. We located the keyword “Gynecology-Early works to 1800” in its record, clicked on it in the ESTC, and then clicked on the “Find other documents in the catalogue” button. We found that this keyword was scrapable in one sitting (fewer than 200 entries) and proceeded to scrape 10 entries at a time.

For instance, once all 67 texts under the keyword “Gynecology-Early works to 1800” were scraped using Zotero and stored in the desktop app, we exported the Zotero folder to a CSV file on our local computers. Then we uploaded the CSV file to Google Drive and converted it into a Google Sheet. The raw data for each keyword was initially messy, with repeats of each edition. First, we went through by hand and deleted rows that repeated the same information. Then we deleted empty columns and metadata columns that are unhelpful to us, such as the date on which we, the researchers, accessed the record. Next, we added column G and column H. In column G, we hand-assigned gender under three categories: male, female, and unknown. For any text whose author’s gender we could not ascertain, we defaulted to unknown rather than to male. For texts by multiple authors with at least one female author, we defaulted to female. Finally, in column H, we added the exact keyword from the ESTC that we used to scrape, in this example “Gynecology-Early works to 1800.” This makes our process transparent and reproducible for other researchers.
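The cleaning steps above were carried out by hand in a Google Sheet. As a rough illustration only, a scripted equivalent might look like the following sketch. The function names, the "Access Date" column name, and the treatment of a mix of male and unknown authors as unknown are assumptions made for the example, not part of our documented process.

```python
def clean_keyword_export(rows, keyword, drop_columns=("Access Date",)):
    """Clean one keyword's export; `rows` is a list of dicts, one per CSV row.

    The column name "Access Date" is an assumed example of an unhelpful
    metadata column.
    """
    seen = set()
    unique_rows = []
    for row in rows:
        # Rows that differ only in a dropped column (e.g. access date) count as repeats.
        fingerprint = tuple(
            sorted((col, (val or "").strip()) for col, val in row.items()
                   if col not in drop_columns)
        )
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique_rows.append(row)

    # Keep only wanted columns that hold a value in at least one row.
    columns = list(unique_rows[0]) if unique_rows else []
    kept = [
        col for col in columns
        if col not in drop_columns
        and any((row.get(col) or "").strip() for row in unique_rows)
    ]

    cleaned = []
    for row in unique_rows:
        new_row = {col: (row.get(col) or "").strip() for col in kept}
        # Gender is assigned by hand later; "unknown" is only the starting value.
        new_row["Gender of Author"] = "unknown"
        # Record the exact ESTC keyword used for this scrape.
        new_row["Keyword Search"] = keyword
        cleaned.append(new_row)
    return cleaned


def combined_gender(author_genders):
    """Apply the defaulting rule to a list of per-author labels."""
    if "female" in author_genders:
        return "female"  # at least one female author
    if author_genders and all(g == "male" for g in author_genders):
        return "male"
    return "unknown"  # never default to male when gender is uncertain
```

A script like this would not replace the hand review: deciding whether two rows describe the same edition, and assigning gender at all, remain judgment calls.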

For the sake of documentation, transparency, and reproducibility, we retained each original CSV from Zotero and each cleaned CSV according to keyword. Ultimately, we compiled all the keyword-based cleaned data into one master sheet, sorted it in ascending order by author name, and created a codebook for our metadata fields, which are listed below. This master sheet is called “Early Modern Care Final Dataset,” with the “Codebook” as a second tab of the Google Sheet.
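The compilation step can be sketched in a few lines. This is an illustration under stated assumptions rather than a record of what we ran: the function name and the "Author" column label are hypothetical, and the sketch does not merge texts that appear under more than one keyword.

```python
def build_master_sheet(cleaned_sheets):
    """Combine cleaned per-keyword rows into one list sorted by author.

    `cleaned_sheets` maps each ESTC keyword to its list of row dicts.
    """
    master = []
    for rows in cleaned_sheets.values():
        # Copy each row so the per-keyword sheets stay untouched.
        master.extend(dict(row) for row in rows)
    # Ascending, case-insensitive sort; rows with no author sort first.
    master.sort(key=lambda row: (row.get("Author") or "").casefold())
    return master
```

Because the per-keyword sheets are copied rather than modified, each cleaned CSV stays available alongside the master sheet.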

Our metadata fields are as follows: A. ESTC identification number; B. Materiality of the text; C. Author name; D. Full Title; E. Publication Date; F. Page numbers; G. Gender of Author; H. Keyword Search from ESTC Subject headings; I. Shortened Title; J. Printer; K. Publication Place; L. URL of ESTC; M. Subject keywords as defined by ESTC. We acknowledge that Column M, Subject keywords as defined by the BL librarians at the ESTC, will need to be hand cleaned and organized using regular expressions in the future.
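As a starting point for that future cleaning of Column M, a regular-expression pass might look like the sketch below. It assumes that the subject keywords in one cell are separated by semicolons, which should be checked against the actual export; the function name is hypothetical.

```python
import re


def split_subject_keywords(cell):
    """Split one Column M cell into a list of tidy subject headings."""
    headings = []
    seen = set()
    # Assumed delimiter: a semicolon with optional surrounding whitespace.
    for part in re.split(r"\s*;\s*", cell.strip()):
        # Collapse internal runs of whitespace and drop a trailing period.
        heading = re.sub(r"\s+", " ", part).strip().rstrip(".")
        if not heading:
            continue
        key = heading.casefold()
        if key in seen:
            continue  # skip case-insensitive repeats
        seen.add(key)
        headings.append(heading)
    return headings
```

For example, `split_subject_keywords("Gynecology-Early works to 1800.; Midwifery")` returns `["Gynecology-Early works to 1800", "Midwifery"]`. Stripping the trailing period would also shorten headings that legitimately end in an abbreviation, so the output still calls for review by hand.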

The most striking part of our process of creating the dataset Early Modern Care for the ENG 612 final project is the reproducibility of our work. Early in this discussion of our process or, as others might think of it, our research methods, I mention the word curation. Julia Flanders and Trevor Muñoz aim to define data curation in their introduction to Digital Humanities Data Curation. They begin by defining curation in the vein of early print and manuscript studies and the care of a text. This rightly has great overlap with the kind of caretaking we hope to highlight with our dataset Early Modern Care. It also speaks to the kind of ethos we hope to build as we document our process of data collection, an activity Flanders and Muñoz identify as key: “[Data curation] carries with it the burden of capturing and preserving not only the data itself, but information about the methods by which it was produced.”1 We hope that this portion of our critical introduction carries on the tradition that Flanders and Muñoz seek to create.

We would be remiss to only touch upon the ethos of documenting our project. Another important aspect of understanding our process is understanding our data cleaning. Katie Rawson and Trevor Muñoz discuss the lack of transparency around data cleaning and what it entails. In their chapter “Against Cleaning,” they write: “In reality, data cleaning is a consequential step in the research process that we often make opaque by the way we talk about it. That we employ obscuring language like ‘data cleaning’ should be a strong invitation to scrutinize, perhaps reimagine, and almost certainly rename this part of our practice.”2 This portion of our critical introduction pushes back against this opacity. Yes, we clean data, like most digital humanities researchers. But we attempt here to explain our process in detail and also to acknowledge that our data, such as the BL keywords, is still messy, and that is okay.

Curating and cleaning are practices that tie into the notion of radical acts of care: the care of data. Through our exploration of early modern caretaking we hope to make a connection to a 21st-century kind of care that is different from the kind of early modern care we seek to circumscribe. Data care is ephemeral, abstract, sometimes singular, and not grounded in the body or the material. Yet it is no more or less important than early modern caretaking.

Conceptual Framework

The Early Modern Care dataset provides a comprehensive corpus of Early Modern printed texts on caregiving. Defined by the concept of cura personalis, or “care of the whole person,” this dataset encompasses materials on caregiving according to Early Modern ideas of care. The time period this dataset covers is that of the English Short Title Catalogue (ESTC), 1473-1800. This time period covers the Early Modern period and slightly beyond, accounting for the immense boom of printed texts at the end of the seventeenth century during which Early Modern ideas and values can still be traced in printed texts. Care is not necessarily easily defined, and thus creating a database of these texts is not an altogether intuitive process. The exclusion metrics were equally difficult to define. For the current dataset, all of the texts are positive rather than punitive; the dataset chooses to focus on positive actions that can be taken to enact care, rather than instructions on how to avoid punishment or read signs. The keywords that have been scraped thus far are grounded in the body, and their instructions are actionable by all people. For instance, astrological pamphlets and almanacs have been excluded because they instruct their readers on how to interpret celestial signs, rather than on actions they can take themselves to best interact with celestial forces. The texts are relevant across classes, and not exclusive to the elite. Grounded in bodily actions and how people move within their environment, these texts are focused on positive action and personal autonomy. The Early Modern Care project seeks to aggregate texts on care practices from throughout the Early Modern period.

This dataset addresses issues surrounding access to Early Modern texts that are often gendered in their construction. Caregiving has been traditionally gendered as a female activity and thus has not received much attention in scholarly literature, especially considering the multifaceted nature of caregiving in the Early Modern period. This dataset is an attempt to define caregiving via texts collected through a digital humanities methodology. Especially for pedagogical purposes, this dataset will allow for a definition of caregiving with tangible examples of how that care would have been enacted through textual evidence. An aggregated resource for this gendered activity connects physical objects with a more nebulous theory of caregiving practices. That women were primarily involved in these caregiving actions contributes to the vague definition and lack of concrete knowledge of these practices. Writing on actual tasks that take place in a primarily domestic environment helps to lift the shroud of mystery that is often so present surrounding women’s household labor. The open access nature of this dataset allows it to be accessed for purposes of scholarly research and pedagogy, which will help foreground these practices in scholarship and the classroom. The aggregation of these sources cuts down the research legwork that is so often a barrier to practically implementing this scholarship. The dataset allows researchers to answer questions about gendered caregiving practices, the popular market for texts on caregiving, authorship, and definitions of care. Only about 5% of texts in the dataset have female authors, but many of the titles suggest that these texts were marketed primarily to women. This dataset will help answer questions about the differences between who was regarded as an authority on caregiving versus who was buying these texts versus who was profiting off of them. 
Early Modern Care can also further the examinations of how authors created authority for themselves and how that authority functions in the book trade. Ultimately, how did the public book trade marketplace reflect the domestic realities of caregiving in the home?

The dataset will eventually also be a point of comparison for other similar projects, such as the Reading Early Medicine project (REM). REM “built a robust bibliographic database of all works on health and healing published in English from the dawn of print until 1700,” which contains over 2500 titles.3 Early Modern Care will share some of the same texts with REM, but the different parameters will also put this dataset into conversation with the other project. How does medicine overlap with caregiving? Where does domestic practice overlap with the work of professional physicians? In the context of other digital humanities scholarship, the Early Modern Care dataset can be informed by and enlighten other projects. This dataset is in conversation with other projects that seek to expand access to and knowledge of domestic practices, such as the Making and Knowing Project at Columbia University and the Manuscript Cookbooks Survey. The Making and Knowing Project takes a practical and tactile approach to exploring domestic practice, actually making the recipes from an Early Modern manuscript receipt book. The Early Modern Care dataset will explore how instructions for these physical practices were disseminated through print. The Manuscript Cookbooks Survey is a practical resource for researchers, as it identifies repositories of unique manuscript items. The Early Modern Care dataset will allow researchers to put print and manuscript materials into conversations on caregiving, as manuscripts circulated alongside print materials, especially in domestic settings. Ultimately, this dataset aims to contribute to interdisciplinary conversations about gender, bibliography, labor, domestic culture, and economy across the Early Modern period.

Personal Reflection - Kate

The process of working on Early Modern Care helped me think more intentionally about the process of collaboration not only in DH but also in the humanities at large.

Almost no output in academia happens in a vacuum. Yet some people’s labor is acknowledged more than others’ (and acknowledged in the reward streams of academia). The relationship between graduate student research assistant and PI on a DH project was something that struck me intimately in Rachel Mann’s “Paid to Do but Not to Think: Reevaluating the Role of Graduate Student Collaborators” from Debates in the Digital Humanities. It has been shown in studies that graduate teaching assistants often outperform faculty in the classroom, and I can confirm this anecdotally from listening to colleagues recount their course evaluations. What concerns me most is what we are incentivizing in the academy. I feel like we are incentivizing a place of competition where the single-authored monograph is the gold standard of intellectual activity. When we read about data papers as we examined humanities journals such as The Journal of Cultural Analytics, I was thrilled to see datasets published and professionalized online. This helped me envision co-authoring a data paper on Early Modern Care with my colleague, Claire Richie, to publish on a similar platform. Yet, from our class discussion, I was shocked by how collaborative papers are counted in tenure review. This makes me think about how I value my work on this dataset and my role as a graduate student:

Graduate students in the humanities also need to be trained to write and publish critical, interpretive work based on DH projects.4

Working on this dataset has shown me that just because something may not be valued by traditional streams of reward does not mean that it is not worthwhile. Honestly, collecting this data was almost effortless for me. It was a task that I did for fun, to destress, rather than as a class assignment. The reason that the dataset is only about 1,000 points is that we conceptually ran out of keywords to scrape. I would have kept scraping, but Claire and I need to regroup conceptually before doing so. The value of this data-work for me blends into the personal and is not simply motivated by academic reward.

Throughout ENG 612, we have discussed the labor and intellectual work that goes into creating a dataset. However, through collaboration and strong advising, this dataset has felt far from laborious. The most laborious part of the dataset is the documentation of our curatorial choices for reproducibility and transparency. This has helped me think about how I make choices in life. Perhaps this is too meta or beyond the scope of this critical introduction, but it feels like one can go through life making a series of choices each day and not think about why they made those choices. I am certainly guilty of that. However, working on this dataset has made me understand the choice behind more “scientific” work or perhaps “big dick data.” This attention to choice-making makes me a more ethical scholar and a more ethical human.

The ethics of data strikes me. This dataset that we are working on has helped me quantify the percentage of women publishing in early modern England. I have learned to assert to others that women were in fact writing and publishing during the time of Shakespeare. I thought the patriarchal bias of the canon was to blame for these voices getting excluded, but that is only one of the factors. A major reason we rarely read and teach early modern women’s writing is that, in comparison, women’s writing is only a small fraction of what exists in the archives; I estimate only 5 percent. This is why this database matters. For me, it foregrounds women’s writing and situates it in the context of the male authors publishing on similar subjects in a way that no other early modern DH project has. Unlike the Pulter Project, which focuses on one text by Hester Pulter, or Women Writers Online, which creates an archive of texts by women, Early Modern Care contextualizes women’s writing on caretaking. It does not only foreground the “treasures” of early modern print on caretaking but rather reimagines the archive. Andrew Prescott and Lorna Hughes give good advice in their article “Why Do We Digitize? The Case for Slow Digitization” when they write, “There is a risk that digitization programs, by focusing on making “treasures” more widely available, will reinforce existing cultural stereotypes and canonicities.”5 We at Early Modern Care are trying to rebuild the canon to highlight the work of women in holistic embodied care. This is what fascinates me most about the work that we are trying to do.

Personal Reflection - Claire

As we developed our dataset, there were several considerations that were at the forefront of my work on this project. First and foremost, both Kate and I placed value on the transparency and the replicability of our data collection processes. While scraping the data for our initial dataset was a fairly straightforward process once we determined the most efficient method, it was important to us to be able to justify and document these choices. From choosing the metadata fields to strategically selecting our initial keyword searches, we prioritized having a record of each of the curation choices we made. Which of the metadata fields scraped by Zotero were worth keeping? And what fields would we have to add manually? What seemed intuitive to me as a researcher was not intuitive from a technical perspective or from a DH perspective. Parsing the fields to determine which metadata was there because it was available for our program to take and what would be important to include for researchers was difficult. What if we were eliminating something important? Ultimately, asking ourselves ‘why?’ for every decision we made was the most effective way to counteract this worry. If we could not come up with a reason, we could reconsider the decision. It was also important that I keep in mind that the decisions we made when curating our metadata, no matter how logical they seemed, could not be neutral. As pointed out in Chapter 6 of Data Feminism, no dataset is free of bias. “Rather than seeing knowledge artifacts, like datasets, as raw input that can be simply fed into a statistical analysis or data visualization, a feminist approach insists on connecting data back to the context in which they were produced.”6 It is important for us to consider the perspective we ourselves are projecting on the dataset, as well as the context from which the data came. We must consider the affordances and limitations of our database sources, namely the ESTC, EEBO, and Women Writers Online. 
Their data collection processes will affect our data collection processes. As much as we would like to think that our data is ‘raw,’ even the metadata fields these catalogs include ‘cook’ the data. There are hundreds of years of historical bias to consider in the case of Early Modern Care as well. The very nature of the dataset limits it. It will be a dataset that contains most of the known texts that have survived long enough to be cataloged by various projects. It will never accurately communicate the whole story of Early Modern caregiving, and the conclusions researchers may reach through using it will be reasonable assumptions at best. This is why I have found working through the context and documenting our decision-making processes to be so important. As articulated by D’Ignazio and Klein: “This context allows us, as data scientists, to better understand any functional limitations of the data and any associated ethical obligations, as well as how the power and privilege that contributed to their making may be obscuring the truth.”7 Our aim is to bring caregiving practices to light, and to do that we need to prioritize our own limitations and biases, as well as those of the data we have to work with.

Personally, having the model of the Data-Sitters Club website was an immense help. The way the project was documented provided a guideline for transparent practices while also providing practical help and guidance for our future work on the blog that will accompany the database. One of the barriers I faced when beginning work on this dataset was a lack of technical knowledge that made the scraping process seem daunting and the metadata selection process seem intimidating. I found DSC #3, “The Truth About Digital Humanities Collaborations (and Textual Variants!),” enlightening on this point. Maria’s journey to realizing her own importance in the DSC project, without as extensive a technical background as her team, mirrors my own. In our partnership, I was the one without a significant DH background, and I felt that made me the less important collaborator. But as Maria explains in her post, “It’s important for digital humanities teams to foreground this “both-and” (“yes, and”?) approach, from forming research groups that meaningfully include both digital and disciplinary experts to making sure that each member knows their contributions are essential to the project.”8 What I did bring to the project was a conceptual background in Early Modern studies, and I came to understand that this expertise was important, even if it was not yet accompanied by great technical skill. Even though we are not yet at the stage of performing analysis on our corpus, I was able to consider what aspects of the dataset would be most helpful to researchers who would use it in the future and where it sits now in the available scholarship on the topic of Early Modern caregiving. Even with the necessary caveats outlined above, I still have reasonable confidence that our dataset is thorough, replicable, and helpful to audiences in Early Modern studies and beyond.

  1. Julia Flanders and Trevor Muñoz, “An Introduction to Humanities Data Curation,” Digital Humanities Data Curation, https://archive.mith.umd.edu/dhcuration-guide/guide.dhcuration.org/glossary/intro/index.html.

  2. Katie Rawson and Trevor Muñoz, “Against Cleaning,” in Debates in the Digital Humanities 2019 (Minneapolis: University of Minnesota Press, 2019), https://dhdebates.gc.cuny.edu/read/untitled-f2acf72c-a469-49d8-be35-67f9ac1e3a60/section/07154de9-4903-428e-9c61-7a92a6f22e51#ch23 

  3. Mary Fissell and Elaine Leong, “Reading Early Medicine (beta),” accessed May 4, 2022, https://reademed.mpiwg-berlin.mpg.de/. 

  4. Rachel Mann, “Paid to Do but Not to Think: Reevaluating the Role of Graduate Student Collaborators,” in Debates in the Digital Humanities 2019 (Minneapolis: University of Minnesota Press, 2019), https://dhdebates.gc.cuny.edu/read/untitled-f2acf72c-a469-49d8-be35-67f9ac1e3a60/section/ea501a60-dd3c-4c22-a942-3d890c3a1e72

  5. Andrew Prescott and Lorna Hughes, “Why Do We Digitize?: The Case for Slow Digitization,” Archive Journal (September 2018), http://www.archivejournal.net/essays/why-do-we-digitize-the-case-for-slow-digitization/.

  6. Catherine D’Ignazio and Lauren F. Klein, Data Feminism (Cambridge: The MIT Press, 2020), 152-153.

  7. D’Ignazio and Klein, Data Feminism, 153. 

  8. Maria Sachiko Cecire, “DSC #3: The Truth About Digital Humanities Collaboration,” The Data-Sitters Club. January 10, 2020, https://datasittersclub.github.io/site/dsc3.html.