As we’ve explored in a number of different contexts throughout the semester, you can get a different sense of texts if you do pattern recognition at a slightly different angle. So far this has meant that we’ve considered a novel via its geography and a poetry collection at the macro level. And then there was that whole AMBP thing. But in each case, we did the “normal” English thing by reading and discussing the texts and then looking at them through a computational lens.
But what if we decided to do pattern recognition on an amount of reading that we couldn’t possibly handle in the final weeks of the semester? Consider this one last crazy digital humanities experiment!
The Nitty Gritty
Collectively, we will finish building a dataset of all of Hemingway’s stories, novels, and nonfiction. (Say, “Thanks!” to last year’s class for getting us started.) Individually, you will be scanning a number of pages from one or two of his books and then processing those scans with optical character recognition (OCR) software.
It turns out that books can be kind of hard to scan since there’s all that pesky page-turning to deal with. We’re going to simplify things by making these books less book-like and removing their bindings.
- You will scan the pages that you are assigned, turning the pages into PDFs. If you have selections from multiple books, please scan those pages in two chunks, creating two different files.
- There are a number of printers around the Woodruff Library that have automatic feeders and that can scan both sides of the page. You can come and do this work in the Emory Center for Digital Scholarship, where my home base is (3rd floor of the Woodruff Library). I can even show you how it works. I estimate that it will take you 15 minutes to do the scanning. The scanners will email you the file.
- When doing the scanning, it is important that you set the scanner to work in grayscale rather than black and white. If you don’t know how to do this, ask! The default on scanners around the library tends to be 200 dpi, and that should be good enough for what we’re doing.
- When you have received the file, rename it to lastname-nameofbook-pages; for example, croxall-forwhombelltolls-75-254.pdf. If you’re working with two different books, please break this into two different files. Email me a copy of the PDF(s).
- To do the OCR work, you will need to bring your PDFs to the Emory Center for Digital Scholarship. Four Macs in the Center have Prizmo installed on them, which you will use.
- When opening Prizmo, choose “New Document…” and then drag-and-drop your PDF onto the window.
- Make sure that the type is set to “ABC” (AKA “Text”). You shouldn’t have to make any other adjustments to the settings.
- Select all of the pages in your file by clicking on a single page image and choosing Edit > Select All (⌘A).
- Then click “Recognize.”
- Sit back and relax as Prizmo processes all of your text.
- Look through each page of the text, checking for misspellings or strange punctuation. I’m not asking you to read every single word, but look for anything that catches your eye. In particular, look for red underlines, which indicate misspellings or strings of nonsense characters. Make corrections in the right-hand side of the window if necessary.
- When you’ve finished all of your pages, choose Export > File > Regular Text. Make sure you include all pages.
- Name your file lastname-nameofbook-pages (for example, croxall-sunalsorises-100-286.txt) and email me the exported file.
- If you have pages from more than one book, you will have to go through this process with your two different files. But don’t worry, I made the number of pages equal for everyone.
- Important: Please do not spend any more than four hours on the OCR, even if you don’t finish.
- You need to get me all of your files no later than 12pm on Tuesday, 5 May. This will give me the chance to compile everything in time for our final.
- For the final exam, we will again use our old friend Voyant to find patterns in the Hemingway dataset. You will work in groups again and have free rein with the tools. Since we won’t have read everything, you’ll have a much different opportunity to draw conclusions than with Duffy’s poetry. Who knows, maybe even the word collocates will be interesting this time!
- In your groups, you will write an 800-word (minimum) blog post about the patterns that you’ve found and the interpretations that you derived from them. Finally, your group will also give a brief presentation on the work that you’ve done.
- Then we’ll all high-five each other and ride off into the sunset.
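If you would rather not type the file names by hand, the lastname-nameofbook-pages convention from the steps above can be sketched in a few lines of Python. This is only an illustration, not part of the assignment: the function names (`slug`, `build_name`, `looks_valid`) are made up for this example, and dropping the word “the” from titles and stripping hyphens from last names are guesses based on the two example file names given above.

```python
import re


def slug(title: str) -> str:
    """Lowercase a title, drop the word "the", and remove spaces and punctuation.

    Dropping "the" is an assumption inferred from the two example file names.
    """
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "".join(w for w in words if w != "the")


def build_name(lastname: str, title: str, first_page: int, last_page: int, ext: str) -> str:
    """Assemble lastname-nameofbook-pages.ext (ext is "pdf" or "txt")."""
    # Assumption: keep only the letters of the last name, so a hyphenated
    # name does not add an extra hyphen to the file name.
    last = re.sub(r"[^a-z]", "", lastname.lower())
    return f"{last}-{slug(title)}-{first_page}-{last_page}.{ext}"


# Hypothetical pattern for a quick check before emailing a file.
NAME_PATTERN = re.compile(r"[a-z]+-[a-z0-9]+-\d+-\d+\.(pdf|txt)")


def looks_valid(filename: str) -> bool:
    """Return True if the file name follows the convention sketched above."""
    return NAME_PATTERN.fullmatch(filename) is not None


if __name__ == "__main__":
    print(build_name("Croxall", "For Whom the Bell Tolls", 75, 254, "pdf"))
    print(looks_valid("scan001.pdf"))
```

Running the script prints the name built for the first example and then reports whether a scanner's default file name (here the invented `scan001.pdf`) would pass the check.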
This project, including the final group blog post, is worth 10% of your grade in the class. Because this is an experimental class project, you are not being graded on what you and your group find about Hemingway’s work. After all, we simply don’t know what we’ll find, if anything. Instead, you’ll be graded on (1) whether you accomplish all the parts of the assignment (pass / fail); (2) how engaged you are with the work and your group; and (3) how well you apply the method of screwing around / pattern recognition / interpretation we’ve been embracing throughout the semester.
This assignment was designed by Brian Croxall and is licensed under a Creative Commons Attribution (CC BY 4.0) license. Special props to Stewart Varner for telling me to stop thinking about Whitman; David Mimno and Ted Underwood for encouragement; and Paul Fyfe and Jason B. Jones for an idea that I gleefully ripped off.
| Student | Books | Pages |
|---|---|---|
| Adams | ToS / SS | all / 379-428 |
| An | THHN / FWBT | 143-262 / 1-24 |
| Kronfeld | GoE / DA | 147-247 / 1-42 |
| Nurse-McLeod | DA / GHA | 187-278 / 1-50 |
| Pownall | GHA / AMF | 195-295 / 15-58 |
| Schreiber | AMF / DS | 59-165 / 43-80 |
| She | DS / UK | 81-206 / 1-18 |
- ToS = Torrents of Spring
- THHN = To Have and Have Not
- FWBT = For Whom the Bell Tolls
- IS = Islands in the Stream
- GoE = Garden of Eden
- DA = Death in the Afternoon
- GHA = Green Hills of Africa
- AMF = A Moveable Feast
- DS = The Dangerous Summer
- UK = Under Kilimanjaro