How to NOT Read Hemingway

Results

See all the blog posts about our distant reading of Hemingway.

Rationale

As we’ve explored in a number of different contexts throughout the semester, you can get a different sense of texts if you do pattern recognition at a slightly different angle. So far this has meant that we’ve considered a novel via its geography and a poetry collection at the macro level. And then there was that whole AMBP thing. But in each case, we did the “normal” English thing by reading and discussing the texts and then looking at them through a computational lens.

But what if we decided to do pattern recognition on an amount of reading that we couldn’t possibly handle in the final weeks of the semester? Consider this one last crazy digital humanities experiment!

The Nitty Gritty

Collectively, we will finish building a dataset of all of Hemingway’s stories, novels, and nonfiction. (Say, “Thanks!” to last year’s class for getting us started.) Individually, you will be scanning a number of pages from one or two of his books and then processing those scans with optical character recognition (OCR) software.

Scanning

It turns out that books can be kind of hard to scan since there’s all that pesky page-turning to deal with. We’re going to simplify things by making these books less book-like and removing their bindings.

You will scan the pages that you are assigned, turning the pages into PDFs. If you have selections from multiple books, please scan those pages in two chunks, creating two different files.
There are a number of printers around the Woodruff Library that have automatic feeders and that can scan both sides of the page. You can come and do this work in the Emory Center for Digital Scholarship, where my home base is (3rd floor of the Woodruff Library). I can even show you how it works. I estimate that it will take you 15 minutes to do the scanning. The scanners will email you the file.
When doing the scanning, it is important that you set the scanner to work in grayscale rather than black and white. If you don’t know how to do this, ask! The default on scanners around the library tends to be 200 dpi, and that should be good enough for what we’re doing.
When you have received the file, rename it to lastname-nameofbook-pages; for example, croxall-forwhombelltolls-75-254.pdf. If you’re working with two different books, please break this into two different files. Email me a copy of the PDF(s).
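If you would rather not type the file name by hand, the naming convention above can be sketched in a few lines of Python. This is only an illustration: the function name scan_filename is made up for this example, and dropping the word "the" from the title is an assumption inferred from the sample names (forwhombelltolls, sunalsorises), not a stated rule.

```python
import re


def scan_filename(last_name, book_title, first_page, last_page, extension="pdf"):
    """Build a name in the lastname-nameofbook-pages pattern."""
    # Lowercase the last name and keep only letters and digits.
    name_part = "".join(re.findall(r"[a-z0-9]+", last_name.lower()))
    # Split the title into words and drop "the" (an assumption based on
    # the sample file names), then run the remaining words together.
    title_words = re.findall(r"[a-z0-9]+", book_title.lower())
    title_part = "".join(word for word in title_words if word != "the")
    return f"{name_part}-{title_part}-{first_page}-{last_page}.{extension}"


print(scan_filename("Croxall", "For Whom the Bell Tolls", 75, 254))
# croxall-forwhombelltolls-75-254.pdf
```

Passing "txt" as the last argument gives the matching name for the OCR output.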

OCR

To do the OCR work, you will need to bring your PDFs to the Emory Center for Digital Scholarship. Four Macs in the Center have Prizmo installed on them, which you will use.
When opening Prizmo, choose “New Document…” and then drag-and-drop your PDF onto the window.
Make sure that the type is set to “ABC” (AKA “Text”). You shouldn’t have to make any other adjustments to the settings.
Select all of the pages in your file by clicking on a single page image and choosing Edit > Select All (⌘A).
Then click “Recognize.”
Sit back and relax as Prizmo processes all of your text.
Look through each page of the text, checking for misspellings or strange punctuation. I’m not asking you to read every single word, but look for anything that catches your eye. In particular, look for red underlines, which indicate misspellings or strings of nonsense characters. Make corrections in the right-hand side of the window if necessary.
When you’ve finished all of your pages, choose Export > File > Regular Text. Make sure you include all pages.
Name your file lastname-nameofbook-pages (for example, croxall-sunalsorises-100-286.txt) and email me the exported file.
If you have pages from more than one book, you will have to go through this process with your two different files. But don’t worry, I made the number of pages equal for everyone.
Important: Please do not spend any more than four hours on the OCR, even if you don’t finish.
You need to get me all of your files no later than noon (12 pm) on Tuesday, 5 May. This will give me the chance to compile everything in time for our final.
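For the curious, here is a rough sketch of what the compiling step might look like once every .txt file is in one folder. It is not the actual process used for the class. It assumes that every file follows the lastname-nameofbook-pages.txt pattern with a single page range, and the names NAME_PATTERN and compile_corpus are invented for this example.

```python
import re
from pathlib import Path

# Matches names such as croxall-sunalsorises-100-286.txt
NAME_PATTERN = re.compile(r"^([a-z0-9]+)-([a-z0-9]+)-(\d+)-(\d+)\.txt$")


def compile_corpus(folder):
    """Return {book: combined text}, with each book's chunks in page order."""
    chunks = {}
    for path in Path(folder).glob("*.txt"):
        match = NAME_PATTERN.match(path.name)
        if match is None:
            continue  # skip files that do not follow the naming pattern
        _, book, first_page, _ = match.groups()
        text = path.read_text(encoding="utf-8")
        chunks.setdefault(book, []).append((int(first_page), text))
    # Sort each book's chunks by their starting page, then join them.
    return {
        book: "\n".join(text for _, text in sorted(parts, key=lambda part: part[0]))
        for book, parts in chunks.items()
    }
```

Sorting on the starting page as a number, rather than sorting the file names alphabetically, keeps a chunk that starts on page 19 ahead of one that starts on page 121.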

Final

For the final exam, we will again use our old friend Voyant to find patterns in the Hemingway data set. You will work in groups again and have free rein with the tools. Since we won’t have read everything, you’ll have a very different opportunity to draw conclusions than you did with Duffy’s poetry. Who knows, maybe even the word collocates will be interesting this time!
In your groups, you will write an 800-word (minimum) blog post about the patterns that you’ve found and the interpretations that you derived from them. Finally, your group will also give a brief presentation on the work that you’ve done.
Then we’ll all high-five each other and ride off into the sunset.

Grading

This project, including the final group blog post, is worth 10% of your grade in the class. Because this is an experimental class project, you are not being graded on what you and your group find about Hemingway’s work. After all, we simply don’t know what we’ll find, if anything. Instead, you’ll be graded on (1) whether you accomplish all the parts of the assignment (pass / fail); (2) how engaged you are with the work and your group; and (3) how well you apply the method of screwing around / pattern recognition / interpretation we’ve been embracing throughout the semester.

Credits

This assignment was designed by Brian Croxall and is licensed under a Creative Commons Attribution (CC BY 4.0) license. Special props to Stewart Varner for telling me to stop thinking about Whitman; David Mimno and Ted Underwood for encouragement; and Paul Fyfe and Jason B. Jones for an idea that I gleefully ripped off.

Name	Book(s)	Pages
Adams	ToS / SS	all / 379-428
Ahmad	SS	429-472, 567-650
Ahmed	THHN	1-142
An	THHN / FWBT	143-262 / 1-24
Au	FWBT	25-168
Bacon	FWBT	169-312
Brezel	FWBT	313-462
Casseday	IS	9-154
Chung	IS	155-296
Glenn	IS	297-446
Keuler	GoE	3-146
Kronfeld	GoE / DA	147-247 / 1-42
Lim	DA	43-186
Nurse-McLeod	DA / GHA	187-278 / 1-50
Penney	GHA	51-194
Pownall	GHA / AMF	195-295 / 15-58
Schreiber	AMF / DS	59-165 / 43-80
She	DS / UK	81-206 / 1-18
Siddiqi	UK	19-120
Smith	UK	121-232
Woodworth	UK	233-344
Croxall	UK	345-445

Title Key

ToS = Torrents of Spring
THHN = To Have and Have Not
FWBT = For Whom the Bell Tolls
IS = Islands in the Stream
GoE = Garden of Eden
DA = Death in the Afternoon
GHA = Green Hills of Africa
AMF = A Moveable Feast
DS = The Dangerous Summer
UK = Under Kilimanjaro