How to NOT Read Hemingway


See all the blog posts about our distant reading of Hemingway.


As we’ve explored in a number of different contexts throughout the semester, you can get a different sense of texts if you do pattern recognition at a slightly different angle. So far this has meant that we’ve considered a novel via its geography and a poetry collection at the macro level. And then there was that whole AMBP thing. But in each case, we did the “normal” English thing by reading and discussing the texts and then looking at them through a computational lens.

But what if we decided to do pattern recognition on an amount of reading that we couldn’t possibly handle in the final weeks of the semester? Consider this one last, crazy digital humanities experiment!

The Nitty Gritty

Collectively, we will create a data set of all of Hemingway’s stories and novels. Individually, you will be scanning a number of pages from one or two of his books and then processing those scans with optical character recognition (OCR) software.


It turns out that books can be kind of hard to scan since there’s all that pesky page-turning to deal with. We’re going to simplify things by making these books less book-like and removing their bindings.

  • You will scan the pages that you are assigned, turning the pages into PDFs. If you have selections from multiple books, please scan those pages in two chunks, creating two different files.
  • When doing the scanning, it is important that you set the resolution to 300 dpi. If you don’t know how to do this, ask! The default on scanners around the library tends to be 150-200 dpi, and that isn’t good enough for what we’re doing.
  • There are a number of printers around the Woodruff Library that you can use for free that have automatic feeders and that can scan both sides of the page. You can come and do this work in the Emory Center for Digital Scholarship, where my home base is (3rd floor of the Woodruff Library). I estimate that it will take you 15 minutes to do the scanning. The scanners will email you the file.
  • When you have received the file, rename it to lastname-nameofbook-pages; for example, croxall-forwhombelltolls-75-254.pdf. Email me a copy of the PDF(s).


  • To do the OCR work, you will need to bring your PDFs to the Emory Center for Digital Scholarship. Five Macs in the Center have Prizmo installed on them, which you will use.
  • When opening Prizmo, choose “New Document…” and then drag-and-drop your PDF onto the window.
  • Make sure that the type is set to “ABC” (AKA “Text”). You shouldn’t have to make any other adjustments to the settings.
  • Select all of the pages in your file by clicking on a single page image and choosing Edit > Select All (⌘A).
  • Then click “Recognize.”
  • Sit back and relax as Prizmo processes all of your text.
  • Look through each page of the text, checking for misspellings or strange punctuation. I’m not asking you to read every single word, but look for anything that catches your eye. In particular, look for red underlines, which indicate misspellings or strings of nonsense characters. Make corrections in the right-hand side of the window if necessary.
  • When you’ve finished all of your pages, choose Export > File > Regular Text. Make sure you include all pages.
  • Name your file nameofbook-pages; for example, sunalsorises-100-286.txt. Then email me the exported file.
  • If you have pages from more than one book, you will have to go through this process with your two different files. But don’t worry, I made the number of pages equal for everyone.
  • You need to get me all of your files no later than 5pm on Wednesday, 30 April. This will give me the chance to compile everything in time for our final.
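
The two naming conventions above can be sketched as a small Python helper. This is only an illustration, not part of the assignment: the function names (`slug`, `scan_name`, `ocr_name`) are hypothetical, and dropping the word “the” from book titles is an assumption inferred from the two example file names.

```python
import re


def slug(text, drop_the=False):
    """Lowercase the text and join its letter/digit runs with no separators."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if drop_the:
        # Assumption: the example file names omit "the" from book titles.
        words = [w for w in words if w != "the"]
    return "".join(words)


def scan_name(lastname, book, first_page, last_page):
    """Name for the scanned PDF: lastname-nameofbook-pages.pdf"""
    return f"{slug(lastname)}-{slug(book, drop_the=True)}-{first_page}-{last_page}.pdf"


def ocr_name(book, first_page, last_page):
    """Name for the exported text file: nameofbook-pages.txt"""
    return f"{slug(book, drop_the=True)}-{first_page}-{last_page}.txt"


print(scan_name("Croxall", "For Whom the Bell Tolls", 75, 254))
print(ocr_name("The Sun Also Rises", 100, 286))
```

Renaming the files by hand works just as well; the sketch simply makes the pattern explicit.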


  • For the final exam, we will again use our old friend Voyant to find patterns in the Hemingway data set. You will work in groups again and have free rein with the tools. Since we won’t have read everything, you’ll have a much different opportunity to draw conclusions than with Duffy’s poetry. Who knows, maybe even the word collocates will be interesting this time!
  • In your groups, you will write a 700-word (minimum) blog post about the patterns that you’ve found and the interpretations that you derived from them. Finally, your group will also give a brief presentation on the work that you’ve done.
  • Then we’ll all high-five each other and ride off into the sunset.
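
If you are curious what a collocate count is doing, here is a rough Python sketch of the idea: for every occurrence of a keyword, tally the words that appear within a few positions of it. This is not Voyant’s implementation; the function name `collocates`, the `window` parameter, and the simple tokenizing rule are all assumptions made for illustration, and the sample sentences are made up rather than quoted from Hemingway.

```python
import re
from collections import Counter


def collocates(text, keyword, window=2):
    """Count the words that occur within `window` tokens of each `keyword`."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            start = max(0, i - window)
            # Neighbors on both sides, excluding the keyword itself.
            neighbors = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbors)
    return counts


# Made-up sample text, not a quotation.
sample = "The old man was thin. The old man fished alone. The sea was old."
result = collocates(sample, "old", window=1)
print(result.most_common())
```

A wider window catches looser associations but also more noise, which is one reason collocate lists can look uninteresting on a small corpus.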


This project, including the final group blog post, is worth 10% of your grade in the class. Because this is an experimental class project, you are not being graded on what you and your group find about Hemingway’s work. After all, we simply don’t know what we’ll find, if anything. Instead, you’ll be graded on (1) whether you accomplish all the parts of the assignment (pass / fail); (2) how engaged you are with the work and your group; and (3) how well you apply the method of pattern recognition / interpretation we’ve been embracing throughout the semester.


This assignment was designed by Brian Croxall and is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) license. Special props to Stewart Varner for telling me to stop thinking about Whitman; David Mimno and Ted Underwood for encouragement; and Paul Fyfe and Jason B. Jones for an idea that I gleefully ripped off.

Albrecht SS 5-98
Bloch SS 99-192
Briones SS 193-284
Curtis SS 285-378
Dillman SS 379-472
Dillon SS 473-566
Drumm SS 567-650
Gesang SAR 7-186
Gonzales SAR/AFTA 187-251/3-120
Gulati AFTA 121-306
Jhol AFTA/THHN 307-332/1-162
Kazi THHN/FWBT 163-262/1-87
Levine FWBT 88-274
Lim FWBT 275-462
Lo FWBT/ACR 463-471/11-186
O’Connor ACR/IS 187-283/9-98
Qureshi IS 99-284
Stechmann IS/TaFL 285-446/13-32
Valerstain TaFL 33-182
Weiss TaFL 183-311
Zeng OMS/ToS all
Croxall GoE all