Below is a detailed description of the results I arrived at in my Digital Humanities Certificate project at Northeastern, with some helpful resources for anyone who would like to pursue historical topic modeling. My 3,000-word writeup on the entire project, with more references and methodological discussion, can be found here.
My latest digital humanities project is a topic model of headline news coverage about the anti-Vietnam War movement between 1964 and 1966. In February 1965, President Lyndon Johnson escalated United States military presence in Vietnam and called up 50,000 draftees. This escalation quickly galvanized the peace movement because the Vietnam War would now have a more direct impact on young people’s lives. Although students traditionally received “2-S” deferments from the draft, President Johnson would revise this policy in 1965 to deny deferments to students who fell below a certain class rank and below a certain score on an aptitude test administered by the Selective Service. Even students who kept their deferments had friends and family affected by Johnson’s military escalation. Both young and old also had deep political or philosophical commitments that guided their activism, even if their personal lives were not directly upended by the war.
Headline news shaped how the reading public understood the anti-Vietnam War movement at this time. Unlike editorials or advertisements, which have a more obvious bent, headline news is ostensibly an immediate, neutral recounting of recent events. But headline news is not neutral; journalists have to make choices about the selection, arrangement, and prioritization of information, and those choices shape the topics represented in my corpus. Similarly, my own decisions about the data that went into my corpus were not objective. One rewarding aspect of pursuing digital humanities has been learning to detect biases in other people’s data (a reflective process) while understanding my own biases when creating data (a reflexive process).
Topic modeling, a method used in the digital humanities, can improve our close readings by computationally surfacing patterns of word co-occurrence across large amounts of text. I used MALLET (Machine Learning for Language Toolkit), a program developed at UMass Amherst, to generate word associations in a collection (corpus) of 118 text files made from PDFs of newspaper articles about my subject. MALLET runs from the command line (the ultimate “back-end”) on your computer, and it produces results pretty quickly. One of the links below also describes the Topic Modeling Tool, which runs MALLET through a Graphical User Interface (GUI) that only requires changing some settings and clicking a few icons. I originally ran MALLET through the command line to get more familiar with the back-end, but eventually used the Topic Modeling Tool because it generated clean CSV data that helped me create some cool visualizations, discussed below.
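For anyone who would like to try the command-line route, here is a minimal sketch of the two MALLET steps (importing a folder of text files, then training the model), written as a small Python helper that only builds the commands. The folder names (articles_txt, output), the output file names, and the --optimize-interval setting are assumptions for illustration, not the settings used in this project; the subcommands and flags follow the Programming Historian tutorial listed below.

```python
from pathlib import Path


def build_mallet_commands(corpus_dir, work_dir, num_topics=20, mallet_bin="mallet"):
    """Return the two MALLET command lines (import, then train) as argv lists."""
    work = Path(work_dir)
    corpus_file = work / "corpus.mallet"  # hypothetical file name

    import_cmd = [
        mallet_bin, "import-dir",
        "--input", str(corpus_dir),       # folder of plain-text articles
        "--output", str(corpus_file),
        "--keep-sequence",                # needed for topic modeling
        "--remove-stopwords",             # drop MALLET's default English stopwords
    ]
    train_cmd = [
        mallet_bin, "train-topics",
        "--input", str(corpus_file),
        "--num-topics", str(num_topics),
        "--optimize-interval", "20",      # assumed setting, not confirmed by the project
        "--output-topic-keys", str(work / "topic_keys.txt"),
        "--output-doc-topics", str(work / "doc_topics.txt"),
    ]
    return import_cmd, train_cmd


if __name__ == "__main__":
    for cmd in build_mallet_commands("articles_txt", "output"):
        print(" ".join(cmd))
        # To actually run a step: subprocess.run(cmd, check=True)
```

Printing the commands first makes it easy to check paths before running anything; the Topic Modeling Tool wraps roughly the same two steps behind its interface.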
References and Resources
- MALLET tutorial by The Programming Historian; Download MALLET here.
- For scholarly applications of topic modeling: Sarah Connell and Julia Flanders, “Writing, Reception, Intertextuality: Networking Women’s Writing,” Journal of Medieval and Early Modern Studies 50:1: 161-180. (Sarah and Julia also taught the course in which I produced this project, but I’m not biased: this article was very helpful!)
- For a valid criticism of topic modeling in DH, see Benjamin Schmidt, “Words Alone: Dismantling Topic Models in the Humanities,” Journal of Digital Humanities 2:1, Winter 2012, http://journalofdigitalhumanities.org/2-1/words-alone-by-benjamin-m-schmidt/#appendix
- For how to use the very simple “Topic Modeling Tool” that runs on MALLET with a one-click interface, see Miriam Posner, “Very basic strategies for interpreting results from the Topic Modeling Tool,” Miriam Posner’s Blog, October 29, 2012.
Process and Results
These are the twenty topics that MALLET generated from my corpus. MALLET does not generate topic labels; the researcher has to decide what those are. I’ve indicated at the start of each line what I assessed these topics to be. The numbers in parentheses are the topic numbers; they are arbitrary identifiers, not a ranking from most to least important. The decimal values indicate how prominent each topic is across the entire corpus. Most fall between 0 and 1, but they are not probabilities: as I understand it, MALLET reports each topic’s Dirichlet weight here, which grows for topics that run through most of the corpus, so the most common discourses can exceed 1. I still have to explore this further.
Picketing (0) 0.06643 pickets napalm mrs dow picket fighting workers chemical library ceremony picketing board exchange company orange gruening unit wing plastic ion
Demonstrations (White House) (1) 1.05581 war vietnam peace viet demonstrators white american house men front 000 nam group ing signs street yesterday end people young
Rallies (2) 0.19736 rally square hiroshima power veterans burn staged rev place speakers man draft 500 park johnson gathered senator times decision thomas
Counterculture (3) 0.02398 drug love party political days san camel sopwith francisco smart market lsd doves eleven harry walkout story telegraph make working
Pacifism (4) 0.08888 draft pacifists cards mr card pacifist muste law hecklers department lot 27 fine prison burned involvement kiger peter element bearing
California Protests (5) 0.09639 berkeley oakland march marchers train parade crowd california army angels friday stop club back city mile miles line moved front
Student Protests/Draft (6) 0.18946 students draft university college student test school sit high faculty colleges society chicago selective tests deferment class building began stu
Protest at Ceremonies (7) 0.02509 award student goldberg mcnamara degree scholarship secretary faculty graduate robert memorial outstanding honorary commencement amherst school students doctor senior defense
Courts and Law (8) 0.03968 court luce judge ban avenue duffy criminal tower artists di department 2d art defendants night laub 47th cuba castaldi ar
Marches (Location) (9) 0.15515 avenue street parade march square marchers women pickets marched park times carried ave wearing patriots center dubois spectators star yellow
Police/Arrests (10) 0.39567 police demonstrators arrested street began marchers time 2 members streets held demonstra left area youth demon side arrests dem set
Tax Resistance (11) 0.06387 washington taxes revenue internal money federal prof advertisement ad coffee muste petition collect prepared baez singer folk government statement pacifist
Protest Coordination (12) 1.34564 vietnam protest war mr committee york group day university policy today members president students united march states demonstration times ing
House Hearings (13) 0.03287 thc hearing pool gordon room uc huac court witness plp rep judge house proc luce witnesses jeffrey revolution staff audience
March Leaders (14) 0.06802 march sane gottlieb dr washington king peace spock flags bond thomas monument capital pledge convention senate rev con candidates georgia
Cities and Bases (15) 0.16492 boston base protests jail army atlanta cities terms county sit 20 court anti rallies papers seized cells 65 weekend guard
Academia and Artists (16) 0.05954 amherst hunger strike read college teach poets bly faculty portland ray fast began william ins winner writers major johnson feinglass
Vigils (17) 0.07197 fast july clergy ball bombing vigil church clergymen rev resumption seek north 4 square fighting truman bell independence mcnamara protestant
Capitol Protests (18) 0.06086 capitol lynd jail dellinger powell grounds sentences staughton moses 350 crittenden mississippi parris paint probation professor 800 pleaded days court
Immolation (19) 0.04156 people morrison tonight death answer wife stern struggle hotel aggression quaker security purpose lives battle treaty baltimore suicide services pentagon
Some labels were obvious from the words themselves (“pickets” clearly points to “Picketing”), but for others I had to do close reading. For example, for “Protest at Ceremonies” (topic 7) I knew from my corpus that students at Amherst College and New York University walked out in protest when Defense Secretary Robert McNamara received honorary degrees at each institution’s commencement in 1966. For “March Leaders” (topic 14), I knew that articles in my corpus referred to the following leaders, whose names are closely associated in the topic: Rev. Dr. Martin Luther King, Jr., Benjamin Spock (famous author of a childrearing manual, turned anti-war activist), Norman Thomas (former Socialist Party candidate for president), and Sanford Gottlieb (march coordinator for the National Committee for a Sane Nuclear Policy, or SANE). Deciding on each topic label was an iterative process of looking back at the texts themselves while remaining faithful to the clusters that MALLET generated. In an earlier iteration of this project, I intended to select only the few topics that were most coherent and interesting. I have since decided to present all of the topics, messy or not, after discussions with NULab colleagues and my NULab Research Fellowship advisor, Laura Nelson. Doing this acknowledges that topic modeling is more random, and less crystal-clear, than one might initially anticipate.
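The topic list and labeling step above can be sketched in a few lines of Python. This assumes MALLET’s topic-keys output is tab-separated (topic number, weight, then the top words), and the function names and the two sample lines, abbreviated from topics 0 and 14 above, are only for illustration.

```python
def parse_topic_keys(text):
    """Parse topic-keys text (assumed: id <TAB> weight <TAB> words) into dicts."""
    topics = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        topic_id, weight, words = line.split("\t", 2)
        topics.append({
            "id": int(topic_id),
            "weight": float(weight),
            "words": words.split(),
        })
    return topics


def attach_labels(topics, labels):
    """Add the researcher's label to each topic; unlabeled topics get a placeholder."""
    return [dict(t, label=labels.get(t["id"], f"Topic {t['id']}")) for t in topics]


# Two lines abbreviated from the topic list above.
sample = (
    "0\t0.06643\tpickets napalm mrs dow picket\n"
    "14\t0.06802\tmarch sane gottlieb dr washington king peace spock\n"
)
# Labels are the researcher's judgment calls, not something MALLET produces.
labels = {0: "Picketing", 14: "March Leaders"}

labeled = attach_labels(parse_topic_keys(sample), labels)
for topic in labeled:
    print(f"{topic['label']} ({topic['id']}) {topic['weight']}: {' '.join(topic['words'])}")
```

Keeping the labels in a separate dictionary makes it easy to revise them as the close reading evolves, without touching MALLET’s output.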
Change over time
When designing my visualizations, I was inspired by Robert Nelson’s Mining the Dispatch, a topic model of the entire run of Richmond, Virginia’s Daily Dispatch newspaper during the Civil War. Nelson not only identified topics, such as fugitive slave advertising; he also used line graphs to represent the proportion of each topic in his corpus by month and year.
To create my own visualizations, I used the CSV data generated by the one-click Topic Modeling Tool referenced above. This CSV recorded how prominently every topic appeared in each of my documents. Once I named my topics, I changed the column headings accordingly. Because I named each text file according to a year-month-day convention, I could produce line graphs that tracked the frequency of each topic over time.
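Here is a minimal sketch of that step, assuming a CSV with one row per document, a filename column, and one renamed column per topic, and assuming each file name begins with YYYY-MM-DD. The column layout, the file names, and the numbers in the sample are invented for illustration; the Topic Modeling Tool’s actual output may be arranged differently.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime


def monthly_topic_means(csv_text, filename_col="filename"):
    """Average each topic column per calendar month, keyed by 'YYYY-MM'."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumed convention: the file name starts with YYYY-MM-DD.
        date = datetime.strptime(row[filename_col][:10], "%Y-%m-%d")
        month = date.strftime("%Y-%m")
        counts[month] += 1
        for col, value in row.items():
            if col != filename_col:
                sums[month][col] += float(value)
    return {
        month: {col: total / counts[month] for col, total in sums[month].items()}
        for month in sorted(sums)
    }


# Invented sample rows; real values would come from the tool's CSV.
sample_csv = """filename,Vigils,Student Protests/Draft
1965-03-10_vigil.txt,0.40,0.10
1965-03-25_teach-in.txt,0.00,0.30
1965-05-02_draft-test.txt,0.20,0.60
"""

monthly = monthly_topic_means(sample_csv)
for month, topics in monthly.items():
    print(month, topics)
```

Averaging by month is one reasonable choice when several articles share a month; summing would instead reward months with more coverage. The resulting month-by-topic table can be handed to any charting tool to draw the line graphs.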
For these line graphs, I focused exclusively on headline news between March 1964 (the earliest month/year in my corpus) and December 1965. Out of my 118 text files, 44 were from 1964 and 1965 (CSV). I narrowed the scope of my visualizations to make them more useful to readers. When I attempted to visualize the entire corpus, or even just the year 1966, the line graph was one giant blur. I mostly excluded topics where the graph only showed a single spike in one month, but kept one example of that below. In a future stage of this project, I will keep working with visualization tools to improve my ability to represent a much larger corpus.
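The two filtering choices described above (limiting the date range and setting aside topics that spike in only one month) can be sketched as follows. The 0.05 threshold for what counts as a spike, the function names, and the sample values are all assumptions for illustration, not figures from this project.

```python
def filter_months(monthly, start="1964-03", end="1965-12"):
    """Keep months within [start, end]; 'YYYY-MM' strings sort chronologically."""
    return {m: v for m, v in monthly.items() if start <= m <= end}


def single_spike_topics(monthly, threshold=0.05):
    """Return topics whose value exceeds the threshold in exactly one month."""
    active_months = {}
    for values in monthly.values():
        for topic, value in values.items():
            active_months[topic] = active_months.get(topic, 0) + int(value > threshold)
    return sorted(t for t, n in active_months.items() if n == 1)


# Invented monthly averages, keyed by 'YYYY-MM'.
sample_monthly = {
    "1964-08": {"Marches (Location)": 0.30, "Tax Resistance": 0.00},
    "1965-03": {"Marches (Location)": 0.25, "Tax Resistance": 0.40},
    "1965-10": {"Marches (Location)": 0.35, "Tax Resistance": 0.01},
    "1966-05": {"Marches (Location)": 0.20, "Tax Resistance": 0.50},
}

in_range = filter_months(sample_monthly)
print(sorted(in_range))               # months kept for the graphs
print(single_spike_topics(in_range))  # candidates to leave out of the graphs
```

Flagging single-spike topics rather than deleting them keeps the decision with the researcher, since a lone spike can still point to an interesting question.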
See individual photos of the graphs here.
Some conclusions that become very clear:
For Student Protests/Draft, the frequencies increase significantly starting in March 1965, after President Johnson’s announcement the previous month that the military would be conscripting 50,000 people for Vietnam.
The Protest Coordination topic has frequencies almost always above 0 because the headline news coverage regularly focused on the leaders and personalities involved in planning anti-war protests, especially notables like Yale historian Staughton Lynd, Dr. King, or David Dellinger (later one of the Chicago Seven).
With the Marches (Location) topic, there is an uptick in reporting about the many marches taking place in the streets and parks of cities like New York after the Gulf of Tonkin Resolution in August 1964 officially committed the US to fighting a war in Vietnam (to say nothing of the US military’s longer involvement during the 1950s).
The other graphs may not suggest immediately significant conclusions, but that’s because visualizations alone do not tell the story. The researcher always has to connect the visualization to the texts themselves, but the visuals operate as useful representations of trends in the corpus (especially when a corpus is very large, which makes close reading one-by-one impossible).
While I excluded topics where the graphs did not have an interesting trend line (due to the limitations of my corpus), I’ve included one below because it suggests interesting future research questions:
This graph, representing vigils, has two spikes, in March and May 1965, and no occurrences elsewhere. This may be an issue with the corpus, or with ending my visualizations at December 1965. But anti-war vigils might truly have been less frequent (or less reported on) than marches and rallies between 1964 and 1965. One of the most interesting vigils, held by SDS members half a mile away from President Johnson’s Texas ranch, took place in December 1966. If I expand the visualization to 1966, I may find other vigils leading up to that moment that made the tactic popular among non-religious organizations as well as clergy opposed to the Vietnam War.
Investigations such as these demonstrate topic modeling’s potential as a tool for examining change over time, and for understanding more clearly how mainstream news outlets chronicle social movement histories.
Thanks for reading!