Digital Humanities Day 1 (Notes)

8:30 AM

The Full Spectrum Text Analysis Spreadsheet – David L Hoover

  • Idea – recently, text analysis has been using more and more words (600–4,000). What if we look at words that are neither frequent nor rare, as well as those that are very rare?
  • Burrows’s Zeta and Iota don’t use statistical tests, so they don’t need to be statistically valid.
  • Authors’ style markers
  • Look at wordlist of 30,000 words by Willa Cather and Edith Wharton
  • Counts consistency of use, rather than frequency – maybe consistency is more important?
  • Words sorted by characteristic
    • Cather – until
    • Wharton – till
    • Can see what is characteristic, what is avoided
    • Percentage of texts where a word is present + percentage where it is absent = a ‘distinctiveness score’
    • Graph of vocabulary – remove words that appear only once – then only 500 most frequent words
    • Do you get better results looking at the whole vocabulary? It makes for better rhetoric – “I used all the words”
    • Content vs. style – a lot of overlap…
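The “percentage present + percentage absent” measure above is essentially Burrows’s Zeta. A minimal sketch, with invented toy segments standing in for fixed-length text slices:

```python
def zeta_score(word, author_segments, counter_segments):
    """Zeta-style distinctiveness: fraction of one author's segments
    containing the word, plus fraction of the other author's segments
    lacking it. Ranges from 0 (avoided) to 2 (fully characteristic)."""
    present = sum(word in seg for seg in author_segments) / len(author_segments)
    absent = sum(word not in seg for seg in counter_segments) / len(counter_segments)
    return present + absent

# Toy segments (sets of words) standing in for 2,000-word text slices.
cather = [{"until", "prairie"}, {"until", "song"}, {"land", "until"}]
wharton = [{"till", "society"}, {"till", "garden"}, {"innocence"}]

print(zeta_score("until", cather, wharton))  # 2.0: used by Cather, avoided by Wharton
print(zeta_score("till", cather, wharton))   # ~0.33: Wharton's marker, not Cather's
```

Consistency of use across segments, rather than raw frequency, is what the score captures.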


This is Not a Book: Experimental Literature as a Prototype – Aaron Mauro, UVic

  • Experiment – examines how contemporary authors engage with the digital environment when they publish in print; they anticipate being digitized.
  • Jonathan Safran Foer – Tree of Codes
    • Die-cut 1937 text by Bruno Schulz, who was killed by the Nazis
    • ‘conspiracy of winking’ – ‘visual whispers’
    • Physical paper is what makes this interesting
    • Angle brackets to cite how many pages ‘deep’ a line is. New mark-up use
    • Language of deformation – Lisa Samuels and Jerome McGann – “The end of the end of the book” – an accounting of what happens in the process of digitization
    • Deconstruction as generative – Mauro has made HTML mock ups:
      • Use CSS3 to make a multi-coloured page turn
      • Trying to mimic what can be done with paper as a surface (not always well supported by browsers)
      • Each word group has to be marked up separately – 1100 lines for a ten page citation
      • Work is ‘buried’ – can’t see that the pages have been cut by hand (paperback costs $45)
      • Wanted to make words ‘slide’ slightly as page turned – scaled to 89%

Fine-tuning Stylometric Tools: Investigating Authorship and Genre in French Classical Theatre – Christof Schoch

  • Corpus: French classical drama 32 & 40 plays
  • No tragedies in prose – verse only, so corpus is unbalanced
  • Mix of well known and less known authors
  • Problem: genre, form, author
    • Genre: Tragedy/comedy
    • Form: verse/prose
    • Authorship is not always the strongest category – prose vs. verse is strongest classification
    • Principal Component Analysis – colour scheme for author and type
    • Parts of the word list: three most distinctive words: ces, mort, frère
      • Different distributions over corpus
        • Ces is more in one author than the other
        • Mort is more in both author’s tragedies
        • Frère confuses things
        • Graphs –  cluster analysis pairs – author vs. genre signals
        • Slight trend for genre to take over at around 1100 words – more words mean genre more important than author
        • Doesn’t work for author vs. form
        • Conclusion: you can include plays heterogeneous as to their genre if you don’t go beyond 1,100 words (preliminary suggestion)
        • Don’t include plays that are heterogeneous as to form, even if you only use the top words
        • Clustering and attribution can become more reliable as you include more, and more diverse, texts (important because you can balance your corpus)
        • Individual words seem to make a big difference
        • More at dragonfly.hypotheses.org/424
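The experiment above – varying how many most-frequent words (MFW) feed the clustering and watching which signal dominates – can be sketched roughly as follows. The toy corpus, the word counts, and the Manhattan-distance choice are all illustrative, not Schoch’s actual setup:

```python
from collections import Counter

def mfw_profile(tokens, vocab):
    """Relative frequencies of a fixed most-frequent-word list."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in vocab]

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def distances_at(n_mfw, corpus):
    """Pairwise text distances using only the corpus's top n_mfw words."""
    all_tokens = [t for tokens in corpus.values() for t in tokens]
    vocab = [w for w, _ in Counter(all_tokens).most_common(n_mfw)]
    profiles = {name: mfw_profile(tokens, vocab) for name, tokens in corpus.items()}
    names = list(corpus)
    return {(a, b): manhattan(profiles[a], profiles[b])
            for i, a in enumerate(names) for b in names[i + 1:]}

# Toy texts echoing the distinctive words from the talk (ces, mort, frère).
corpus = {
    "author1_tragedy": "ces ces mort mort frère de la".split(),
    "author1_comedy":  "ces ces de la la il".split(),
    "author2_tragedy": "mort mort mort frère de il".split(),
}
for pair, d in distances_at(3, corpus).items():
    print(pair, round(d, 3))
```

Rerunning `distances_at` with growing `n_mfw` is the shape of the experiment: does the closest pair share an author or a genre as the word list lengthens?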

Are Google’s Linguistic Prostheses Biased Towards Commercially More Interesting Expressions? – Anna Jobin and Frederic Kaplan

  • Commodification of words: search ‘holiday’ and get many holiday ads; search is free because ads bring in $3 billion a month
  • Words come with a suggested bid-value for advertisers
  • There are expensive and cheap words, with seasonal fluctuations (e.g. ski holiday)
  • Commodified words are a specific lexicon – not all words are commodified, some commodified words are in no dictionary (misspelled words)
  • All commodified words form a lexicon, one for each language and for each platform – Googlish, Googlais, Googlisch
  • Linguistic prosthesis: “did you mean” autocorrect, related searches, autocompletion
    • Autocompletion – algorithm shows up while you’re typing
    • Mediate between our thoughts and writing, between our intentions and their expression
    • Relationship between commodified lexicons and linguistic prostheses
    • Impact of autocompletion algorithms –
      • A(x) -> {s}
        • Association between a string x in the query field and a group of strings (autocomplete)
  • Optimal experiment design algorithms (for situations where asking questions in an environment is costly) – find criteria to optimize the amount of information acquired by queries
  • Case study – around the word holiday, 790 search terms, 7756 non-redundant search queries
    • Constant location, no optional user info, any match mode
    • Got ~11,000 suggestions; most are expressions rather than words; no suggestions for some autocompletions of ‘adult’
    • Majority of expressions above minimal bid (don’t know, but suspect economic activity around all autocomplete)
    • May outsource, distribute access.
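The A(x) → {s} mapping can be modeled as a prefix index over suggested expressions. The bid values and the ranking-by-bid below are a toy assumption echoing the talk’s suspicion of economic weighting, not Google’s actual algorithm:

```python
from collections import defaultdict

class Autocomplete:
    """Minimal model of A(x) -> {s}: a query prefix maps to a ranked
    list of suggested expressions (here ranked by a toy bid value)."""

    def __init__(self, expressions):
        # expressions: {suggestion: hypothetical bid value}
        self.index = defaultdict(list)
        for expr, bid in expressions.items():
            for i in range(1, len(expr) + 1):
                self.index[expr[:i]].append((bid, expr))

    def suggest(self, prefix, k=3):
        """Top-k suggestions for a prefix, highest bid first."""
        ranked = sorted(self.index.get(prefix, []), reverse=True)
        return [expr for _, expr in ranked[:k]]

# Hypothetical bid values around the word "holiday".
ac = Autocomplete({
    "holiday": 2.0,
    "holiday deals": 3.5,
    "holiday weather": 0.4,
    "ski holiday": 1.8,
})
print(ac.suggest("holi"))  # ['holiday deals', 'holiday', 'holiday weather']
```

A commodified lexicon would show up here as the bid values systematically reordering what the prosthesis offers between our intention and its expression.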

Digitizing Serialized Fiction – Kirk Hess

  • “The Farmer’s Wife” – a farm newspaper
  • There is a LibGuide for Serialized Fiction in the Farm, Field and Fireside Collection
  • Early Farm papers usually had a ‘women’s section’ with a bit of fiction
  • Farmer’s Wife was published 1897–1939; 1906–1939 digitized in Farm, Field and Fireside
  • Digitization process – select newspapers, create page images, segment article
  • Uses Olive ActivePaper (moving to Veridian)
  • Hard to find serialized fiction in newspapers… usually no metadata, OCR is bad, and articles span multiple issues without links between them.
  • OCR issues – crowd-source some OCR corrections in Veridian, not easy to automate
  • Prototype – Omeka/Scripto. Hired 400 students, got 700 serials in the archive
  • Complete story – The Mysterious McCorkles!
  • Looked at Islandora, but took much programming time.
  • How can we prioritize work so that important text is corrected first?
  • How to identify serialized fiction without having to find it manually?
    • Common words in fiction – chapters, ‘to be continued,’ ‘the end’
    • Topic, genre, and theme (aimed at women) – romance, children’s, holidays.
    • Main character’s name repeats often, could be caught? Makes topic modeling harder, but better for identifying fiction.
    • Word frequency using NLTK – tried 50 fiction and 50 non-fiction using top 2000 words. Seemed to work?
    • Veridian gives direct access to articles (indexed in Solr)
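The cue-word idea for flagging fiction (“chapter,” “to be continued,” “the end”) can be sketched as a simple scorer. The cue list and threshold below are illustrative, not from the talk, which used NLTK word-frequency features:

```python
# Cue features for flagging serialized fiction in noisy OCR text.
# The cue list and threshold are illustrative assumptions.
FICTION_CUES = ["chapter", "to be continued", "the end", "said"]

def fiction_score(text):
    """Count how many fiction cues appear in an article's text."""
    lowered = text.lower()
    return sum(cue in lowered for cue in FICTION_CUES)

def looks_like_fiction(text, threshold=2):
    return fiction_score(text) >= threshold

article = "CHAPTER IV. 'Hurry,' said Mrs. McCorkle... To be continued."
notice = "Prices of wheat and corn for the week ending June 3."
print(looks_like_fiction(article))  # True
print(looks_like_fiction(notice))   # False
```

A frequency-based classifier over the top 2,000 words, as in the talk, generalizes this by letting the training data pick the cues.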

10:30

Uncovering Reprinting Networks in Nineteenth Century American Newspapers  – Ryan Cordell, Elizabeth Dillon, David Smith

  • Newspapers freely copied from other papers
  • “The viral texts of the 19th century”
  • What went ‘viral’ tells us about values, priorities and interests of Americans
  • Can we get at this culture of reprinting with digitization?
  • Can’t ‘search’ unless you know the strings to search
  • Dataset – LOC’s Chronicling America project – Antebellum period (1833-1861 and earlier)….about 41,000 issues of 132 newspapers.
  • No article breaks! Need to find aligned passages and build text clusters
  • Not all reprints and copies are interesting – ads, “Notice to Advertisers” etc.
  • Issues – no known boundaries, much longer texts, noisy OCR
  • Algorithm – detect sets of pairs of newspaper issues with overlapping text, find regions of alignment, link issues that align to the same region of the same other issue
  • “The inverted index” – an index of word 5-grams
  • What kinds of texts are we finding? The inaugural address of President Buchanan (is there editorial text around it?), a recipe for making starch (46 reprints), “A Beautiful Reflection,” treaties, the Soldiers’ Bill of 1850 (what makes these a reprint from another paper rather than something that’s distributed?), a weights and measures list, “Interesting statistics”
  • “What makes a text go viral?” Multiple contexts in which these texts have meaning – it’s not that the text is stunning, but that it works for a variety of purpose (‘it has legs’)
  • Modeling the systems of nineteenth-century print culture: using GIS to do comparative mapping for reprinted texts- i.e. articles about slavery do not get re-printed in Midwest while other stuff did
  • How many people live within 5 miles of a paper?
  • Transportation networks – print lines up with the railroad
  • Networks– how do they align with religious and political communities?
  • A lot of central nodes are not in major cities
  • Does the network change based on postal law? Can we meaningfully account for circulation?
  • Lag time between first printing and re-printing – usually peaks 2 weeks after, and then again around 5 years – e.g. articles were reprinted faster during the Mexican War, driven by ‘breaking news’
  • ‘fast’ terms – texas, whig, corpse, government,
  • ‘slow’ terms – love, young, family, sweet (stories and poems reprinted later?)
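The 5-gram inverted index for surfacing issue pairs with overlapping text might look like this sketch; the issue texts and the minimum-shared threshold are invented for illustration:

```python
from collections import defaultdict

def five_grams(tokens):
    """All contiguous word 5-grams of a token list."""
    return [tuple(tokens[i:i + 5]) for i in range(len(tokens) - 4)]

def build_index(issues):
    """Inverted index: each word 5-gram -> set of issue ids containing it."""
    index = defaultdict(set)
    for issue_id, text in issues.items():
        for gram in five_grams(text.split()):
            index[gram].add(issue_id)
    return index

def candidate_pairs(index, min_shared=2):
    """Issue pairs sharing >= min_shared 5-grams: candidates for alignment."""
    shared = defaultdict(int)
    for ids in index.values():
        ids = sorted(ids)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                shared[(a, b)] += 1
    return {pair for pair, n in shared.items() if n >= min_shared}

issues = {
    "ny_tribune_1850": "a recipe for making starch take one pound of starch",
    "ohio_farmer_1850": "recipe for making starch take one pound and boil",
    "boston_post_1850": "notice to advertisers rates have changed this week",
}
print(candidate_pairs(build_index(issues)))  # the two starch-recipe issues
```

The full pipeline would then align the matching regions within each candidate pair and cluster issues that align to the same region, tolerating noisy OCR with fuzzier matching than exact 5-grams.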

Literary Geography at Corpus Scale – Matthew Wilkens

  • Literary geography of American fiction before and after the Civil War.
  • Many narratives with clear geographic implications – how much confidence can we attach to these narratives?
  • Magnitude of changes in ‘periodizing’ historical events
  • The Corpus – 1050 novels ….ack missed this slide
  • Need to pull out locations (Stanford CoreNLP), attach them to geographic info (Google geocoding API)
  • 250,000 instances of 30,000 unique strings
  • Throwing out low-frequency stuff makes the data weaker, but gives you a manageable manual workload – 1,500 strings.
  • Point locations – city level and lower
  • Nations – exotic locations are used less specifically (no one mentions Russian/Chinese cities, but they do mention the countries)
  • About 55% of named places are in the US, 45% are named places in the rest of the World
  • Within the US – differences before and after civil war – on the whole very similar, and this is the ultimate periodizing-event
  • Correlation between where people live and the places that get written about – slowly moving westward (both census and writing), literature moving slightly more southward too. So what is driving deviations?
    • Dunning’s log-likelihood – over-representation in California, New York, Massachusetts, Virginia. Underrepresented – Indiana is the most underrepresented state (compared to population); the Midwest is fully underrepresented.
    • Degree of change in degree of representation – attention shifted away from Texas (shift away from focus on Mexican War), Oregon (end of the Oregon Trail), Minnesota (population shift)
    • Urbanization not depicted in state-level data, but do see it in the size of the cities. Big demographic shift not always represented. Percentage of attention paid to cities is much higher than people living in cities…but fairly steady as more people move there
    • Literature is not forward-looking: Chicago’s population explodes, but literary usage does not until the 1870s.
  • Takeaways –
    • more world attention than expected
    • New England does not dominate as expected
    • Urban centres heavily overrepresented
    • Demography and current events are important
    • Continuity is significant before and after the war
  • New results: dealt with raw counts or population, but now thinking about other variables – lots of historical and economic info at state/county level…manufacturing output, immigrants, slave population, newspaper circulation – how can they be brought into such an analysis?
    • ‘Random forest’ approach – machine learning: a series of decision trees based on random selections of variables; rank variables by the relative increase in error if substituted.
    • Best predictor of literary mentions on the state-level is number of newspapers. (But population of course affects how many papers are published)
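Dunning’s log-likelihood (G²), used above to measure over- and under-representation relative to population, can be computed as follows; the counts are hypothetical:

```python
import math

def log_likelihood(a, b, c, d):
    """Dunning's log-likelihood (G2) for over/under-representation.
    a: mentions of the state, b: mentions of all other states,
    c: state population, d: population of all other states."""
    e1 = c * (a + b) / (c + d)   # mentions expected from population share
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a:
        g2 += a * math.log(a / e1)
    if b:
        g2 += b * math.log(b / e2)
    return 2 * g2

# Hypothetical counts: a state with 5% of the population
# receiving 15% of literary mentions is strongly overrepresented.
g2 = log_likelihood(150, 850, 50_000, 950_000)
print(round(g2, 1))  # 140.5
```

A high G² with observed mentions above the expectation flags overrepresentation (California, New York); a high G² with mentions below it flags the Indiana case.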

My notes from the afternoon are too scattered to share – more a reflection of my own energy level than the presenters’ efforts!
