Dissertation Poetry Data

See CIIR/Downloads/Poetry for the datasets created as part of my dissertation. This includes the largest publicly available collection of poetry in the world as of May 2019: half a million pages containing poetry, drawn from 50,000 scanned books. I plan to make larger datasets available; the only obstacle is limited hosting space, so send me an email if you are interested.

Wikipedia Year Facts

  • Originally presented in "Retrieving Time from Scanned Books" at ECIR 2015.
  • This dataset contains 40,000 bullet points mined from English Wikipedia year pages (such as the page for 1942), extracted from the June 2013 XML dump.
  • (3.3M) ecir15.wiki-year-facts.json.gz
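
A minimal sketch for reading the file in Python. The internal layout is not described here, so this assumes the archive holds either a single JSON document or one JSON value per line. `load_json_gz` is a hypothetical helper, not part of any released code, and it makes no assumptions about the fields inside each record.

```python
import gzip
import json
from typing import Any, List


def load_json_gz(path: str) -> List[Any]:
    """Load records from a gzip-compressed JSON file.

    Assumption: the file is either one JSON document (for example a list)
    or JSON Lines (one JSON value per line). The real layout of
    ecir15.wiki-year-facts.json.gz is not documented on this page.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fp:
        text = fp.read()
    try:
        # Case 1: the whole file is a single JSON document.
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        # Case 2: fall back to JSON Lines, skipping blank lines.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Calling `load_json_gz("ecir15.wiki-year-facts.json.gz")` on the downloaded file should return a list of records; inspecting the first one is probably the quickest way to learn the actual field names.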

Entity Judgments for Robust and Clue12 Queries