Dissertation Poetry Data
See CIIR/Downloads/Poetry for the datasets created as part of my dissertation. This includes the largest publicly available collection of poetry in the world as of May 2019: half a million pages with poetry on them from 50,000 scanned books. I plan to make larger datasets available; the only obstacle is limited hosting space, so send me an email if you are interested.
Wikipedia Year Facts
- Originally presented in Retrieving Time from Scanned Books at ECIR 2015.
- This dataset contains 40,000 bullet points mined from English Wikipedia year pages (such as 1942), extracted from the June 2013 XML dump.
- (3.3M) ecir15.wiki-year-facts.json.gz
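A minimal sketch of reading the gzipped file in Python. The internal layout of the file is not described here, so this assumes it is either a single JSON document or one JSON object per line; the helper name load_json_gz is hypothetical, and the record fields are left uninterpreted.

```python
import gzip
import json


def load_json_gz(path):
    """Load records from a gzipped JSON file.

    Assumption: the file holds either one JSON document or one JSON
    object per line; the real layout is not documented on this page.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        text = handle.read()
    try:
        # First try to parse the whole file as a single JSON document.
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to JSON Lines: parse each non-empty line separately.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
```

After downloading the file, calling load_json_gz("ecir15.wiki-year-facts.json.gz") should return the parsed records, which can then be inspected to see which fields each bullet point carries.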
Entity Judgments for Robust and Clue12 Queries
- Detailed Description and Download
- Originally presented in Improving Entity Ranking for Keyword Queries at CIKM 2016.
- Extended version of an original dataset from Schuhmacher, Dietz and Ponzetto 2015
- (108K) clue12.mturk.qrel
- (52K) robust.mturk.qrel
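A minimal sketch of loading one of the .qrel files in Python. It assumes the files follow the common TREC qrels layout of four whitespace-separated columns (query id, iteration, judged item id, integer relevance label); that layout is an assumption here, so check it against the detailed description before relying on it. The helper name read_qrel is hypothetical.

```python
from collections import defaultdict


def read_qrel(path):
    """Parse a qrel file into {query_id: {item_id: label}}.

    Assumption: TREC-style lines of the form
    "query_id iteration item_id label" with an integer label.
    """
    judgments = defaultdict(dict)
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 4:
                continue  # skip blank lines or lines in another layout
            query_id, _iteration, item_id, label = fields
            judgments[query_id][item_id] = int(label)
    return dict(judgments)
```

With the files downloaded, read_qrel("robust.mturk.qrel") would give a per-query dictionary of judged entities and their labels, provided the layout assumption holds.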