This website presents the dataset introduced in Improving Entity Ranking for Keyword Queries, by John Foley, Brendan O'Connor, and James Allan, which appeared in CIKM 2016.
We pooled our best non-learning-to-rank models (along with runs from prior work) and evaluated all methods fully to a depth of five. We did this in two stages.
First, we created a pilot evaluation set: graduate students in our lab created additional judgments while we were initially testing our new techniques, labeling only the documents that were entirely new to our approach, since nearly all of our top-5 documents were unjudged. This pilot set consisted of 609 judgments for the ClueWeb12 queries and 322 judgments for the Robust04 queries. Because our goal at this stage was to cover our unjudged results quickly and confirm that our system showed promise, we did not collect overlapping annotations.
Because neither this additional data nor the original judgments include agreement information, we took the untuned runs from all methods with $k=100$ feedback documents and pooled them with runs from prior work to a depth of five.
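To make the pooling step concrete, here is a minimal Python sketch of depth-5 pooling. The run format and function name are assumptions for illustration, not the actual evaluation code: each run is taken to map a query id to a ranked list of entity ids, and the pool is the union of every run's top five.

```python
from collections import defaultdict

def pool_runs(runs, depth=5):
    """Union the top-`depth` results of every run, per query.

    `runs` is a list of rankings, each mapping query_id -> list of
    entity ids ordered by decreasing score (format assumed here).
    Returns query_id -> set of entity ids to send for judgment.
    """
    pool = defaultdict(set)
    for run in runs:
        for query_id, ranking in run.items():
            pool[query_id].update(ranking[:depth])
    return pool

# Example: two toy runs over one query; the pool is the union of their top 5.
run_a = {"q1": ["e1", "e2", "e3", "e4", "e5", "e6"]}
run_b = {"q1": ["e3", "e7", "e8", "e9", "e10", "e11"]}
print(pool_runs([run_a, run_b], depth=5))
# {'q1': {'e1', 'e2', 'e3', 'e4', 'e5', 'e7', 'e8', 'e9', 'e10'}} (set order may vary)
```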
In order to obtain diversity in annotators, we had workers on Amazon's Mechanical Turk judge the (query, entity) pairs from the pooled runs. We limited workers to those holding the Masters qualification (lifetime approval rate ≥95% and more than 1,000 jobs completed), and offered $0.08 per label.
We provided workers with the title and the description for each query, along with a warning that the results they were seeing were not ranked in any meaningful order, and asked them to judge five entities (based on title, abstract, and Wikipedia link) for a query in a single task. Workers were allowed to indicate that they could not determine the correct judgment, or to skip the question; we marked these cases, pessimistically, as non-relevant. Because we presented five entities at a time, some entities were judged by multiple workers due to the need to fill pages, which also allowed us to calculate agreement cheaply. We calculated inter-worker agreement as well as worker agreement with the original relevance set and with our pilot reference set.
Information about the collected data is available in the table below. The final relevance set was produced by majority voting, with ties broken toward relevance: if one worker thought an entity was relevant and another disagreed, we conservatively assumed it was relevant (see the sketch after the table). Different tie-breaking choices did not affect the relative rankings of systems.

|                          | Robust04   | ClueWeb12  |
|--------------------------|------------|------------|
| Queries                  | 25         | 22         |
| Original Judgments       | 1250       | 3306       |
| Pilot Judgments          | 322        | 609        |
| MTurk Judgments          | 730        | 1085       |
| Unique MTurk Judgments   | 686        | 1021       |
| MTurk-MTurk agreement    | 75% of 48  | 92% of 64  |
| Original-MTurk agreement | 76% of 213 | 80% of 334 |
| Pilot-MTurk agreement    | 78% of 309 | 80% of 427 |
| Final Judgments          | 1946       | 4353       |
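Below is a minimal sketch of the aggregation described above: majority voting with ties broken toward relevance, plus the simple percent agreement used in the agreement rows of the table. The binary label encoding and function names are assumptions for illustration, not the code used to build the dataset.

```python
from collections import Counter

def majority_vote(labels, tie_break_relevant=True):
    """Aggregate binary labels (1 = relevant, 0 = non-relevant) for one
    (query, entity) pair. Ties are broken toward relevance by default,
    matching the conservative choice described above."""
    counts = Counter(labels)
    if counts[1] > counts[0]:
        return 1
    if counts[1] < counts[0]:
        return 0
    return 1 if tie_break_relevant else 0

def percent_agreement(pairs):
    """Simple percent agreement over (label_a, label_b) pairs of judgments
    on the same (query, entity) pair."""
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)

# One worker says relevant, another says non-relevant: the tie goes to relevant.
print(majority_vote([1, 0]))                                # 1
print(percent_agreement([(1, 1), (1, 0), (0, 0), (1, 1)]))  # 75.0
```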
There are roughly twenty-five queries per corpus, derived from the original document-retrieval tasks. We actually collected judgments for some additional queries, but excluded them from our final results in order to maintain valid comparisons with prior work in the following sections. Removing these additional judgments from our training procedure and our aggregate measures had no significant effect on results, though this check improved our confidence in the generalization of our results on an otherwise small evaluation set.
This dataset extends the one first introduced in: Michael Schuhmacher, Laura Dietz, and Simone Paolo Ponzetto. Ranking Entities for Web Queries through Text and Knowledge. In Proceedings of CIKM 2015. It is still available from their website: Original Dataset.