Improving Entity Ranking for Keyword Queries: Dataset

This website presents the dataset introduced in Improving Entity Ranking for Keyword Queries, by John Foley, Brendan O'Connor, and James Allan, which appeared in CIKM 2016.

The collected judgments, in TREC qrel format, represent whether a given entity is relevant to the query. Queries are drawn from the TREC Robust track and the TREC Web track.
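As a minimal sketch of reading such judgments, the snippet below parses the standard four-column TREC qrel layout (query ID, an iteration field that is usually ignored, the judged item ID, and the relevance label). The function name `parse_qrels` and the sample query and entity identifiers are hypothetical; the form of the entity identifiers in this dataset is not described here.

```python
from collections import defaultdict

def parse_qrels(lines):
    """Parse TREC qrel lines of the form: query_id iteration item_id relevance."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        query_id, _iteration, entity_id, relevance = parts
        qrels[query_id][entity_id] = int(relevance)
    return qrels

# Hypothetical sample lines; real files would be read with open(path).
sample = [
    "301 0 Example_Entity_A 1",
    "301 0 Example_Entity_B 0",
    "302 0 Example_Entity_C 1",
]
qrels = parse_qrels(sample)
relevant_for_301 = [e for e, rel in qrels["301"].items() if rel > 0]
```

Keying by query first makes it easy to look up all judged entities for one query when scoring a ranking.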
Because of space constraints in the published paper, we give an extended explanation of how this dataset was collected below, including a few additional details not present in the paper.

We pooled our best non-learning-to-rank models (along with runs from prior work) and evaluated all methods fully to a depth of five. We did this in two stages.

Initially, we created a pilot evaluation set. Graduate students in our lab created additional judgments while we were first testing our new techniques, labeling only those documents that were entirely new to our approach, as nearly all of our top-5 documents were unjudged. This pilot set consisted of 609 judgments for the ClueWeb12 queries and 322 judgments for the Robust04 queries. Because our goal here was to cover our unjudged techniques rapidly and confirm that our system had some promise, we did not collect overlapping annotations.

Because our additional data and the original judgments lack agreement data, we took the untuned runs from all methods with $k=100$ feedback documents, and pooled them with runs from prior work to a depth of five.
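Pooling to a fixed depth can be sketched as follows. This is an illustration under assumed data structures, not the code used to build the dataset: each run is taken to be a dictionary from query ID to a ranked list of entity IDs, and the names `pool_runs`, `run_a`, and `run_b` are hypothetical.

```python
def pool_runs(runs, depth=5):
    """Union the top-`depth` results of every run, per query."""
    pool = {}
    for run in runs:
        for query_id, ranking in run.items():
            # Only the first `depth` ranked items of each run enter the pool.
            pool.setdefault(query_id, set()).update(ranking[:depth])
    return pool

# Hypothetical runs: query ID -> ranked entity IDs.
run_a = {"q1": ["e1", "e2", "e3", "e4", "e5", "e6"]}
run_b = {"q1": ["e2", "e7"], "q2": ["e9"]}
pool = pool_runs([run_a, run_b], depth=5)
```

Because the pool is a set per query, an entity returned by several runs is judged only once.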

In order to obtain diversity in annotators, we had workers on Amazon's Mechanical Turk judge the (query, document) pairs from pooled runs. We limited workers to those having achieved Master's qualification (Lifetime Approval rate >=95% and more than 1000 jobs completed), and offered $0.08 per label.

We provided workers with the title and the description for each query, along with a warning that the results they were seeing were not ranked in any way, and asked them to judge five entities (based on title, abstract, and Wikipedia link) for a query in a single task. Workers were allowed to indicate that they could not determine the correct judgment, or to skip the question. We marked these cases, pessimistically, as non-relevant. Because we presented five entities at a time, some entities were judged by multiple workers in order to fill pages, which also allowed us to calculate agreement cheaply. We calculated inter-worker agreement as well as worker agreement with the original relevance set and our "pilot" reference set.
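The exact agreement measure is not stated here. One plausible reading of figures of the form "75% of 48" is simple percent agreement over the items that two sources both labeled; the sketch below computes that, under that assumption. The function name `percent_agreement` and the sample labels are hypothetical.

```python
def percent_agreement(labels_a, labels_b):
    """Return (fraction agreeing, number of shared items) for two label sets.

    Each argument maps a (query_id, entity_id) pair to a binary label.
    """
    shared = labels_a.keys() & labels_b.keys()
    if not shared:
        return None, 0  # no overlap, agreement is undefined
    agree = sum(1 for key in shared if labels_a[key] == labels_b[key])
    return agree / len(shared), len(shared)

# Hypothetical labels from two sources.
source_a = {("q1", "e1"): 1, ("q1", "e2"): 0, ("q1", "e3"): 1}
source_b = {("q1", "e1"): 1, ("q1", "e2"): 1, ("q2", "e9"): 0}
rate, overlap = percent_agreement(source_a, source_b)
```

Reporting the overlap size alongside the rate matters here, since only a small share of items were judged by more than one source.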

Information about the collected data is available in the table below. The final relevance set was built by majority vote, with ties broken toward relevance (if one person judged an entity relevant and another disagreed, we treated it as relevant, to be conservative). Different tie-breaking choices did not affect the relative rankings of systems.
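The voting rule described above can be sketched as follows, assuming binary labels in which "could not determine" and skipped responses have already been mapped to 0 (non-relevant). The function name `majority_vote` and the sample judgments are hypothetical.

```python
from collections import defaultdict

def majority_vote(judgments):
    """Combine (query_id, entity_id, label) triples into one label per pair.

    Ties are broken toward relevance.
    """
    votes = defaultdict(list)
    for query_id, entity_id, label in judgments:
        votes[(query_id, entity_id)].append(label)
    final = {}
    for key, labels in votes.items():
        relevant = sum(1 for label in labels if label > 0)
        non_relevant = len(labels) - relevant
        # ">=" rather than ">" is what sends ties to relevant.
        final[key] = 1 if relevant >= non_relevant else 0
    return final

# Hypothetical judgments: a 1-1 tie, a 1-2 split, and a single vote.
judgments = [
    ("q1", "e1", 1), ("q1", "e1", 0),
    ("q1", "e2", 1), ("q1", "e2", 0), ("q1", "e2", 0),
    ("q1", "e3", 0),
]
final = majority_vote(judgments)
```

Switching the comparison to a strict `>` would give the opposite tie-breaking choice, which is an easy way to check how sensitive an evaluation is to this decision.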

Analysis of Judgments Collected Locally and on Mechanical Turk

                            Robust04      ClueWeb12
Queries                     25            22
Original Judgments          1250          3306
Pilot Judgments             322           609
MTurk Judgments             730           1085
Unique MTurk Judgments      686           1021
MTurk-MTurk agreement       75% of 48     92% of 64
Original-MTurk agreement    76% of 213    80% of 334
Pilot-MTurk agreement       78% of 309    80% of 427
Final Judgments             1946          4353

There are roughly twenty-five queries per corpus, derived from the original document-retrieval tasks. We collected some judgments for additional queries, but excluded them from our final results in order to maintain valid comparisons with prior work. Removing the additional judgments from our training procedure and our aggregate measures had no significant effect on results, although it improved our confidence in the generalization of our results on an otherwise small evaluation set.

This dataset extends the one first introduced in: Michael Schuhmacher, Laura Dietz, and Simone Paolo Ponzetto. Ranking Entities for Web Queries through Text and Knowledge. Proc. of CIKM 2015. That dataset is still available from their website: Original Dataset