TL;DR: Lessons learned from TREC 2018 News BG Linking

  • RM1 for a single document is just a LM.
  • My multi-term variants of RM based on SDM2 didn’t affect scores.
  • BM253 has a slight edge on Query-Likelihood in this collection (default parameters all around); but probably no significance to this difference.
  • Documents being in the past is a REALLY GOOD feature (NDCG@5 0.42 vs 0.35).

More Details

Last year I participated in the TREC 2018 News Track. Specifically, the background-linking task:

Given a news story, retrieve other news articles that provide important context or background information. The goal is to help a reader understand or learn more about the story or main issues in the current article using the best possible sources.

Since it was the very first year, and there were no relevance judgments available, I threw together a few approaches based on Relevance Modeling1 (aka RM). RM is typically used as an excellent query-expansion baseline: a generative model of relevance where top-ranking documents are used to estimate a language model of relevance which is then sampled to generate a large weighted query.

To be completely precise, RM defines the probability of a term \(t\) with respect to a query \(Q\) as follows:

\[ P(t|Q) \approx \frac{\sum_{d \in D_R} P(t|d) P(Q|d) P(d)} {\sum_{t} \sum_{d \in D_R} P(t|d) P(Q|d) P(d)} \]

We typically ignore the rank equivalent but hard to define probability of the document \(P(d)\), and just normalize the remainder: something like the following since we only care about weighting terms \(t\) that occur in any document:

\[ \text{Weight}(t) = \frac{1}{Z} \sum_{d \in D_R} P(w|Q) P(w|d) \]

You then take some number of highly-weighted terms (where the domain of terms are all those occurring in the relevant documents: \(D_R\) ) excepting those that occur in a stopword list, like the inquery 418 stopword list or the web-focused galago rmstop.

In the Background Linking task of TREC News, we don’t have a query \(Q\) but we do have a single relevant document: \(D_Q\) making our relevant set a single document \(D_R = { D_Q } \). We ignore the score \( P(Q \vert d) \) because we don’t have a query to score that document.

We’ve effectively just built a language model out of the seed document.

I was interested in looking at efficient alternatives to traditional retrieval models and dependencies last year, but the model that succeeded the best was one that took the top 50 terms and generated a weighted BM25 query based upon this setup.

Relevance Modeling Pseudocode

What does all that math look like in code? (Using some utilities from my Irene Query Language)

// The source query document is represented by 
val tokens: List<String> = getSourceDocumentTerms(qid)
// Initialize some data:
val counts = TObjectIntHashMap<String>()
val length = tokens.size.toDouble()
// Collect non-stopword terms:
for (t in tokens) {
    if (stopwords.contains(t)) continue
    counts.adjustOrPutValue(t, 1, 1)
// Collect highest scoring terms by ratio:
val weighted_terms = ArrayList<WeightedTerm>(weights.size())
counts.forEachEntry {term, count ->
    weighted_terms.add(WeightedTerm(count / length, term)
// Take highest-scoring terms:
val top_terms = weighted_terms.sorted().take(k).normalized()
// Make a BM25 query with term-weights:
val expr = SumExpr( { BM25Expr(it.term).weighted(it.weight) })

Duplicate Removal and Kicker Rejection

The rest of any “secret sauce” in my runs comes from temporal filtering and removal of duplicates. All my ranked lists were filtered as follows:

val skipTheseKickers = setOf("Opinions", "Letters to the Editor", "Opinion", "The Post's View")
val time = doc.published_date ?: 0L
assert(time >= 0L)

// See above math & pseudocode:
val rmExpr = documentVectorToQuery(tokens, NumTerms, targetField=index.defaultField)

// Filter out documents published after the query document:
// This turns out to be a signficant improvement!
val publishedBeforeExpr = LongLTE(DenseLongField("published_date"), time)
val finalExpr = RequireExpr(AndExpr(listOf(rmExpr, publishedBeforeExpr)), rmExpr).deepCopy()

// Get double-the required number of documents for filtering by type and title.
val results =, BackgroundLinkingDepth*2)

// Collect titles & kickers:
val titles = hashMapOf<Int, String>()
val kickers = hashMapOf<Int, String>()
for (sd in results.scoreDocs) {
    index.getField(sd.doc, "title")?.stringValue()?.let {
        titles.put(sd.doc, it)
    index.getField(sd.doc, "kicker")?.stringValue()?.let {
        kickers.put(sd.doc, it)
val titleDedup = results.scoreDocs
        // reject duplicates:
        .filter {
            val title = titles.get(it.doc)
            // hard-feature: has-title and most recent version:
            title != null && title != "null" && title != titles.get(it.doc+1)
        .filterNot {
            // Reject opinion/editorial.
            skipTheseKickers.contains(kickers[it.doc] ?: "null")

val recommendations = titleDedup.filter { it.doc != docNo }

Learned Unordered Windows Not Effective

I tried to additionally learn unordered-window dependencies of size 8, in the style of SDM2, but for various reasons this did not actually permute the ranking. Therefore 2/3 runs submitted were equivalent.

Time Filtering and Scoring Experiment:

Traditionally, people use a language modeling scorer (e.g., Query likelihood with dirichlet smoothing) to go along with RM, because RM itself is a language-modeling approach. However, BM253 has slightly better properties for efficiency: no expensive log operation, missing terms are weighted exactly zero, etc.

Also, there was a discussion of only considering previously published documents as candidates for relevance but the official judgments do not reflect this (more relevant retrieved without filtering) however, this feature really helped with precision at early ranks.

With the judgments on the 2018 data, this is what I get for NDCG@5, varying the scorer and temporal filtering:

Scorer Only Past? NDCG@5 num_rel_ret
QL NO 0.356 919
bm25 NO 0.353 961
QL YES 0.416 915
bm25 YES 0.417 885

As it turns out, my hybrid BM25-based RM queries were slightly more effective last year than my query likelihood run, and the temporal filtering turns out to be VERY EFFECTIVE.

Will this hold up in 2019? We’ll find out over 60 new queries.

  1. Lavrenko, V., & Croft, W. B. (2001, September). Relevance based language models. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 120-127). ACM. (PDF)  2

  2. Metzler, D., & Croft, W. B. (2005, August). A Markov random field model for term dependencies. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 472-479). ACM. (PDF)  2

  3. Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., & Gatford, M. (1995). Okapi at TREC-3. TREC-3 Proceedings at NIST 2