Clickbait for TREC News

TL;DR: Clickbait seems to help select relevant background articles.

  • A decent clickbait model can be built with no manual feature extraction over news article titles.
  • This feature is useless by itself and when combined with a single ranker, but definitely useful (NDCG@5 0.385 -> 0.421) as part of a fuller model.

Quality Measures and Ranking

It is fairly intuitive that users prefer higher-quality documents, and simple methods for incorporating spam features can be very helpful for traditional ad-hoc web retrieval [1].

So what makes a news quality feature? Maybe the length of an article? Maybe the section it is written in, e.g., opinion vs. finance?


At ECIR 2016, Potthast et al. won best poster (if I recall correctly) for their work on “Clickbait Detection” [2].

While they had a lot of analysis of well-thought-out features, what I was looking for was a dataset with labels.

I found the dataset from Chakraborty et al. [3] via their Clickbait paper’s GitHub repository. After a bit of Python munging to get the data into the appropriate input shape for FastText [4], I was ready to simultaneously learn task-specific word embeddings and a classifier on top of them:

fasttext supervised -input clickbait.train -output model -dim 32 -lr 0.1 -wordNgrams 4 -minCount 1 -bucket 10000 -epoch 20

I kept the defaults for most options, but aimed for a smaller number of dimensions (for efficiency), longer word n-grams, and a low min-count.
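The munging step mentioned above is not shown in this post, so here is a minimal sketch of what it could look like. The one fixed requirement is fastText's supervised input format: one example per line, with the label written as a `__label__` prefixed token before the text. Everything else is an assumption for illustration: the function names, the 10% test split, the seed, and the sample headlines are all hypothetical.

```python
import random

def to_fasttext_line(title, is_clickbait):
    """Format one headline as a fastText supervised training line."""
    label = "__label__clickbait" if is_clickbait else "__label__not_clickbait"
    # Collapse whitespace so every example stays on exactly one line.
    text = " ".join(title.split())
    return f"{label} {text}"

def split_examples(lines, test_fraction=0.1, seed=13):
    """Shuffle and split into (train, test); the fraction and seed are assumptions."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

if __name__ == "__main__":
    # Made-up headlines standing in for the real dataset files.
    examples = [
        ("You Won't Believe What Happened Next", True),
        ("Senate passes annual budget resolution", False),
    ]
    for title, is_clickbait in examples:
        print(to_fasttext_line(title, is_clickbait))
```

Writing the two returned lists to `clickbait.train` and `clickbait.test` would then give files in the shape the commands here expect.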

How did it do?

./fasttext test clickbait_model_4_32.bin data/clickbait.test
N       3173
P@1     0.959
R@1     0.959

Accuracy in the paper is reported as 93% with an SVM-based classifier. Looks like the “so-called” neural approach is doing quite well.

Learning to Rank with Clickbait

After selecting a candidate pool for the TREC News BG Track, we dump the unique set of titles for all queries to titles.txt (we could do this for the whole collection, but that’s a bit expensive for model exploration).

We then ask fasttext for the probabilities of the top-2 classes for each title (clickbait and not_clickbait). Since these are the only two classes and the scores come from the topmost softmax layer, they always add up to 1.0.

./fasttext predict-prob clickbait_model_4_32.bin titles.txt 2 > title_probs
paste titles.txt title_probs > title_clickbait.tsv

With a little help from UNIX, we generate a TSV which can then be trivially loaded into other software and clickbait scores can be associated with documents based on their title field.
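As a rough sketch of the loading step, the following parses each line of title_clickbait.tsv into a title and its class probabilities. It assumes that predict-prob wrote each prediction as alternating label and probability tokens separated by spaces, and that the labels are `__label__clickbait` and `__label__not_clickbait`. The function names and the sample row are hypothetical.

```python
PREFIX = "__label__"

def parse_row(line):
    """Split one TSV line into (title, {label: probability})."""
    # paste joins the files with a tab; split from the right in case a title contains tabs.
    title, probs = line.rstrip("\n").rsplit("\t", 1)
    tokens = probs.split()
    scores = {}
    # Tokens alternate: label, probability, label, probability, ...
    for label, prob in zip(tokens[0::2], tokens[1::2]):
        name = label[len(PREFIX):] if label.startswith(PREFIX) else label
        scores[name] = float(prob)
    return title, scores

def load_clickbait_scores(lines):
    """Map each title to its clickbait probability (0.0 if the label is absent)."""
    table = {}
    for line in lines:
        title, scores = parse_row(line)
        table[title] = scores.get("clickbait", 0.0)
    return table

if __name__ == "__main__":
    # Made-up row in the assumed format.
    sample = ["Ten Tricks Doctors Hate\t__label__clickbait 0.97 __label__not_clickbait 0.03\n"]
    print(load_clickbait_scores(sample))
```

Keying on the title string keeps the join simple, at the cost of treating documents with identical titles as having the same score.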

After a little more work:

Row (R.#)  Feature-Set                 NDCG@5  NDCG
1          Only clickbait              0.016   0.304
2          Only rm-50-bm25             0.344   0.597
3          Only clickbait,rm-50-bm25   0.347   0.595
4          Full (minus clickbait)      0.385   0.627
5          Full (incl. clickbait)      0.421   0.638

Clickbait isn’t obviously helpful for this task: it is nearly useless by itself (R.1). Combined with only one other good feature, it is either indistinguishable or potentially hurting us (R.2 vs. R.3: NDCG@5 0.344 vs. 0.347, NDCG 0.597 vs. 0.595).

However, clickbait yields a sizable improvement in a full learning-to-rank model with around 20 other features (R.5 beats R.4 for NDCG@5, 0.385 -> 0.421).
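For readers unfamiliar with the metric, here is a minimal sketch of one common NDCG@k formulation (linear gain, log2 rank discount). This is an assumption for illustration: the track's official evaluation may use a different gain function, and a real evaluation normalizes against all judged documents for a query, not only the ones that were retrieved. The function names and the optional `judged_gains` parameter are hypothetical.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k relevance gains."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(ranked_gains, k, judged_gains=None):
    """DCG@k divided by the DCG@k of an ideal ordering.

    judged_gains: gains of all judged documents for the query; falls back to
    the ranked list itself when omitted (a simplification).
    """
    pool = ranked_gains if judged_gains is None else judged_gains
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0

if __name__ == "__main__":
    # Made-up relevance gains for one ranked list, in rank order.
    print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=5))
```

Under any such formulation, the move from 0.385 to 0.421 in NDCG@5 is roughly a 9% relative gain.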

More on this full learning-to-rank model in the upcoming TREC Notebook Paper.

  1. Bendersky, M., Croft, W. B., & Diao, Y. (2011, February). Quality-biased ranking of web documents. In Proceedings of the fourth ACM international conference on Web search and data mining (pp. 95-104). ACM.

  2. Potthast, M., Köpsel, S., Stein, B., & Hagen, M. (2016, March). Clickbait detection. In European Conference on Information Retrieval (pp. 810-817). Springer, Cham.

  3. Chakraborty, A., Paranjape, B., Kakarla, S., & Ganguly, N. (2016, August). Stop clickbait: Detecting and preventing clickbaits in online news media. In 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) (pp. 9-16). IEEE.

  4. Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759. 

Cite this post:
  title={{Clickbait for TREC News }},
  author={{John Foley}},
  howpublished={ \url{} }