FastRank alpha release
TL;DR: FastRank is a faster Coordinate Ascent for Python.
Check out the code on GitHub, see the API in a Jupyter notebook, or grab it from pip.
I have been an avid user of RankLib throughout my IR research life. However, RankLib has not aged particularly well: it is written in Java, which makes it hard to access from Python, and one of my favorite algorithms in it does not use multiple threads.
I rewrote my two favorite algorithms in Rust and bound them to Python 3.5+.
Getting Started
FastRank, this new library, is available from PyPI and can be installed through pip:
pip install fastrank
Note: there is no Windows support right now, just macOS and Linux. If you have a Windows machine or know how to help me set up GitHub Actions to create Windows builds, email me!
How do I use it? What does it do?
See this Colab notebook for a FastRank demo. (Colab is a Jupyter notebook environment hosted by Google.)
The current version of FastRank (0.4) supports the Random Forests [1] and Coordinate Ascent [2] models from RankLib: one nonlinear, one linear. For most of my projects, these two have been the best algorithms every time.
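For a quick flavor of the API (the Colab notebook is the authoritative reference; the names below, such as CDataset.open_ranksvm and TrainRequest.coordinate_ascent, reflect my reading of the 0.4 API, and the file path is a placeholder):

```python
from fastrank import CDataset, TrainRequest

# Load a training set in ranksvm/svmlight format (placeholder path).
dataset = CDataset.open_ranksvm("train.ranksvm")

# Request a Coordinate Ascent model; seeding makes runs reproducible.
request = TrainRequest.coordinate_ascent()
request.params.seed = 42

# Train and inspect the learned model.
model = dataset.train_model(request)
print(model)
```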
Why did you do this?
The short version: I wanted my two favorite models available to me in Python and Rust. The long version: RankLib is the most flexible LTR tool out there (to the best of my knowledge), and it's the only public, well-maintained implementation of Metzler & Croft's Coordinate Ascent, which is outstanding in both performance and simplicity.
Existing Learning-to-Rank Tools
The go-to learning-to-rank tools are either RankLib [3], which provides a variety of models, or something more specific like XGBoost [4] or SVM-rank [5], which each focus on a particular model.
RankLib, a general tool implemented by Van Dang, has garnered something like 40 citations (via a Google Scholar search) even though it doesn't have a core paper describing it. Many more papers use RankLib without citing it.
Algorithms/Models available in RankLib
RankLib provides a variety of models that were important to the development of LTR as the primary approach to retrieval:
| RankLib Model | Linear? | Problem | Family | 2019 Outlook |
|---|---|---|---|---|
| Regression Trees (MART) | Nonlinear | Pointwise | Tree | Useful in LambdaMART/RF |
| LambdaMART | Nonlinear | Listwise | Forest | Use XGBoost |
| Random Forests | Nonlinear | Pointwise | Forest | Use sklearn |
| RankNet | Nonlinear | Pairwise | Neural | Use torch/tensorflow |
| ListNet | Nonlinear | Listwise | Neural | Use torch/tensorflow |
| RankBoost | Nonlinear | Pairwise | Boosting | Not competitive |
| AdaRank | Nonlinear | Listwise | Boosting | Not competitive |
| Coordinate Ascent | Linear | Listwise | Linear | Use RankLib |
| Logistic Regression | Linear | Pointwise | Linear | Not competitive; use sklearn |
Over time, however, RankNet and ListNet have been subsumed by general neural libraries like PyTorch and TensorFlow. The boosting methods, AdaRank and RankBoost, have never performed competitively with the others. Regression trees are still helpful, but only within Random Forests (which trains many trees on subsamples of the data) or LambdaMART (which trains many trees to optimize changes in true metrics like NDCG).
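As the table's 2019 outlook suggests, the pointwise Random Forest recipe is easy to reproduce with sklearn; here is a minimal sketch on synthetic data (illustrative only, not a drop-in replacement for a real LTR pipeline):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Pointwise LTR: regress relevance labels on document features, then
# sort each query's candidates by predicted score. Data is synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                  # document feature vectors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)  # fake relevance labels

forest = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)

# Rank one query's candidate documents by predicted relevance.
candidates = rng.normal(size=(10, 20))
ranking = np.argsort(-forest.predict(candidates))
```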
Rewriting RankLib to be accessible from Python and more efficient is a daunting task, but by prioritizing the most effective algorithms that are not available elsewhere, I could quickly cover most of the functionality I have needed. I started with Coordinate Ascent and found it so much faster than the original that I soon added a Random Forest learner as well: a quick way to test whether nonlinear features are promising on a given dataset.
What’s Coordinate Ascent and why should I use it?
Coordinate Ascent directly optimizes your desired ranking measure by doing a feature-at-a-time search through an arbitrary, non-differentiable space. (Wikipedia and other resources disagree about whether this kind of learning process should be called coordinate ascent or descent.)
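To make the idea concrete, here is a toy sketch of metric-driven coordinate ascent over a linear ranker's weights (illustrative only, not FastRank's actual code): it perturbs one weight at a time and keeps any change that improves the evaluation measure.

```python
import random

def coordinate_ascent(score_fn, num_features, restarts=8, passes=25, seed=13):
    """Toy coordinate ascent; score_fn(weights) returns the measure to maximize."""
    rng = random.Random(seed)
    best_w, best_score = None, float("-inf")
    for _ in range(restarts):
        # Random restart: begin from fresh random weights.
        w = [rng.uniform(-1.0, 1.0) for _ in range(num_features)]
        score = score_fn(w)
        for _ in range(passes):
            # Visit one coordinate (feature weight) at a time, in random order.
            for i in rng.sample(range(num_features), num_features):
                for step in (0.05, -0.05, 0.25, -0.25):
                    candidate = list(w)
                    candidate[i] += step
                    # The measure need not be differentiable; we only compare values.
                    candidate_score = score_fn(candidate)
                    if candidate_score > score:
                        w, score = candidate, candidate_score
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```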
This model is awesome because:
- It directly optimizes the ranking measure.
- It is a linear model, and therefore efficient to train and apply.
- Its small number of parameters limits its ability to overfit.
- Random restarts are trivially parallelizable.
- It is usually among the best: it gets very competitive results on large datasets, and methods like LambdaMART are rarely far ahead, since ranking features are typically already engineered to work well linearly.
My implementation is special because:
- Random restarts are actually based on random weights (not just shuffled feature order, as in RankLib).
- Random restarts run in separate threads (see the sketch after this list).
- The code is cleaner and shorter.
- The random number generator used can be seeded, ensuring reproducible results given the same input files, seed, and version of FastRank.
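Because each restart is fully independent given its seed, restarts parallelize with no shared state. Here is a sketch using the coordinate_ascent toy from the previous section (score_fn is a stand-in for a real ranking measure):

```python
from concurrent.futures import ProcessPoolExecutor

def score_fn(weights):
    # Stand-in for a real ranking measure (e.g., mAP or NDCG over training queries).
    return -sum((w - 0.3) ** 2 for w in weights)

def run_restart(seed):
    # One independent run of the coordinate_ascent toy defined earlier.
    return coordinate_ascent(score_fn, num_features=10, restarts=1, seed=seed)

if __name__ == "__main__":
    # Restarts share no mutable state, so they map cleanly onto worker processes;
    # fixing the list of seeds makes the whole search reproducible.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(run_restart, range(16)))
    print(max(results, key=lambda pair: pair[1]))
```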
I’ll be talking about this at TREC 2019.
References

1. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
2. Metzler, D., & Croft, W. B. (2007). Linear feature-based models for information retrieval. Information Retrieval, 10(3), 257–274.
3. Dang, V. RankLib. RankLib Documentation Page.
4. Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794). ACM.
5. Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 133–142). ACM.
@misc{foley2019fastrank,
  title = {{FastRank alpha release}},
  author = {{John Foley}},
  year = {2019},
  howpublished = {\url{https://jjfoley.me/2019/10/11/fastrank-alpha.html}}
}