Diagnosing BERT with retrieval heuristics

Recently, my first full paper as a PhD student was accepted at ECIR. It discusses how IR was taken by storm by Transformers (follow this for more information on them), and how we can analyse these methods using axiom-based techniques.


In the past few years, the field of Information Retrieval was hit by a storm that (almost) no one saw coming. While Deep Learning models like DRMM and DUET were proposed for ad-hoc retrieval, we did not see the same massive impact and improvement that other fields, like Computer Vision or speech recognition, experienced.

However, with the introduction of the Transformer architecture, Natural Language Processing finally experienced its own “ImageNet moment”, with papers such as BERT, XLNet and derived models leading every single leaderboard available. It was therefore only a matter of time until someone tried to apply these methods to IR, given the proximity between IR and NLP (we can discuss this over a beer at ECIR if you want to).

And in fact, it did happen. Ad-hoc retrieval, the task of ranking a list of documents given a single query, has long resisted neural approaches, especially on standard IR collections like Robust04 (Jimmy Lin maintains a nice, unofficial “leaderboard” for this old but hard-to-beat dataset here). It was only after BERT appeared that we started to see results significantly better than a traditional BM25+RM3 approach, with the first papers to actually surpass the original TREC results from 2004 appearing at SIGIR last year (like CEDR).

However, it is still not clear what exactly makes BERT and similar approaches so effective for IR. While some works have investigated BERT further (like Tenney et al. and Clark et al.), not a lot of attention has been devoted to analysing BERT in the IR context.

Therefore, inspired by Rennings et al., we created a number of diagnostic datasets, a model-agnostic approach for analysing retrieval models. The main idea is simple: we create a number of datasets, each designed to test one specific axiom (for instance, a document with more terms that are present in the query should be ranked higher than one with fewer query terms). By crafting these datasets from pairs of queries and documents (or tuples of one query and two documents) for which we know the order in which the model should rank them, we can gain insight into which instances the model (BERT, in our case) handles well and which it does not.
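To make the idea concrete, here is a toy sketch of such a check for a TFC1-style axiom (my own illustration, not the paper's actual pipeline): given a scoring function and a pair of documents that differ in query-term frequency, we test whether the model ranks the term-heavier document first. The simple term-frequency scorer below stands in for any retrieval model (BM25, QL, BERT, …), and the example texts are made up.

```python
from collections import Counter

def term_frequency_score(query, doc):
    """Toy scoring function: sum of query-term frequencies in the document.
    Stands in for any retrieval model under analysis."""
    counts = Counter(doc.split())
    return sum(counts[t] for t in query.split())

def fulfills_tfc1(score, query, doc_more, doc_fewer):
    """TFC1: the document with more query-term occurrences should score higher."""
    return score(query, doc_more) > score(query, doc_fewer)

query = "cat food"
doc_more = "cat food for every cat"   # 'cat' x2, 'food' x1
doc_fewer = "cat toys for every dog"  # 'cat' x1

print(fulfills_tfc1(term_frequency_score, query, doc_more, doc_fewer))  # True
```

Running the same check with a neural model only requires swapping in its scoring function; the diagnostic datasets themselves stay unchanged, which is what makes the approach model-agnostic.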

In practice, we fine-tuned a DistilBERT model on the 2019 TREC Deep Learning track dataset, and created 9 datasets based on axioms for analysing the resulting model. Specifically, we used the following axioms: TFC1, TFC2, M-TDC, LNC1, LNC2, STMC1, STMC2, STMC3 and TP.

Diagnostic Datasets

Each axiom has two forms: an “original” one and a “relaxed” one. Since most of these axioms were designed to analyse a retrieval function analytically, they assume very strict and artificial query and document scenarios, and must be “relaxed” in order to work on a real-world dataset. Take the axiom STMC1, for instance. It is originally defined as: given a single-term query $Q=\{q\}$ and two single-term documents $D_1=\{d_1\}$, $D_2=\{d_2\}$ where $d_1\ne d_2 \ne q$, the retrieval score of $D_1$ should be higher than that of $D_2$ if the semantic similarity between $q$ and $d_1$ is higher than that between $q$ and $d_2$. An informal description of all the axioms can be found below:

| Heuristic | Informal description |
| --- | --- |
| TFC1 | The more occurrences of a query term a document has, the higher its retrieval score. |
| TFC2 | The increase in retrieval score of a document gets smaller as the absolute query term frequency increases. |
| M-TDC | The more discriminating query terms (i.e., those with a high IDF value) a document contains, the higher its retrieval score. |
| LNC1 | The retrieval score of a document decreases as terms not appearing in the query are added. |
| LNC2 | A document that is duplicated does not have a lower retrieval score than the original document. |
| STMC1 | A document’s retrieval score increases as it contains terms that are more semantically related to the query terms. |
| STMC2 | The document terms that are a syntactic match to the query terms contribute at least as much to the document’s retrieval score as the semantically related terms. |
| STMC3 | A document’s retrieval score increases as it contains more terms that are semantically related to different query terms. |
| TP | A document’s retrieval score increases as the query terms appearing in it appear in closer proximity. |

As you can see, assuming that a document has only one term, and that a query also has only one term, is not exactly applicable to real datasets. Therefore, we find samples of queries and documents in the dataset that closely resemble the original conditions and construct our datasets from these examples. For instance, for STMC1, we allow $D_1$, $D_2$ and $Q$ to be arbitrarily long, but require both documents to contain the same number of query terms, and the semantic distance between the documents, excluding the query terms (measured with GloVe embeddings), to be smaller than a given threshold.
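A sketch of this relaxed STMC1 precondition, with toy two-dimensional word vectors standing in for GloVe (the vectors, terms and threshold below are made up purely for illustration):

```python
import math

# Toy word vectors; in the paper, pretrained GloVe embeddings are used instead.
VECS = {
    "feline": [0.9, 0.1], "kitten": [0.85, 0.2],
    "banana": [0.1, 0.9], "cat": [1.0, 0.0],
}

def avg_vector(terms):
    """Element-wise mean of the terms' embeddings."""
    dims = zip(*(VECS[t] for t in terms))
    return [sum(d) / len(terms) for d in dims]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def stmc1_pair_ok(query_terms, doc1_terms, doc2_terms, threshold=0.05):
    """Relaxed STMC1 precondition: both documents contain the same number of
    query terms, and their non-query parts are semantically close."""
    q = set(query_terms)
    if sum(t in q for t in doc1_terms) != sum(t in q for t in doc2_terms):
        return False
    rest1 = [t for t in doc1_terms if t not in q]
    rest2 = [t for t in doc2_terms if t not in q]
    if not rest1 or not rest2:
        return False
    distance = 1 - cosine(avg_vector(rest1), avg_vector(rest2))
    return distance < threshold
```

For example, with query `["cat"]`, the documents `["cat", "feline"]` and `["cat", "kitten"]` form a valid pair (close non-query parts), while `["cat", "feline"]` and `["cat", "banana"]` do not.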

We repeat this for all of the axioms above, and end up with 9 unique datasets for analysing how a model fares on each of them.

The first interesting finding is that most axioms simply do not hold in more than 50% of the samples in our datasets. That is, for a pair of documents $D_1$ and $D_2$ where, according to the axiom, $D_1$ should be more relevant than $D_2$, the relevance judgements of the dataset disagree in more than half of the samples. This can be partially explained by the fact that the TREC dataset consists almost entirely of shallow judgements, with only one judged relevant document per query. However, this alone is not enough to justify such large disagreement.
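The agreement computation itself is straightforward; here is a hedged sketch (the query and document ids below are hypothetical, and pairs where both documents have equal judged relevance are simply skipped):

```python
def agreement_with_qrels(axiom_pairs, qrels):
    """Fraction of axiom pairs (query, preferred doc, other doc) on which the
    relevance judgements agree with the axiom's preference. Pairs with equal
    judged relevance are skipped; returns None if nothing is comparable."""
    agree = checked = 0
    for query, d_pref, d_other in axiom_pairs:
        rel_pref = qrels.get((query, d_pref), 0)
        rel_other = qrels.get((query, d_other), 0)
        if rel_pref == rel_other:
            continue
        checked += 1
        agree += rel_pref > rel_other
    return agree / checked if checked else None

# Hypothetical judgements, for illustration only.
qrels = {("q1", "d1"): 1, ("q1", "d2"): 0, ("q2", "d3"): 0, ("q2", "d4"): 1}
pairs = [("q1", "d1", "d2"), ("q2", "d3", "d4")]
print(agreement_with_qrels(pairs, qrels))  # 0.5
```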

| | TFC1 | TFC2 | M-TDC | LNC1 | LNC2 | TP | STMC1 | STMC2 | STMC3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Diagnostic dataset size | 119,690 | 10,682 | 13,871 | 14,481,949 | 7,452 | 3,010,246 | 319,579 | 7,321,319 | 217,104 |
| Instances with a relevant document | 1,416 | 17 | 11 | 138,399 | 82 | 20,559 | 19,666 | 70,829 | 1,626 |
| Fraction of instances agreeing with relevance | 0.91 | 0.29 | 0.82 | 0.50 | - | 0.18 | 0.44 | 0.63 | 0.35 |

The second finding is that, despite being vastly superior to a traditional model like Indri QL, DistilBERT does not fulfill any axiom better than it. This is somewhat in line with previous works like Rennings et al., where deep learning models also did not fulfill any axiom better than traditional models. However, in contrast to previous works, BERT is in fact much superior in retrieval performance, despite not fulfilling the axioms. (The only axioms it fulfills slightly better than QL are TP and STMC1.)
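As an aside, the TP heuristic can be operationalised by measuring the smallest token window that covers all query terms, and then checking whether the model scores the document with the smaller window higher. A minimal sketch of the window computation (my own illustration, not the paper's exact implementation; brute force is fine for short queries):

```python
from itertools import product

def min_span(query_terms, doc_terms):
    """Smallest token window in the document covering all query terms
    (None if some query term does not occur in the document)."""
    positions = {t: [i for i, w in enumerate(doc_terms) if w == t]
                 for t in query_terms}
    if any(not pos for pos in positions.values()):
        return None
    # Try one occurrence per query term and keep the tightest window.
    return min(max(combo) - min(combo) + 1
               for combo in product(*positions.values()))

print(min_span(["deep", "learning"], "deep learning for search".split()))   # 2
print(min_span(["deep", "learning"], "deep models help learning".split()))  # 4
```

A TP diagnostic instance is then a document pair where the spans differ, with the smaller-span document as the one the model should prefer.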

| Model | Retrieval eff. (metric 1) | Retrieval eff. (metric 2) | TFC1 | TFC2 | M-TDC | LNC1 | LNC2 | TP | STMC1 | STMC2 | STMC3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| QL | 0.2627 | 0.3633 | 0.99 | 0.70 | 0.88 | 0.50 | 1.00 | 0.39 | 0.49 | 0.70 | 0.70 |
| DistilBERT | 0.3633 | 0.4537 | 0.61 | 0.39 | 0.51 | 0.50 | 0.00 | 0.41 | 0.50 | 0.51 | 0.51 |

This is a really interesting result. While we expected that BERT would follow the axioms better than traditional models, given its better retrieval performance, it is in fact quite the opposite. This means that the current axioms are not enough for analysing ad-hoc retrieval in a BERT-and-Transformers world. We believe that new axioms, ones that better capture the nuances of models like BERT, are needed, opening an interesting new research avenue that revives axiomatic retrieval.

Data and code

Our trained model is available as a 🤗 Transformers model here, and the code for generating the datasets is available on GitHub here. The full paper (as a preprint) is also available here.