Anyone who is even slightly interested in Natural Language Processing has probably already heard of BERT (Bidirectional Encoder Representations from Transformers). Together with the original Transformer paper and other, newer models, these neural approaches have been sweeping every NLP leaderboard available.
This new fad, however, has not been received without criticism. Papers like this one, which claims that BERT is learning only "spurious statistical cues in the dataset", raise concerns about how much these neural models are actually learning from data and how much we can rely on them.
However, another criticism is that these models are simply TOO EXPENSIVE (not to mention their heavy environmental impact) to train from scratch, essentially excluding everyone but industry from developing new models that could beat BERT, XLNet, and others. This Twitter thread, involving some heavy names like Konrad Kording from UPenn, David Pfau from DeepMind, and Yann LeCun (do we need to introduce him?), summarized this topic in an interesting way:
TIL: if you work at Facebook or Google brain you have a really high carbon footprint. A single 8gp machine is as bad as flying trans Atlantic every month. The ML community should stop rewarding boring big compute. https://t.co/R5FOi0ksjV— KordingLab (@KordingLab) July 26, 2019
However, I’m forced to agree with LeCun’s closing statement:
I'm not a big fan of brute-force deep learning research, but the argument that industry should refrain from doing research that can't be done in academia so as not to "drive them out" is preposterous.— Yann LeCun (@ylecun) July 27, 2019
You see, this is not necessarily a problem. As long as the industry is still releasing code and data (even if only the pre-trained models), smaller research groups can still benefit a lot from these improvements.
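This is exactly why released checkpoints matter. As a rough sketch of what "benefiting from these improvements" looks like in practice (assuming the Hugging Face `transformers` library and the public `bert-base-uncased` checkpoint, neither of which the post names explicitly), a small group can load an industry pre-trained BERT in a few lines and build on its embeddings:

```python
# Minimal sketch: reuse a pre-trained BERT released by industry,
# instead of paying to train one from scratch.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Encode some text and get contextual embeddings, one vector per token.
inputs = tokenizer("neural ranking models", return_tensors="pt")
outputs = model(**inputs)

# outputs.last_hidden_state has shape (batch, num_tokens, hidden_size);
# for BERT-base the hidden size is 768. These vectors can feed a
# downstream model (a ranker, a classifier, whatever you dream up).
print(outputs.last_hidden_state.shape)
```

From here, the creative part is entirely on the researcher's side: the expensive pre-training is done, and only the (much cheaper) task-specific layers need to be trained.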
And we don’t even need to look much further than SIGIR 2019’s accepted short papers, where at least three papers used BERT as a base for some creative approaches. For instance, MacAvaney et al. came up with a really interesting way to add BERT embeddings to a “traditional” (can we say this?) deep IR model for ad-hoc retrieval, with impressive results (probably a new SOTA for Robust04). Another example is Sakata et al., who also achieved great results retrieving and ranking FAQ answers, using BERT to rank query–answer pairs.
Smaller research groups cannot (and should not) compete with these large industry-backed laboratories. However, we can be more "inventive" and creative than they are. We can use these models for other tasks, and modify them for things no one has thought of before.
I’m not worried about the dominance of industry in NLP research. I’m excited. Let them spend the money. Let them build the large models. And let us be creative in how we apply them, in how we improve upon them, and in how we twist and turn these models for our own good.