This repository contains our math information retrieval (MIR) system for the ARQMath-3 competition. The system is based on the soft cosine measure. The repository also contains the paper that describes the system.
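As a rough illustration of the soft cosine measure mentioned above, the following sketch computes the standard formula `x^T S y / (sqrt(x^T S x) * sqrt(y^T S y))` for two term-weight vectors and a term similarity matrix `S`. The three-term vocabulary, the similarity values, and the function names are assumptions made for this example only; they are not taken from the repository, which may implement the measure differently.

```python
import math


def soft_inner(x, y, S):
    """Return x^T S y for dense term-weight vectors x, y and a term similarity matrix S."""
    return sum(x[i] * S[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))


def soft_cosine(x, y, S):
    """Return the soft cosine measure of x and y under S (0.0 if either soft norm is zero)."""
    norm = math.sqrt(soft_inner(x, x, S)) * math.sqrt(soft_inner(y, y, S))
    return soft_inner(x, y, S) / norm if norm > 0 else 0.0


# Hypothetical vocabulary: ["derivative", "differentiation", "matrix"]
S = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# With the identity matrix, the measure reduces to the ordinary cosine similarity.
identity = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

query = [1.0, 0.0, 0.0]     # contains only "derivative"
document = [0.0, 1.0, 0.0]  # contains only "differentiation"

hard = soft_cosine(query, document, identity)
soft = soft_cosine(query, document, S)
```

In this toy setup, the query and the document share no terms, so the ordinary cosine similarity (`hard`) is zero, whereas the soft cosine measure (`soft`) credits the assumed similarity between the two related terms.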
- Compare the soft vector space model against sparse information retrieval baselines.
- Compare performance of text, text + LaTeX, and text + Tangent-L as math representations
- Compare performance of non-positional `word2vec` and positional `word2vec` embeddings
- Compare performance of `word2vec` embeddings and decontextualized `roberta-base` embeddings
- Compare performance of decontextualized embeddings of `roberta-base` and tuned `roberta-base`
- Compare performance of interpolated and joint SCM models for text and math
- Prepare dataset
- Train tokenizer
- Tune `roberta-base` model
- Train `word2vec` models
- Produce decontextualized word embeddings
- Produce dictionaries
- Produce term similarity matrices
- Produce ARQMath runs
- Optimize soft vector space similarity matrices
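The decontextualization step listed above can be pictured as averaging the contextual embeddings of each token over all of its occurrences. The sketch below shows one common way to maintain such averages batch by batch without storing every vector. It is a minimal illustration under assumptions: the function names, the input format (lists of token and vector pairs), and the exact update rule are hypothetical and may differ from what the notebooks in this repository do.

```python
def update_mean(mean, count, batch):
    """Merge a batch of vectors into a running mean; return (new_mean, new_count)."""
    m = len(batch)
    if m == 0:
        return mean, count
    dim = len(batch[0])
    batch_sum = [sum(vec[d] for vec in batch) for d in range(dim)]
    total = count + m
    # Equivalent to (count * mean + batch_sum) / total, written as an incremental update.
    new_mean = [mean[d] + (batch_sum[d] - m * mean[d]) / total for d in range(dim)]
    return new_mean, total


def decontextualize(batches, dim):
    """Average contextual vectors per token over batches of (token, vector) pairs."""
    means, counts = {}, {}
    for batch in batches:
        # Group the vectors in this batch by token so each token is updated once per batch.
        grouped = {}
        for token, vec in batch:
            grouped.setdefault(token, []).append(vec)
        for token, vecs in grouped.items():
            mean = means.get(token, [0.0] * dim)
            count = counts.get(token, 0)
            means[token], counts[token] = update_mean(mean, count, vecs)
    return means


# Hypothetical two-dimensional "contextual embeddings" for two tokens.
batches = [
    [("x", [1.0, 0.0]), ("x", [3.0, 2.0]), ("y", [0.0, 4.0])],
    [("x", [5.0, 4.0])],
]
embeddings = decontextualize(batches, dim=2)
```

Here `embeddings["x"]` is the mean of the three vectors observed for the token `x`, even though they arrived in two separate batches.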
- Accelerated word embedding decontextualization using the batched algorithm for averages by Matt Hancock.
- Online demo of our system using the Document Maps visualization tool.
- Our `witiko/mathberta` model at the 🤗 Model Hub.
- Our paper from CLEF 2022 that describes our system.
- Our presentation from CLEF 2022 that describes our system.
- Add extrinsic end-task evaluation on NumGLUE to `03-finetune-roberta.ipynb`. Plot performance on the five different NumGLUE tasks (y-axis) over checkpoints (x-axis).
- Add end-task evaluation on ARQMath-3 topics over checkpoints to `08-produce-arqmath-runs.ipynb`.
Vít Novotný and Michal Štefánik. “Combining Sparse and Dense Information Retrieval. Soft Vector Space Model and MathBERTa at ARQMath-3 Task 1 (Answer Retrieval)”. In: Proceedings of the Working Notes of CLEF 2022. Ed. by Guglielmo Faggioli, Nicola Ferro, Allan Hanbury, and Martin Potthast. CEUR-WS, 2022, pp. 104–118. URL: http://ceur-ws.org/Vol-3180/paper-06.pdf (visited on 08/12/2022).
@inproceedings{novotny2022combining,
booktitle = {Proceedings of the Working Notes of {CLEF} 2022},
editor = {Faggioli, Guglielmo and Ferro, Nicola and Hanbury, Allan and Potthast, Martin},
issn = {1613-0073},
title = {Combining Sparse and Dense Information Retrieval},
subtitle = {Soft Vector Space Model and MathBERTa at ARQMath-3 Task 1 (Answer Retrieval)},
author = {Novotný, Vít and Štefánik, Michal},
publisher = {{CEUR-WS}},
year = {2022},
pages = {104-118},
numpages = {15},
url = {http://ceur-ws.org/Vol-3180/paper-06.pdf},
urldate = {2022-08-12},
}