# pytorch-lifestream

`pytorch-lifestream` (`ptls`) is a library built on PyTorch for building embeddings of discrete event sequences using self-supervision. It can process terabyte-scale volumes of raw events such as game history events, clickstream data, purchase history, or card transactions.
It supports various methods of self-supervised training, adapted for event sequences:
- Contrastive Learning for Event Sequences (CoLES)
- Contrastive Predictive Coding (CPC)
- Replaced Token Detection (RTD) from ELECTRA
- Next Sequence Prediction (NSP) from BERT
- Sequence Order Prediction (SOP) from ALBERT
- Masked Language Model (MLM) from RoBERTa
It supports several types of encoders, including Transformer and RNN, as well as many types of self-supervised losses; a minimal model assembly sketch is shown below.
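For example, a CoLES model over card transactions can be assembled from a `TrxEncoder` (per-event embeddings) and an RNN sequence encoder. This is a minimal sketch in the spirit of `demo/coles-emb.ipynb`; the feature names `mcc_code` and `amount` and their sizes are placeholder assumptions for your own data:

```python
from functools import partial

import torch
from ptls.nn import TrxEncoder, RnnSeqEncoder
from ptls.frames.coles import CoLESModule

# Per-event encoder: embeds categorical fields and scales numeric ones.
# `mcc_code` and `amount` are hypothetical column names.
trx_encoder = TrxEncoder(
    embeddings={'mcc_code': {'in': 200, 'out': 16}},  # vocab size 200 -> 16-dim embedding
    numeric_values={'amount': 'log'},                 # log-scaled numeric feature
)

# Sequence encoder: a GRU over the per-event embeddings.
seq_encoder = RnnSeqEncoder(
    trx_encoder=trx_encoder,
    hidden_size=256,
    type='gru',
)

# Self-supervised CoLES training module (a PyTorch-Lightning module).
model = CoLESModule(
    seq_encoder=seq_encoder,
    optimizer_partial=partial(torch.optim.Adam, lr=0.001),
    lr_scheduler_partial=partial(torch.optim.lr_scheduler.StepLR, step_size=30, gamma=0.9),
)
```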
The following variants of contrastive losses are supported (a usage sketch follows the list):
- Contrastive loss (paper)
- Triplet loss (paper)
- Binomial deviance loss (paper)
- Histogram loss (paper)
- Margin loss (paper)
- VICReg loss (paper)
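As a sketch of how a specific loss is selected (reusing `seq_encoder` and the imports from the example above): `CoLESModule` trains with a contrastive loss by default, and an explicit `loss` can be passed in, e.g. with hard-negative pair mining:

```python
from ptls.frames.coles.losses import ContrastiveLoss
from ptls.frames.coles.sampling_strategies import HardNegativePairSelector

# Contrastive loss with hard-negative mining: for each positive pair,
# the 5 hardest negatives in the batch are selected.
loss = ContrastiveLoss(
    margin=0.5,
    sampling_strategy=HardNegativePairSelector(neg_count=5),
)

model = CoLESModule(
    seq_encoder=seq_encoder,  # from the sketch above
    loss=loss,
    optimizer_partial=partial(torch.optim.Adam, lr=0.001),
    lr_scheduler_partial=partial(torch.optim.lr_scheduler.StepLR, step_size=30, gamma=0.9),
)
```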
## Installation

```sh
pip install pytorch-lifestream
```
For development, set up the environment with `pipenv`:

```sh
# Ubuntu 20.04
sudo apt install python3.8 python3-venv
pip3 install pipenv

pipenv sync --dev  # install packages exactly as specified in Pipfile.lock
pipenv shell
pytest
```
## Tutorials

Learn deep-learning analysis of event sequences with `pytorch-lifestream`.

We have collected a set of topics related to the processing of event sequences. Most topics are supported by demo code that uses the `ptls` library. We recommend studying the topics sequentially; however, if you are already familiar with some areas, you can skip them and pick only the relevant topics.
ix | Topic | Description | Demo |
---|---|---|---|
1. | Prerequisites | | |
1.1. | PyTorch | Deep learning framework | https://pytorch.org/ |
1.2. | PyTorch-Lightning | NN training framework | https://lightning.ai/ |
1.3. | (optional) Hydra | Configuration framework | https://hydra.cc/ and [demo/Hydra CoLES Training.ipynb](./demo/Hydra CoLES Training.ipynb) |
1.4. | pandas | Data preprocessing | https://pandas.pydata.org/ |
1.5. | (optional) PySpark | Big data preprocessing | https://spark.apache.org/ |
2. | Event sequences | Problem statement and classical methods | |
2.1. | Event sequences for global problems | e.g. event sequence classification | TBD |
2.2. | Event sequences for local problems | e.g. next event prediction | TBD |
3. | Supervised neural networks | Supervised learning for event sequence classification | demo/supervised-sequence-to-target.ipynb |
3.1. | Network types | Different networks for sequences | |
3.1.1. | Recurrent neural networks | | TBD, based on supervised-sequence-to-target.ipynb |
3.1.2. | (optional) Convolutional neural networks | | TBD, based on supervised-sequence-to-target.ipynb |
3.1.3. | Transformers | | demo/supervised-sequence-to-target-transformer.ipynb |
3.2. | Problem types | Different problem types for sequences | |
3.2.1. | Global problems | Binary, multilabel, regression, ... | TBD, based on demo/multilabel-classification.ipynb |
3.2.2. | Local problems | Next event prediction | demo/event-sequence-local-embeddings.ipynb |
4. | Unsupervised learning | Pretrain a self-supervised model with a proxy task | TBD, based on demo/coles-emb.ipynb |
4.1. | (optional) Word2vec | Context-based methods | |
4.2. | MLM, RTD, GPT | Event-based methods | Self-supervised training and embeddings for clients' transactions notebook |
4.3. | NSP, SOP | Sequence-based methods | demo/nsp-sop-emb.ipynb |
5. | Contrastive and non-contrastive learning | Latent representation-based losses | TBD, based on demo/coles-emb.ipynb |
5.1. | CoLES | | demo/coles-emb.ipynb |
5.2. | VICReg | | TBD, based on demo/coles-emb.ipynb |
5.3. | CPC | | TBD, based on demo/coles-emb.ipynb |
5.4. | MLM, TabFormer and others | Self-supervised TrxEncoder-only training with Masked Language Model | demo/mlm-emb.ipynb, demo/tabformer-emb.ipynb |
6. | Pretrained model usage | | |
6.1. | Downstream model on frozen embeddings | | TBD, based on demo/coles-emb.ipynb |
6.2. | CatBoost on embedding features | | demo/coles-catboost.ipynb |
6.3. | Model finetuning | | demo/coles-finetune.ipynb |
7. | Preprocessing options | Data preparation demos | demo/preprocessing-demo.ipynb |
7.1. | ptls-format parquet data loading | PySpark and Parquet for data preprocessing | demo/pyspark-parquet.ipynb |
7.2. | Fast inference for big datasets | | demo/extended_inference.ipynb |
8. | Special feature types | | |
8.1. | Using a pretrained encoder for text features | | demo/coles-pretrained-embeddings.ipynb |
8.2. | Multi-source models | | demo/CoLES-demo-multimodal-unsupervised.ipynb |
9. | Trx encoding options | | |
9.1. | Basic options | | TBD |
9.2. | Transaction quantization | | TBD |
9.3. | Transaction BPE | | TBD |
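Most of the demos above share one end-to-end pattern: preprocess a flat event table into per-client sequences, wrap them in a dataset with a subsequence-sampling strategy, and train with PyTorch Lightning. A condensed sketch, assuming a CSV with hypothetical columns `client_id`, `trans_date`, `mcc_code`, `amount`, and the `model` from the earlier example:

```python
import pandas as pd
import pytorch_lightning as pl
from ptls.preprocessing import PandasDataPreprocessor
from ptls.data_load.datasets import MemoryMapDataset
from ptls.frames import PtlsDataModule
from ptls.frames.coles import ColesDataset
from ptls.frames.coles.split_strategy import SampleSlices

df = pd.read_csv('transactions.csv')  # one row per raw event (hypothetical file)

# Group events by client, encode categories, and build ptls feature dicts.
preprocessor = PandasDataPreprocessor(
    col_id='client_id',
    col_event_time='trans_date',
    event_time_transformation='dt_to_timestamp',  # assumes datetime-like values
    cols_category=['mcc_code'],
    cols_numerical=['amount'],
)
dataset = preprocessor.fit_transform(df)  # list of per-client feature dicts

# CoLES samples several subsequences (slices) per client as positive pairs.
train_ds = ColesDataset(
    MemoryMapDataset(dataset),
    splitter=SampleSlices(split_count=5, cnt_min=25, cnt_max=200),
)
dm = PtlsDataModule(train_data=train_ds, train_batch_size=256, train_num_workers=4)

trainer = pl.Trainer(max_epochs=10)
trainer.fit(model, dm)  # `model` is the CoLESModule from the earlier sketch
```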
## Docs

See the library description index for more details.

## Experiments

`pytorch-lifestream` usage experiments on several public event datasets are available in a separate repo:
- Data Fusion Contest 2022 report (in Russian)
- Data Fusion Contest 2022 report, Sber AI Lab team (in Russian)
- VK.com Graph ML Hackathon report (in Russian)
- VK.com Graph ML Hackathon report, AlfaBank team (in Russian)
- American Express - Default Prediction Kaggle contest report (in Russian)
- Data Fusion Contest 2024, Sber AI Lab team
- Data Fusion Contest 2024, Ivan Alexandrov
- American Express - Default Prediction
- COTIC: `pytorch-lifestream` is used in the experiments for the continuous-time convolutions model of event sequences
## How to contribute

- Make your changes via fork and pull request.
- Write unit tests for new code in `ptls_tests`.
- Check the unit tests via `pytest`: Example.
## Citation

We have a paper you can cite:
```bibtex
@inproceedings{Babaev_2022,
  series={SIGMOD/PODS ’22},
  title={CoLES: Contrastive Learning for Event Sequences with Self-Supervision},
  url={http://dx.doi.org/10.1145/3514221.3526129},
  DOI={10.1145/3514221.3526129},
  booktitle={Proceedings of the 2022 International Conference on Management of Data},
  publisher={ACM},
  author={Babaev, Dmitrii and Ovsov, Nikita and Kireev, Ivan and Ivanova, Maria and Gusev, Gleb and Nazarov, Ivan and Tuzhilin, Alexander},
  year={2022},
  month=jun,
  collection={SIGMOD/PODS ’22}
}
```