This repository contains the implementation of Type4Py and instructions for reproducing the results of the paper.
- Dataset
- Installation Guide
- Usage Guide
- Converting Type4Py to ONNX
- VSCode Extension
- Using Local Pre-trained Model
- Type4Py Server
- Citing Type4Py
For Type4Py, we use the ManyTypes4Py (MT4Py) dataset. You can download the latest version of the dataset here. Note that the dataset is already de-duplicated.
If you want to use your own dataset, it is essential to de-duplicate it first with a tool such as CD4Py.
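As a rough illustration of what de-duplication involves, the sketch below groups byte-identical Python files by content hash. It is not CD4Py and makes no attempt to reproduce it; the helper name `find_exact_duplicates` is hypothetical, and only exact copies are caught, so a custom dataset should still be processed with a dedicated tool.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(root):
    """Group .py files under `root` whose contents are byte-identical.

    Hypothetical helper for illustration only; it does not reproduce CD4Py.
    """
    by_hash = defaultdict(list)
    for path in sorted(Path(root).rglob("*.py")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        by_hash[digest].append(path)
    # Keep only the groups that contain more than one file.
    return [group for group in by_hash.values() if len(group) > 1]
```

Calling `find_exact_duplicates("path/to/corpus")` returns a list of groups, each a list of paths with identical contents.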
Here are the recommended system requirements for training Type4Py on the MT4Py dataset:
- Linux-based OS (Ubuntu 18.04 or newer)
- Python 3.6 or newer
- A high-end NVIDIA GPU (w/ at least 8GB of VRAM)
- A CPU with at least 16 threads (w/ at least 64GB of RAM)
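Some of these recommendations can be checked from Python before starting a long training run. The following is a minimal sketch that uses only the standard library; `check_requirements` is a hypothetical helper, and it does not inspect GPU memory or RAM.

```python
import os
import platform
import sys


def check_requirements(min_python=(3, 6), min_threads=16):
    """Compare the current machine against the recommended requirements.

    Hypothetical helper; GPU memory and RAM are not checked here.
    """
    return {
        "linux": platform.system() == "Linux",
        "python": sys.version_info[:2] >= min_python,
        "cpu_threads": (os.cpu_count() or 0) >= min_threads,
    }
```

Each entry in the returned dictionary is `True` when the corresponding recommendation is met.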
git clone https://github.com/saltudelft/type4py.git && cd type4py
pip install .
Follow the steps below to train and evaluate the Type4Py model.
NOTE: Skip this step if you're using the ManyTypes4Py dataset.
$ type4py extract --c $DATA_PATH --o $OUTPUT_DIR --d $DUP_FILES --w $CORES
Description:
- $DATA_PATH: The path to the Python corpus or dataset.
- $OUTPUT_DIR: The path to store processed projects.
- $DUP_FILES: The path to the duplicate files, i.e., the *.jsonl.gz file produced by CD4Py. [Optional]
- $CORES: Number of CPU cores to use for processing projects.
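When the extraction step is scripted, the optional duplicate-files argument can be added only when a CD4Py output is available. The sketch below assembles the command shown above as an argument list. The helper names are hypothetical, and defaulting the core count to `os.cpu_count()` is an assumption made for this example, not documented behavior of the CLI.

```python
import os
import subprocess


def build_extract_cmd(data_path, output_dir, dup_files=None, cores=None):
    """Assemble the argument list for `type4py extract`."""
    if cores is None:
        # Assumption for this sketch: use every available core.
        cores = os.cpu_count() or 1
    cmd = ["type4py", "extract", "--c", str(data_path), "--o", str(output_dir)]
    if dup_files is not None:
        # Optional: the *.jsonl.gz file produced by CD4Py.
        cmd += ["--d", str(dup_files)]
    cmd += ["--w", str(cores)]
    return cmd


def run_extract(data_path, output_dir, dup_files=None, cores=None):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(
        build_extract_cmd(data_path, output_dir, dup_files, cores), check=True
    )
```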
$ type4py preprocess --o $OUTPUT_DIR --l $LIMIT
Description:
- $OUTPUT_DIR: The path that was used in the first step to store processed projects. For the MT4Py dataset, use the directory in which the dataset is extracted.
- $LIMIT: The number of projects to be processed. [Optional]
$ type4py vectorize --o $OUTPUT_DIR
Description:
- $OUTPUT_DIR: The path that was used in the previous step to store processed projects.
$ type4py learn --o $OUTPUT_DIR --c --p $PARAM_FILE
Description:
- $OUTPUT_DIR: The path that was used in the previous step to store processed projects.
- --c: Trains the complete model. Use type4py learn -h to see other configurations.
- --p $PARAM_FILE: The path to user-provided hyper-parameters for the model. See this file as an example. [Optional]
$ type4py predict --o $OUTPUT_DIR --c
Description:
- $OUTPUT_DIR: The path that was used in the first step to store processed projects.
- --c: Predicts using the complete model. Use type4py predict -h to see other configurations.
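The preprocess, vectorize, learn, and predict steps all operate on the same output directory, so they can be chained in a small driver script. The sketch below only uses the sub-commands and flags shown above; the helper names `build_training_steps` and `run_steps` are hypothetical.

```python
import subprocess


def build_training_steps(output_dir, limit=None, param_file=None):
    """Return the preprocess, vectorize, learn, and predict commands in order."""
    out = str(output_dir)
    preprocess = ["type4py", "preprocess", "--o", out]
    if limit is not None:
        preprocess += ["--l", str(limit)]  # optional cap on the number of projects
    learn = ["type4py", "learn", "--o", out, "--c"]
    if param_file is not None:
        learn += ["--p", str(param_file)]  # optional hyper-parameter file
    return [
        preprocess,
        ["type4py", "vectorize", "--o", out],
        learn,
        ["type4py", "predict", "--o", out, "--c"],
    ]


def run_steps(steps):
    """Run each command in turn, stopping at the first failure."""
    for cmd in steps:
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

For example, `run_steps(build_training_steps("/path/to/output"))` would run the four steps in sequence with the complete model.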
$ type4py eval --o $OUTPUT_DIR --t c --tp 10
Description:
- $OUTPUT_DIR: The path that was used in the first step to store processed projects.
- --t: Evaluates the model on the selected prediction task. E.g., --t c considers all prediction tasks, i.e., parameters, return types, and variables. [Default: c]
- --tp 10: Considers the Top-10 predictions for evaluation. This argument accepts a positive integer between 1 and 10. [Default: 10]
Use type4py eval -h to see other options.
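To make the --tp argument concrete: under a Top-n criterion, a prediction counts as correct when the expected type appears among the first n ranked candidates. The sketch below illustrates that general idea only. It is not Type4Py's evaluation code, and the function name and sample data are invented for the example.

```python
def top_n_accuracy(ranked_predictions, expected_types, n=10):
    """Fraction of samples whose expected type is among the first n candidates."""
    if not 1 <= n <= 10:
        raise ValueError("n must be between 1 and 10")
    if len(ranked_predictions) != len(expected_types):
        raise ValueError("inputs must have the same length")
    if not expected_types:
        return 0.0
    hits = sum(
        expected in candidates[:n]
        for candidates, expected in zip(ranked_predictions, expected_types)
    )
    return hits / len(expected_types)


# Made-up sample data: ranked candidates per sample, best first.
predictions = [["int", "str", "float"], ["List[str]", "str"], ["bool", "int"]]
expected = ["int", "str", "float"]

top1 = top_n_accuracy(predictions, expected, n=1)  # only the first sample hits
top2 = top_n_accuracy(predictions, expected, n=2)  # the second sample now hits too
```

A larger n can only keep the score the same or raise it, which is why Top-10 results are typically higher than Top-1 results.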
To reduce the dimension of the created type clusters in step 5, run the following command:
Note: The reduced version of type clusters causes a slight performance loss in type prediction.
$ type4py reduce --o $OUTPUT_DIR --d $DIMENSION
Description:
- $OUTPUT_DIR: The path that was used in the first step to store processed projects.
- $DIMENSION: Reduces the dimension of type clusters to the specified value. [Default: 256]
To convert the pre-trained Type4Py model to the ONNX format, use the following command:
$ type4py to_onnx --o $OUTPUT_DIR
Description:
- $OUTPUT_DIR: The path that was used in the usage section to store processed projects and the model.
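After conversion, the exported model can be inspected with ONNX Runtime. The sketch below rests on two assumptions that this guide does not state: that the export is written somewhere under $OUTPUT_DIR, and that it uses an .onnx extension. The helper names are hypothetical, and `onnxruntime` is a third-party package that must be installed separately.

```python
from pathlib import Path


def find_onnx_models(output_dir):
    """List *.onnx files under output_dir (location and name are assumptions)."""
    return sorted(Path(output_dir).rglob("*.onnx"))


def describe_onnx_model(model_path):
    """Return (name, type, shape) for each input of an ONNX model."""
    import onnxruntime as ort  # third-party: pip install onnxruntime

    session = ort.InferenceSession(
        str(model_path), providers=["CPUExecutionProvider"]
    )
    return [(inp.name, inp.type, inp.shape) for inp in session.get_inputs()]
```

Listing the model inputs this way is a quick check that the export loads before wiring it into an inference pipeline.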
Type4Py can be used in VSCode, where it provides ML-based type auto-completion for Python files. Type4Py's VSCode extension can be installed from the VS Marketplace here.
Type4Py's pre-trained model can be queried locally by using the provided Docker images. See here for usage info.
The Type4Py server is deployed on our own infrastructure, where it exposes a public API and powers the VSCode extension. If you would like to deploy the Type4Py server on your own machine, you can adapt the server code here. Please feel free to create an issue if you have questions about deployment, using the pre-trained Type4Py model, or training your own model.
@inproceedings{mir2022type4py,
title={Type4Py: practical deep similarity learning-based type inference for python},
author={Mir, Amir M and Lato{\v{s}}kinas, Evaldas and Proksch, Sebastian and Gousios, Georgios},
booktitle={Proceedings of the 44th International Conference on Software Engineering},
pages={2241--2252},
year={2022}
}