Project Overview
This project demonstrates graph classification with the Graph Isomorphism Network (GIN) architecture, applied to molecular structures for classification and regression tasks. It focuses on molecular datasets and builds graph representations from SMILES strings. The project includes data processing, model implementation with PyTorch Geometric, and a study of ensemble performance combining GIN and Graph Convolutional Network (GCN) architectures.
- Molecular Data Processing: Includes steps to process SMILES data into graph-compatible formats.
- Classification and Regression: Models both binary classification (HIV activity) and regression tasks (lipophilicity prediction).
- Ensemble Modeling: Tests the performance of combining GCN and GIN architectures.
Prerequisites
- Python 3.x
- Jupyter Notebook or Google Colab
- Required packages:
  - torch
  - torch-geometric
  - rdkit
  - ogb
- Clone the repository:
git clone https://github.com/btarun13/Graph_classification_related.git
cd Graph_classification_related
- Run the code locally, or upload the notebook to Google Colab.
- (Optional) Install additional dependencies if using Google Colab:
# Run in a cell in Colab
!pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-2.2.1+cu121.html
!pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-2.2.1+cu121.html
!pip install torch-geometric
!pip install rdkit
- Load Data: Ensure the datasets for HIV and Lipophilicity are in your environment or specify their paths. Example datasets can be downloaded from MoleculeNet.
- Data Processing: Convert SMILES data to graph structures suitable for the GIN model.
- Training and Evaluation: Follow the steps in the notebook to train GIN and GCN models, evaluate performance, and explore ensemble approaches.
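The Data Processing step above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's own code: the helper names `mol_to_graph` and `smiles_to_graph` are hypothetical, the only node feature used is the atomic number, and RDKit is imported lazily so that the first helper runs without it.

```python
def mol_to_graph(atomic_nums, bonds):
    """Build node features and a COO edge list from atoms and bonds.

    atomic_nums: list of atomic numbers, one per atom.
    bonds: list of (i, j) atom-index pairs, one per chemical bond.
    """
    x = [[z] for z in atomic_nums]  # one feature per node: the atomic number
    src, dst = [], []
    for i, j in bonds:
        # Each undirected bond becomes two directed edges.
        src += [i, j]
        dst += [j, i]
    return x, [src, dst]


def smiles_to_graph(smiles):
    """Parse a SMILES string with RDKit and convert it to (x, edge_index)."""
    from rdkit import Chem  # lazy import: only needed for SMILES parsing

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    atomic_nums = [atom.GetAtomicNum() for atom in mol.GetAtoms()]
    bonds = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]
    return mol_to_graph(atomic_nums, bonds)


# Ethanol ("CCO") written out by hand: atoms C, C, O and bonds C-C, C-O.
x, edge_index = mol_to_graph([6, 6, 8], [(0, 1), (1, 2)])
```

Once wrapped in tensors, the two lists correspond to the `x` and `edge_index` arguments of `torch_geometric.data.Data`. A real pipeline would likely add more atom and bond features than this sketch does.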
- Data loading and preprocessing:
import pandas as pd

hiv_data = pd.read_csv("/path/to/HIV.csv")
lipo_data = pd.read_csv("/path/to/Lipophilicity.csv")
- Model Training:
# Train GIN model
gin_model = GINConv(...)
- Follow the training steps in the notebook
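For intuition about what a GIN layer computes, here is a small plain-Python sketch. `GINConv` is PyTorch Geometric's GIN layer; it applies a neural network to `(1 + eps) * x_i` plus the sum of the neighbors' features. The sketch shows only that aggregation step, leaves out the neural network, and uses a hypothetical function name `gin_aggregate`; it is not the notebook's training code.

```python
def gin_aggregate(x, edge_index, eps=0.0):
    """Compute (1 + eps) * x_i + sum of neighbor features for every node.

    x: list of per-node feature lists.
    edge_index: [sources, targets] in COO format, one entry per directed edge.
    """
    src, dst = edge_index
    # Start from the node's own features, scaled by (1 + eps).
    out = [[(1.0 + eps) * v for v in row] for row in x]
    for s, d in zip(src, dst):
        # A message flows from source node s to target node d.
        for k, v in enumerate(x[s]):
            out[d][k] += v
    return out


# Path graph 0 - 1 - 2 with one scalar feature per node.
x = [[1.0], [2.0], [3.0]]
edge_index = [[0, 1, 1, 2], [1, 0, 2, 1]]
h = gin_aggregate(x, edge_index)
```

Because the neighbors are summed rather than averaged, nodes with different numbers of neighbors stay distinguishable, which is the property GIN relies on.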
The notebook provides an evaluation of:
- Classification Accuracy for HIV activity prediction.
- Ensemble Comparison between GIN, GCN, and combined models.
- Final prediction function: given a SMILES string and the model to use for the estimate, it returns a probability estimate, e.g. smile_to_hiv_prob(i, best_model).item() == estimate
Special thanks to the creators of the datasets provided by MoleculeNet, and to the developers of PyTorch Geometric and the Open Graph Benchmark (OGB) team.
This project is licensed under the MIT License.