graph_from_smiles() missing #85

Open
schatzsc opened this issue Oct 15, 2022 · 6 comments

@schatzsc
Collaborator

schatzsc commented Oct 15, 2022

In tests_chembl.ipynb the function graph_from_smiles() is used to convert a SMILES string to a graph, which is useful when processing ChEMBL and PubChem structures.

It was once part of tucan.io and could be imported with from tucan.io import graph_from_smiles

However, it seems like this function got lost somewhere on the way and is not included in molfile_reader.py anymore (although admittedly it also does not make much sense under this name).

I would very much like to have it back in tucan.io.

I can also provide a graph_from_csd routine (although it requires a local installation of the CSD database), and I am currently working on a graph_from_pubchem function. As with the CSD, the atoms and bonds can be read directly, which makes the detour via the SMILES string unnecessary, see:

PubChemPy Dictionary representation

@flange-ipb
Collaborator

graph_from_smiles was removed in commit af7420be39059e7fd05c08b6cf0704e0d385ccb9 due to its use of RDKit.

@schatzsc
Collaborator Author

Thank you very much for pointing to the commit where this was removed. I actually did not remember that it was one of the RDKit-based parts that we decided to kick out due to problems with metal complex handling.

In the meantime, I also figured out how to access ChEMBL and PubChem directly without "detour" via molfile or SMILES.

Interestingly, PubChem returns a data structure with explicit hydrogens that is extremely easy to convert to a graph, see graph_from_pubchem()
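To illustrate why such a record is easy to convert, here is a minimal sketch. It assumes a dictionary layout modeled on PubChemPy's dictionary representation (atoms carrying "aid" and "element", bonds carrying "aid1", "aid2" and "order"); the function name and the plain (nodes, edges) output are hypothetical and are not the actual graph_from_pubchem().

```python
# Sketch only: the record layout is an assumption modeled on PubChemPy's
# dictionary representation; this is not TUCAN's graph_from_pubchem().

def graph_from_atoms_and_bonds(record):
    """Return (nodes, edges) built from an atom/bond dictionary."""
    # Map each atom ID ("aid") to a zero-based node index.
    index_of = {atom["aid"]: i for i, atom in enumerate(record["atoms"])}
    # Nodes: index -> element symbol.
    nodes = {index_of[atom["aid"]]: atom["element"] for atom in record["atoms"]}
    # Edges: sorted index pairs, one per bond (bond order is ignored here).
    edges = sorted(
        tuple(sorted((index_of[bond["aid1"]], index_of[bond["aid2"]])))
        for bond in record["bonds"]
    )
    return nodes, edges


# Hand-written water record with explicit hydrogens (hypothetical data).
water = {
    "atoms": [
        {"aid": 1, "element": "O"},
        {"aid": 2, "element": "H"},
        {"aid": 3, "element": "H"},
    ],
    "bonds": [
        {"aid1": 1, "aid2": 2, "order": 1},
        {"aid1": 1, "aid2": 3, "order": 1},
    ],
}

nodes, edges = graph_from_atoms_and_bonds(water)
```

Because the hydrogens are already explicit in the record, no chemistry-aware preprocessing is needed before building the graph.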

ChEMBL, on the other hand, returns a data structure without explicit hydrogens, with only a very few exceptions needed to handle tautomers, so it is basically the "H-pruned" heavy atom core. Therefore, the implicit_to_explicit_hydrogen preprocessor is needed here; some initial code for it is in implicit_to_explicit_hydrogen_preprocessor() as found in my "TUCAN playground"
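The idea of turning an "H-pruned" core back into a full graph can be sketched with a naive valence-filling rule. This is only an illustration under stated assumptions: the valence table, the function name and the (nodes, bonds) layout are hypothetical, charges and aromaticity are ignored, and it is not the code from implicit_to_explicit_hydrogen_preprocessor().

```python
# Sketch only: naive valence filling for neutral, non-aromatic atoms.
# The table and data layout are illustrative assumptions, not TUCAN code.
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}


def add_explicit_hydrogens(nodes, bonds):
    """nodes: {index: element}; bonds: [(i, j, order)] for heavy atoms only."""
    # Sum the bond orders already used by each heavy atom.
    used = {i: 0 for i in nodes}
    for i, j, order in bonds:
        used[i] += order
        used[j] += order

    new_nodes = dict(nodes)
    new_bonds = list(bonds)
    next_index = max(nodes, default=-1) + 1
    for i in sorted(nodes):
        valence = DEFAULT_VALENCE.get(nodes[i])
        if valence is None:
            continue  # element not in the table (e.g. a metal): leave as is
        # Attach one hydrogen per unused valence.
        for _ in range(max(valence - used[i], 0)):
            new_nodes[next_index] = "H"
            new_bonds.append((i, next_index, 1))
            next_index += 1
    return new_nodes, new_bonds


# Ethanol heavy-atom core: C-C-O
heavy_nodes = {0: "C", 1: "C", 2: "O"}
heavy_bonds = [(0, 1, 1), (1, 2, 1)]
full_nodes, full_bonds = add_explicit_hydrogens(heavy_nodes, heavy_bonds)
```

A rule this simple cannot cover the tautomer exceptions or metal centers mentioned above, which is why a dedicated preprocessor is needed.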

@schatzsc
Collaborator Author

Still, it seems to come down to only these few lines from the code section mentioned above:

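from rdkit import Chem


def graph_from_smiles(smiles: str):
    # _parse_molfile3000 and graph_from_moldata are defined elsewhere (not shown).
    molfile = _molfile3000_from_smiles(smiles)
    element_symbols, bonds = _parse_molfile3000(molfile)
    return graph_from_moldata(element_symbols, bonds)


def _molfile3000_from_smiles(smiles: str):
    # sanitize=False: parse the SMILES without RDKit's chemistry checks.
    m = Chem.MolFromSmiles(smiles, sanitize=False)
    if m is None:
        # MolFromSmiles returns None when the SMILES cannot be parsed.
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return Chem.MolToMolBlock(m, forceV3000=True, includeStereo=False, kekulize=False)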

Even if these lines "inherit" RDKit's issues with metals, I would argue for offering such a function for people to use at their own risk.
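One way such an "at own risk" function could coexist with a package that does not require RDKit is a lazy import, so RDKit is only needed when SMILES input is actually used. The sketch below is an assumption about how this might look; the function name and error messages are hypothetical and not part of TUCAN.

```python
# Sketch only: keeps RDKit optional by importing it inside the function.
# Function name and error messages are hypothetical, not TUCAN's API.

def molfile3000_from_smiles(smiles: str) -> str:
    try:
        from rdkit import Chem  # optional dependency, imported lazily
    except ImportError as exc:
        raise ImportError(
            "SMILES input requires RDKit, which is not installed"
        ) from exc

    # sanitize=False: skip RDKit's chemistry checks while parsing.
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return Chem.MolToMolBlock(
        mol, forceV3000=True, includeStereo=False, kekulize=False
    )
```

With this layout, users without RDKit get a clear error only when they call the SMILES path, and every other input route stays free of the dependency.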

@schatzsc
Collaborator Author

This is one of the things I forgot in the recent discussion - it would be nice to also have SMILES as input for TUCAN, which can be done with the code fragment above, using RDKit to convert the SMILES string to a V3000 molfile.

@rapodaca

I'm curious - aside from the issue with metal complexes, why was RDKit removed?

@schatzsc
Collaborator Author

schatzsc commented Jan 12, 2023

Good question - I don't really remember the answer anymore, since this modification was made more than 6 months ago, but maybe Jan can give feedback.

My best guess is that it was the last dependency on RDKit that we had in TUCAN. On the one hand, we did not need RDKit for anything else anymore, so removing it simplified the dependencies; on the other hand, "the issue with the metal complexes" is of course not a minor one.

We had a long developers' meeting today with some new people joining, and we will further formalize and harmonize the different input variants in the upcoming months. We also plan PubChem, ChEMBL and CSD interfaces as well as ORCA and Gaussian computational chemistry file formats as input, plus a lot of other interesting stuff (-:=

On that occasion - you are really missed on Twitter; I just realized today that there were new posts in your blog ...
