Will has_exact_match care about trivial hydrogens or chiral centers? #102

Closed
Boxylmer opened this issue Aug 27, 2023 · 2 comments

@Boxylmer
Contributor

I'm working on a possible way to automatically find "important" functional groups within a set of SMILES strings. This involves...

  1. Scanning the dataset for all atoms along with their hybridization and aromaticity, which are lumped into tokens (most sets have 23-30 unique tokens).
  2. Generating all possible fragments of these tokens of size N (typically 3).
  3. Counting every occurrence of each fragment within the dataset, including overlapping ones.

I've got the generation working, but I can't search quickly yet, and I'd like to find out which method is most appropriate for this. Given that I can generate the fragments manually to avoid iterating through SMILES or SMARTS queries, which search function should I use?
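Step 2 of the list above can be sketched in plain Julia. This is only an illustration under stated assumptions: tokens are represented as small integers, `linear_fragments` and `canonical` are hypothetical names, and a linear fragment read backwards is treated as the same fragment (an assumption, not something stated in this issue).

```julia
# Hypothetical helper: a linear fragment and its reverse are treated as the
# same fragment, so keep whichever of the two sorts first.
canonical(frag::NTuple{N,T}) where {N,T} = min(frag, reverse(frag))

# Enumerate every length-n sequence over the token vocabulary and collapse
# each sequence with its reverse. Returns a Set of canonical n-tuples.
function linear_fragments(tokens::Vector{T}, n::Int) where {T}
    seen = Set{NTuple{n,T}}()
    for combo in Iterators.product(ntuple(_ -> tokens, n)...)
        push!(seen, canonical(combo))
    end
    return seen
end

# Three placeholder tokens, fragments of size 3.
frags = linear_fragments([1, 2, 3], 3)
println(length(frags))  # 18: 9 palindromes plus 9 reversed pairs
```

With k tokens and N = 3 this gives (k^3 + k^2) / 2 candidates, so a vocabulary of 30 tokens yields 13,950 fragments to search for.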

@mojaie
Owner

mojaie commented Nov 12, 2023

I apologize for the very late reply.
Sorry if I've misunderstood your question, but would this be similar to the "Functional group analysis" task in the following tutorial?

https://mojaie.github.io/MolecularGraph.jl_notebook/substructure_and_query.jl.html

I think the only way to do this is to iterate through the whole dataset, as you mentioned, at least in MolecularGraph.jl. I'm also interested in this field, and there may be some room to improve the performance of the substructure search algorithms.

@Boxylmer
Contributor Author

No issues on the delay! We're all busy and I really appreciate this project and the work you've put into it.

Background to this mini project: I want to see if arbitrarily fragmenting molecules can allow me to do data augmentation through building "functional group graphs" instead of graphs of atoms. This way, I can have multiple "functional group graphs" generated from the same molecule that has a property associated with it.

The atom token needs to be cheap to build and compare, so that it can be used in quick comparisons and as the building block for the functional groups made up of these tokens.

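```julia
# Build a compact token for atom `idx` from its element, aromaticity, and hybridization.
# `hybridization_symbol_to_int` is a helper (not shown) that maps the hybridization symbol to an integer code.
function AtomToken(mol::SMILESMolGraph, idx::Integer)
    aromaticity = is_aromatic(mol)[idx]
    atomic_number = UInt8(atomnumber(atomsymbol(mol)[idx]))
    hybrid = UInt8(hybridization_symbol_to_int(hybridization(mol)[idx]))
    return AtomToken(atomic_number, aromaticity, hybrid)
end
```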
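For readers following along, a token type consistent with the constructor call above might look like the sketch below. The field names and types are assumptions inferred from the arguments passed to `AtomToken(atomic_number, aromaticity, hybrid)`; the actual definition is not shown in this issue.

```julia
# Assumed layout: three small fields, so the struct is a plain bits type.
struct AtomToken
    atomic_number::UInt8
    aromaticity::Bool
    hybrid::UInt8
end

a = AtomToken(6, true, 2)   # e.g. an aromatic carbon with hybridization code 2
b = AtomToken(6, true, 2)
c = AtomToken(7, false, 3)

# Immutable bits types compare and hash by value without extra code,
# so tokens can be used directly as Dict keys.
counts = Dict{AtomToken,Int}()
for t in (a, b, c)
    counts[t] = get(counts, t, 0) + 1
end
```

Because the struct is immutable and holds only primitive fields, equality and hashing work by value out of the box, which is what makes it usable for quick comparisons.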

I wasn't able to generate arbitrary SMARTS with this, so instead I constructed my own graph of these tokens and wrote a graph search that finds every instance of a linear group. Restricting the search to linear groups is a concession that simplifies the subgraph search significantly, but it also means that only groups of size n=3 make sense, because branched subgroups become possible at n=4.
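A minimal sketch of such a linear search for n=3, in plain Julia, is shown below. It is not the implementation described above: the function name `count_linear_triples`, the integer tokens, and the adjacency-list input are all assumptions made for illustration. Every atom is taken in turn as the middle of a path, and each unordered pair of its neighbors contributes one occurrence, so overlapping fragments are counted.

```julia
# `tokens[i]` is the token of atom i; `neighbors[i]` lists the atoms bonded to atom i.
# Returns a Dict mapping each canonical 3-token fragment to its occurrence count.
function count_linear_triples(tokens::Vector{T}, neighbors::Vector{Vector{Int}}) where {T}
    counts = Dict{NTuple{3,T},Int}()
    for center in eachindex(tokens)
        nbrs = neighbors[center]
        # Each unordered pair of neighbors forms one path a - center - c.
        for i in 1:length(nbrs)-1, j in i+1:length(nbrs)
            frag = (tokens[nbrs[i]], tokens[center], tokens[nbrs[j]])
            # Assumption: a fragment and its reverse are the same fragment.
            key = min(frag, reverse(frag))
            counts[key] = get(counts, key, 0) + 1
        end
    end
    return counts
end

# Atom 1 (token 1) bonded to atoms 2, 3, 4 (tokens 2, 2, 3).
tokens = [1, 2, 2, 3]
neighbors = [[2, 3, 4], [1], [1], [1]]
result = count_linear_triples(tokens, neighbors)
```

In this example the fragment `(2, 1, 3)` is found twice because two different paths through atom 1 share it, while `(2, 1, 2)` is found once.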
