Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flesh out WeaveFeaturization #128

Open
wants to merge 14 commits into
base: main
Choose a base branch
from
Open

Flesh out WeaveFeaturization #128

wants to merge 14 commits into from

Conversation

DhairyaLGandhi
Copy link
Member

@DhairyaLGandhi DhairyaLGandhi commented Nov 29, 2021

This is based on the weve featurization from the weave implementation in deepchem, but not exactly a direct port.

I'm not sure I love the way we map from species feature names to the MolecularGraph functions here, seems like that would make it very manual to "just" use some feature provided by MolecularGraph.

There are still a couple of TODOs, one of them being fleshing out both atom_features as well as pair_features, but before I implement them, I wanted to ask whether we would be fine with something like a FeaturizedMol (similar to FeaturizedAtom, it can hold featurized atom, bond and pair info), or is there a different method I should look into?

@rkurchin
Copy link
Member

rkurchin commented Dec 1, 2021

The result should still be a FeaturizedAtoms object, it would just have different type parameters and the content of the encoded_features field would be more complicated.

@@ -48,6 +48,7 @@ function SpeciesFeatureDescriptor(name::String)
else
# TODO: figure out default binning situation for continuous-valued SFD's
#codec = OneHotOneCold(false, )
codec = DirectCodec(1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this actually make sense as a default?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt it, but I needed something there for the time being. Ideally we would have have to get a number of bins from the user or guess. I don't think OneHotOneCold would work since that implies categorical variable, which continuous variables are not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OneHotOneCold works for categorical variables, you just have to specify bins. That's what we've been doing all along since the DirectCodec was just added...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. DataFrames does some kind of trickery to inspect whether a column is categorical or notz but I don't think we need to be doing anything too sophisticated at this stage. I was thinking assuming something is a continuous variable as the default would be reasonable and if a feature is categorical that would need to be described explicitly.

default_bond_feature_list = ["bondorder", "isaromaticbond", "isringbond"])
species_feats = intersect(keys(Utils.SpeciesFeatureUtils.sfd_names_props), default_atom_feature_list)

atoms = SpeciesFeatureDescriptor.(species_feats)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit of a placeholder, since we would want to get both the species features and the atom features from different arrays. Would it be fine to split those out?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean how this is clunky. What do you mean by "split them out," though? Just split the species features and element features into separate lists?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right, explicitly have different lists for different feature descriptors. We can concatenate the emergent encodings at either the encode step or the featurize step.

@codecov
Copy link

codecov bot commented Dec 1, 2021

Codecov Report

Merging #128 (af601ea) into main (0c8f2b7) will decrease coverage by 3.31%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #128      +/-   ##
==========================================
- Coverage   80.73%   77.42%   -3.32%     
==========================================
  Files          20       20              
  Lines         571      598      +27     
==========================================
+ Hits          461      463       +2     
- Misses        110      135      +25     
Impacted Files Coverage Δ
src/ChemistryFeaturization.jl 80.00% <ø> (ø)
src/features/speciesfeature.jl 76.19% <0.00%> (ø)
src/featurizations/weavefeaturization.jl 0.00% <0.00%> (ø)
src/utils/orbitalfeature_utils.jl 97.43% <0.00%> (+0.02%) ⬆️
src/atoms/atomgraph.jl 76.25% <0.00%> (+0.30%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0c8f2b7...af601ea. Read the comment docs.

@DhairyaLGandhi
Copy link
Member Author

If we have missing (and in general Union eltypes) in the FeaturizedAtoms we might lose out on the BLAS calls and be slower if we fall back to generic methods, which I'd love to avoid. We might not be, but we might have to benchmark to know how that looks.

@DhairyaLGandhi
Copy link
Member Author

DhairyaLGandhi commented Dec 2, 2021

Where are the missings used ultimately?

@rkurchin
Copy link
Member

rkurchin commented Dec 6, 2021

The missings just indicate that that bond feature isn't defined for that combination of atoms (because they're not bonded to each other).

@rkurchin rkurchin linked an issue Dec 9, 2021 that may be closed by this pull request
3 tasks
Comment on lines +16 to +29
function WeaveFeaturization(element_feature_list = ["Atomic no"],
species_feature_list = ["degree",
"implicithconnected",
"charge",
"radical_electrons",
"hybridization",
"isaromatic",
"hydrogenconnected"],
bond_feature_list = ["bondorder", "isaromaticbond", "isringbond"])
elements = ElementFeatureDescriptor.(element_feature_list)
species = SpeciesFeatureDescriptor.(species_feature_list)
bonds = BondFeatureDescriptor.(bond_feature_list)
# pairs = PairFeatureDescriptor.(default_atom_feature_list)
WeaveFeaturization(elements, species, bonds, bonds)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
function WeaveFeaturization(element_feature_list = ["Atomic no"],
species_feature_list = ["degree",
"implicithconnected",
"charge",
"radical_electrons",
"hybridization",
"isaromatic",
"hydrogenconnected"],
bond_feature_list = ["bondorder", "isaromaticbond", "isringbond"])
elements = ElementFeatureDescriptor.(element_feature_list)
species = SpeciesFeatureDescriptor.(species_feature_list)
bonds = BondFeatureDescriptor.(bond_feature_list)
# pairs = PairFeatureDescriptor.(default_atom_feature_list)
WeaveFeaturization(elements, species, bonds, bonds)
function WeaveFeaturization(
element_feature_list = ["Atomic no"],
species_feature_list = [
"degree",
"implicithconnected",
"charge",
"radical_electrons",
"hybridization",
"isaromatic",
"hydrogenconnected",
],
bond_feature_list = ["bondorder", "isaromaticbond", "isringbond"],
)
elements = ElementFeatureDescriptor.(element_feature_list)
species = SpeciesFeatureDescriptor.(species_feature_list)
bonds = BondFeatureDescriptor.(bond_feature_list)
# pairs = PairFeatureDescriptor.(default_atom_feature_list)
WeaveFeaturization(elements, species, bonds, bonds)

end

function atom_features(feat, mol; kw...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change

Comment on lines +49 to +93
"C",
"N",
"O",
"S",
"F",
"Si",
"P",
"Cl",
"Br",
"Mg",
"Na",
"Ca",
"Fe",
"As",
"Al",
"I",
"B",
"V",
"K",
"Tl",
"Yb",
"Sb",
"Sn",
"Ag",
"Pd",
"Co",
"Se",
"Ti",
"Zn",
"H", # H?
"Li",
"Ge",
"Cu",
"Au",
"Ni",
"Cd",
"In",
"Mn",
"Zr",
"Cr",
"Pt",
"Hg",
"Pb",
"Unknown"
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
"C",
"N",
"O",
"S",
"F",
"Si",
"P",
"Cl",
"Br",
"Mg",
"Na",
"Ca",
"Fe",
"As",
"Al",
"I",
"B",
"V",
"K",
"Tl",
"Yb",
"Sb",
"Sn",
"Ag",
"Pd",
"Co",
"Se",
"Ti",
"Zn",
"H", # H?
"Li",
"Ge",
"Cu",
"Au",
"Ni",
"Cd",
"In",
"Mn",
"Zr",
"Cr",
"Pt",
"Hg",
"Pb",
"Unknown"
]
"C",
"N",
"O",
"S",
"F",
"Si",
"P",
"Cl",
"Br",
"Mg",
"Na",
"Ca",
"Fe",
"As",
"Al",
"I",
"B",
"V",
"K",
"Tl",
"Yb",
"Sb",
"Sn",
"Ag",
"Pd",
"Co",
"Se",
"Ti",
"Zn",
"H", # H?
"Li",
"Ge",
"Cu",
"Au",
"Ni",
"Cd",
"In",
"Mn",
"Zr",
"Cr",
"Pt",
"Hg",
"Pb",
"Unknown",
]

Comment on lines +99 to +101
atom_features
bond_features
pair_features
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
atom_features
bond_features
pair_features
atom_features::Any
bond_features::Any
pair_features::Any

Comment on lines +104 to +114
function encode(fzn::WeaveFeaturization, ag::AtomGraph; atom_feature_kwargs = (;),
bond_feature_kwargs = (;),
pair_feature_kwargs = (;))
sf = mapreduce(x -> encode(x, ag, atom_feature_kwargs...), vcat, fzn.atom_features)
ef = mapreduce(x -> encode(x, ag, atom_feature_kwargs...), vcat, fzn.element_features)
atom_and_elements = vcat(sf, ef)
bf = cat(map(x -> encode(x, ag, bond_feature_kwargs...), fzn.bond_features)..., dims = 3)
pf = cat(map(x -> encode(x, ag, pair_feature_kwargs...), fzn.pair_features)..., dims = 3)
# Return FeaturizedAtoms here
atom_and_elements, vcat(bf, pf)
# FeaturizedWeave(atom_and_elements, bf, pf)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
function encode(fzn::WeaveFeaturization, ag::AtomGraph; atom_feature_kwargs = (;),
bond_feature_kwargs = (;),
pair_feature_kwargs = (;))
sf = mapreduce(x -> encode(x, ag, atom_feature_kwargs...), vcat, fzn.atom_features)
ef = mapreduce(x -> encode(x, ag, atom_feature_kwargs...), vcat, fzn.element_features)
atom_and_elements = vcat(sf, ef)
bf = cat(map(x -> encode(x, ag, bond_feature_kwargs...), fzn.bond_features)..., dims = 3)
pf = cat(map(x -> encode(x, ag, pair_feature_kwargs...), fzn.pair_features)..., dims = 3)
# Return FeaturizedAtoms here
atom_and_elements, vcat(bf, pf)
# FeaturizedWeave(atom_and_elements, bf, pf)
function encode(
fzn::WeaveFeaturization,
ag::AtomGraph;
atom_feature_kwargs = (;),
bond_feature_kwargs = (;),
pair_feature_kwargs = (;),
)
sf = mapreduce(x -> encode(x, ag, atom_feature_kwargs...), vcat, fzn.atom_features)
ef = mapreduce(x -> encode(x, ag, atom_feature_kwargs...), vcat, fzn.element_features)
atom_and_elements = vcat(sf, ef)
bf =
cat(map(x -> encode(x, ag, bond_feature_kwargs...), fzn.bond_features)..., dims = 3)
pf =
cat(map(x -> encode(x, ag, pair_feature_kwargs...), fzn.pair_features)..., dims = 3)
# Return FeaturizedAtoms here
atom_and_elements, vcat(bf, pf)
# FeaturizedWeave(atom_and_elements, bf, pf)

Comment on lines 118 to 123
println(io, "WeaveFeaturization(")
println(io)
println(io, " Species Features: $(map(x -> x.name, fzn.element_features))")
println(io, " Atom Features: $(map(x -> x.name, fzn.atom_features))")
println(io, " Bond Features: $(map(x -> x.name, fzn.bond_features))")
println(io, ")")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
println(io, "WeaveFeaturization(")
println(io)
println(io, " Species Features: $(map(x -> x.name, fzn.element_features))")
println(io, " Atom Features: $(map(x -> x.name, fzn.atom_features))")
println(io, " Bond Features: $(map(x -> x.name, fzn.bond_features))")
println(io, ")")
println(io, "WeaveFeaturization(")
println(io)
println(io, " Species Features: $(map(x -> x.name, fzn.element_features))")
println(io, " Atom Features: $(map(x -> x.name, fzn.atom_features))")
println(io, " Bond Features: $(map(x -> x.name, fzn.bond_features))")
println(io, ")")

end

function bond_feature(bond, mol; kw...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change

Comment on lines +122 to +126
println(io, "WeaveFeaturization(")
println(io, " Species Features: $(map(x -> x.name, fzn.element_features))")
println(io, " Atom Features: $(map(x -> x.name, fzn.atom_features))")
println(io, " Bond Features: $(map(x -> x.name, fzn.bond_features))")
println(io, ")")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
println(io, "WeaveFeaturization(")
println(io, " Species Features: $(map(x -> x.name, fzn.element_features))")
println(io, " Atom Features: $(map(x -> x.name, fzn.atom_features))")
println(io, " Bond Features: $(map(x -> x.name, fzn.bond_features))")
println(io, ")")
println(io, "WeaveFeaturization(")
println(io, " Species Features: $(map(x -> x.name, fzn.element_features))")
println(io, " Atom Features: $(map(x -> x.name, fzn.atom_features))")
println(io, " Bond Features: $(map(x -> x.name, fzn.bond_features))")
println(io, ")")

@rkurchin
Copy link
Member

rkurchin commented Jan 7, 2022

JFYI, this should eventually be part of WeaveModel under the new organization, so if you want to make it into a PR over there now feel free, or if you'd rather wait until the reorg is actually registered, that's fine too.

@rkurchin rkurchin force-pushed the main branch 4 times, most recently from dea0c7a to 2b41b38 Compare February 24, 2022 21:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

finish updating WeaveModel-related stuff for restructure
2 participants