Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MappingSetDataFrame issues when no converter passed #513

Open
8 tasks
joeflack4 opened this issue Apr 6, 2024 · 11 comments
Open
8 tasks

MappingSetDataFrame issues when no converter passed #513

joeflack4 opened this issue Apr 6, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@joeflack4
Copy link
Collaborator

joeflack4 commented Apr 6, 2024

Overview

I just noticed several issues with that arise when using MappingSetDataFrame and not passing any Converter along with it.

Sub-issues

  • 1. Performance
  • 2. Incorrect curie_map
  • 3. UX: Should correctly automatically instantiate Converter

Sub-issue details

1. Performance

  • With converter msdf = MappingSetDataFrame(converter=converter, df=df, metadata=metadata): Takes 0:00:00.000008 seconds
  • No converter: msdf = MappingSetDataFrame(df=df, metadata=metadata): Takes 33.206355 seconds

2. Incorrect curie_map

FYI:

2.1. Missing entries

In my curie_map in my metadata dictionary, I have some prefixes (e.g. icd11.foundation that do not show up in the output file.

2.1.1. Add test

Prepare test w/ input file(s) with one row & Python code used to invoke. Possibly another test substituting the Python code for a command, if that is possible.

2.2. Too many entries

Compare the following results.

  1. n=4. This is the amount of entries I want, and it is the amount that I get when using some custom code I wrote. ordo-icd11.sssom - joes ad hoc way.tsv.zip
  2. n=10. This is the amount of entries I get when first instantiating a Converter and passing that to MappingSetDataFrame. I also mentioned this in UX: Extra entries in curie_map of written TSV #514. Can be addressed by .clean_prefix_map(): remove_unused_builtins param #537. See example: ordo-icd11.sssom - with converter.tsv.zip
  3. n=1,547. This is the amount of entries I get when instantiating a MappingSetDataFrame passing in my metadata but no Converter. See example: ordo-icd11.sssom - no converter.tsv.zip
2.2.1. Add test

3. UX: Should automatically instantiate Converter

I haven't fully thought through possible downsides of this, but I recommend that when MappingSetDataFrame is instantiated using only a df and metadata, but no converter, it should instantiate the converter via curies.Converter.from_prefix_map(metadata['curie_map']).

@matentzn
Copy link
Collaborator

matentzn commented Apr 7, 2024

Thank you for the issue, responses will be a bit piecemeal:

In my curie_map in my metadata dictionary, I have some prefixes (e.g. icd11.foundation that do not show up in the output file.

Is this after msdf.clean()? If so, is there at least 1 entity in the data frame that has that prefix?

n=10. This is the amount of entries I get when first instantiating a Converter and passing that to MappingSetDataFrame.

Are you sure the remaining 6 are not the build in sssom prefixes?

No converter: msdf = MappingSetDataFrame(df=df, metadata=metadata): Takes 33.206355 seconds

Wow - long time!

@joeflack4
Copy link
Collaborator Author

joeflack4 commented Apr 7, 2024

i. Missing prefix

Joe:

In my curie_map in my metadata dictionary, I have some prefixes (e.g. icd11.foundation that do not show up in the output file.

Nico:

Is this after msdf.clean()? If so, is there at least 1 entity in the data frame that has that prefix?

This happens whether or not I call msdf.clean_prefix_map() or msdf.clean_context(). And yep, every row in the data frame has the prefix.

ii. Built-in SSSOM prefixes

Yep, they are. My preference is to not have them there, but I can understand why we'd want to include them; maybe it is better.

How about this. Can we consider adding a remove_unused_builtins param to msdf.clean_prefix_map() for people like me? Really low priority anyway.

@matentzn
Copy link
Collaborator

matentzn commented Apr 8, 2024

This happens whether or not I call msdf.clean_prefix_map() or msdf.clean_context(). And yep, every row in the data frame has the prefix.

i. Bug.. We need a minimal test for this. Can you prepare one (input file(s) with one row, command used to invoke)?

How about this. Can we consider adding a remove_unused_builtins param to msdf.clean_prefix_map() for people like me? Really low priority anyway.

ii. Can you make a Documentation request in https://github.com/mapping-commons/sssom/issues (model repo), asking for clarification if built-in prefixes are required to appear in the metadata or just recommended? If it ends up as recommended, I agree with you, we will add your flag.

@joeflack4
Copy link
Collaborator Author

joeflack4 commented Apr 8, 2024

(i) Will do! FYI though, I'm not running a command. This is the Python API where this is a problem. Though perhaps this problem can also arise in the CLI.

(ii) Will do!

(iii) Instantiating Converter correctly
In #514 you wrote that the Converter should already be being instantiated automatically by MappingSetDataFrame when it is not explicitly passed. Perhaps that is the case, but something appears to be wrong with that instantiation. My thought (sub-issue (3) in the OP) is that this should happen early on in the class's construction, and that it should do curies.Converter.from_prefix_map(metadata['curie_map']).

@matentzn
Copy link
Collaborator

matentzn commented Apr 9, 2024

Look at the default values: https://github.com/mapping-commons/sssom-py/blob/master/src/sssom/util.py#L96

By default, it is constructed with a converter! At least if I understand the dataclass syntax correctly.

@joeflack4
Copy link
Collaborator Author

joeflack4 commented Apr 9, 2024

I don't know enough about dataclasses to be a huge help here, but...

We know that it is not instantiating and/or using the instantiated Converter correctly because:

  1. The results (e.g. serialized curie_map) are different when in one case you pass the Converter, and in another case you don't.
    Results in both (i) entries missing, and (ii) entries added. I updated the OP such that we can write a test for both of these cases.
  2. It takes longer to create the msdf when you don't pass the converter.
    Precisely 4,150,794.375 times longer in my 1 recent case.

@lru_cache(1)
def get_converter() -> Converter:
"""Get a converter."""
return curies.chain([_get_built_in_prefix_map(), _get_default_converter()])

I'd suggest possibly:

Add param:

curie_map: Dict = None

Update:

return curies.Converter.from_prefix_map(metadata['curie_map']) if curie_map \
    else curies.chain([_get_built_in_prefix_map(), _get_default_converter()])

Delete: @lru_cache(1)
Though you could move else curies.chain([_get_built_in_prefix_map(), _get_default_converter()] into a separate function and cache that, but it should be pretty fast so I don't know that the cache is very useful, and IDK if it might create problems. Probably not, but if the performance gains are nominal, then I'd still suggest dropping it.

Then it looks like this needs to be updated to pass metadata['curie_map'] if 'curie_map' in metadata. IDK how but I'd imagine ChatGPT does.

@matentzn
Copy link
Collaborator

Unfortunately we dont have Harshad for a while now, so we will have to sort this ourselves.

I am not enthusiastic about any of your proposed solutions, and the problem space is too large for me to reason over in the short time I have per issue.

Do you agree that by far the most critical issue you mention here is:

In my curie_map in my metadata dictionary, I have some prefixes (e.g. icd11.foundation that do not show up in the output file.

? Can you add a test to https://github.com/mapping-commons/sssom-py/blob/master/tests/test_utils.py that demonstrates the problem?

@joeflack4
Copy link
Collaborator Author

joeflack4 commented Apr 10, 2024

I understand this isn't something you can tackle deeply right now.

We could also do:

convs = [_get_built_in_prefix_map(), _get_default_converter()]
if curie_map:
    convs = [curies.Converter.from_prefix_map(metadata['curie_map'])] + convs
return curies.chain(convs)

Yep, adding a test is in the OP tasks / on my schedule.

@joeflack4
Copy link
Collaborator Author

Nico wrote:

Can you make a Documentation request...

Done!

@joeflack4
Copy link
Collaborator Author

joeflack4 commented May 22, 2024

Built-in SSSOM prefixes

@matentzn Given that these built-ins are not required, do you mind if I make an issue for this suggestion?:

Can we consider adding a remove_unused_builtins param to msdf.clean_prefix_map() for people like me? Really low priority anyway.

edit: Created the issue:

@matentzn
Copy link
Collaborator

Sure, feel free to make an issue, if the PR is super small, I we can release it fairly quickly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants