
Convert SSSOM TSV => JSON => TSV and confirm the MappingSetDataFrame object remains consistent. #429

Closed
wants to merge 3 commits into from

Conversation

hrshdhgd
Contributor

@hrshdhgd hrshdhgd commented Sep 26, 2023

Fixes mapping-commons/sssom#321

Given a simple SSSOM TSV:

#curie_map:
#  FBbt: "http://purl.obolibrary.org/obo/FBbt_"
#  ORCID: "https://orcid.org/"
#  UBERON: "http://purl.obolibrary.org/obo/UBERON_"
#creator_id:
#  - "ORCID:0000-0002-6095-8718"
subject_id	subject_label	predicate_id	object_id	mapping_justification
FBbt:00000001	organism	semapv:crossSpeciesExactMatch	UBERON:0000468	semapv:ManualMappingCuration
  • I first parse it using parse_sssom_table()
  • msdf1.clean_prefix_map()
  • Convert to JSON using to_json() and write to file
  • Parse again using parse_sssom_json()
  • msdf2.clean_prefix_map()

The prefix maps of both should be the same.
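The round-trip steps above depend on the curie map embedded in the TSV header comment block. As a simplified, self-contained illustration (deliberately naive, not the actual sssom-py parser), the `#curie_map:` block can be read like this:

```python
# Simplified sketch (NOT the sssom-py parser): extract the embedded
# curie map from the commented header of an SSSOM TSV.
tsv_header = """\
#curie_map:
#  FBbt: "http://purl.obolibrary.org/obo/FBbt_"
#  ORCID: "https://orcid.org/"
#  UBERON: "http://purl.obolibrary.org/obo/UBERON_"
"""

def parse_curie_map(header: str) -> dict:
    """Naively parse the '#curie_map:' block into a prefix -> URI dict."""
    curie_map = {}
    in_block = False
    for line in header.splitlines():
        stripped = line.lstrip("#").strip()
        if stripped == "curie_map:":
            in_block = True
            continue
        if in_block and ":" in stripped:
            # Split on the first colon only, so URIs like "https://..." survive
            prefix, _, uri = stripped.partition(":")
            curie_map[prefix.strip()] = uri.strip().strip('"')
        else:
            in_block = False
    return curie_map

print(parse_curie_map(tsv_header))
```

Whatever the real parser does, the invariant is the same: every prefix declared here (including ORCID) must still be present after the TSV => JSON => TSV round trip.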

@hrshdhgd
Contributor Author

hrshdhgd commented Sep 26, 2023

At the moment this fails. Here's the reason why:

Here's the test code:

def test_tsv_to_json_and_back(self):
    """Test converting SSSOM TSV => JSON => SSSOM TSV such that it is reproducible."""
    sample_tsv = f"{test_data_dir}/sample1.sssom.tsv"
    json_outfile = f"{test_out_dir}/sample1.json"
    msdf1 = parse_sssom_table(sample_tsv)
    msdf1.clean_prefix_map()
    json_doc = to_json(msdf1)
    self.assertEqual(msdf1.prefix_map, json_doc["@context"])
    with open(json_outfile, "w") as file:
        write_json(msdf1, file)
    msdf2 = parse_sssom_json(json_outfile)
    msdf2.clean_prefix_map()
    self.assertEqual(msdf1.prefix_map, msdf2.prefix_map)

msdf1.prefix_map looks like this:

{
'owl': 'http://www.w3.org/2002/07/owl#', 
'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#', 
'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
'semapv': 'https://w3id.org/semapv/vocab/', 
'skos': 'http://www.w3.org/2004/02/skos/core#',
'sssom': 'https://w3id.org/sssom/', 
'FBbt': 'http://purl.obolibrary.org/obo/FBbt_', 
'ORCID': 'https://orcid.org/', 
'UBERON': 'http://purl.obolibrary.org/obo/UBERON_'
}

whereas msdf2.prefix_map looks like this

{
'owl': 'http://www.w3.org/2002/07/owl#', 
'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#', 
'rdfs': 'http://www.w3.org/2000/01/rdf-schema#', 
'semapv': 'https://w3id.org/semapv/vocab/', 
'skos': 'http://www.w3.org/2004/02/skos/core#', 
'sssom': 'https://w3id.org/sssom/', 
'FBbt': 'http://purl.obolibrary.org/obo/FBbt_', 
'UBERON': 'http://purl.obolibrary.org/obo/UBERON_'
}

The entry 'ORCID': 'https://orcid.org/' is missing from msdf2.prefix_map.

In the clean_prefix_map() function, the prefixes_in_table variable is the same for both msdf1 and msdf2: {'owl', 'rdfs', 'skos', 'sssom', 'UBERON', 'ORCID', 'semapv', 'rdf', 'FBbt'}. In spite of that, the self.prefix_map generated for each is different.

self.prefix_map = dict(subconverter.bimap)

This line is where the ORCID prefix gets lost:

subconverter = self.converter.get_subconverter(prefixes_in_table)
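As a simplified model of what the subconverter step does (this is an illustration, not the actual curies implementation), filtering by prefixes_in_table can only keep prefixes the converter still knows about. So if ORCID was already dropped during JSON parsing, the filtering step cannot restore it:

```python
# Simplified model (not curies.Converter.get_subconverter itself):
# a "subconverter" keeps only the requested prefixes that the
# converter actually knows about.
def get_submap(prefix_map: dict, prefixes: set) -> dict:
    return {p: uri for p, uri in prefix_map.items() if p in prefixes}

prefixes_in_table = {"FBbt", "ORCID", "UBERON"}

# msdf1's converter still knows ORCID ...
map1 = {
    "FBbt": "http://purl.obolibrary.org/obo/FBbt_",
    "ORCID": "https://orcid.org/",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
}
# ... but if JSON parsing already dropped ORCID from msdf2's converter,
# filtering by prefixes_in_table cannot bring it back:
map2 = {
    "FBbt": "http://purl.obolibrary.org/obo/FBbt_",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
}

assert "ORCID" in get_submap(map1, prefixes_in_table)
assert "ORCID" not in get_submap(map2, prefixes_in_table)
```

This suggests the loss happens upstream of clean_prefix_map(), in how the JSON parser populates the converter.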

@cthoyt , would you have any idea why this would be happening?

cc: @matentzn , @gouttegd

@hrshdhgd hrshdhgd changed the title Convert SSSOM TSV => JSON => TSV and confirm the MappingSetDataFrame remains consistent. Convert SSSOM TSV => JSON => TSV and confirm the MappingSetDataFrame object remains consistent. Sep 26, 2023
@@ -212,21 +212,28 @@ def parse_sssom_table(
 logging.info(f"Externally provided metadata {k}:{v} is added to metadata set.")
 sssom_metadata[k] = v
 meta = sssom_metadata

-if "curie_map" in sssom_metadata:
+if CURIE_MAP in sssom_metadata:
Member

I have rewritten all of this code using curies already and am waiting for #426 (ready) then #401 (needs some more work) to get merged. I think it's better that you reduce effort on re-writing it again, since the curies solution is much cleaner and more elegant.

Member

I split the code for updating this into #431

# This takes priority over default prefix_map in case of a tie.
jsondoc_prefix_map = jsondoc["@context"]

# Convert keys in both maps to lower case for comparison
Member

Is this wise? I think we should not be doing all of this lowercasing. The prefix maps should be correct as they are. Doing all of these ad-hoc operations makes sssom-py insanely hard to work with (I can tell you this from first hand experience since I have tried to understand the whole history of the package)

Member

@cthoyt cthoyt left a comment

Please wait on making progress with this until the PRs that replace a lot of the same code with curies are merged - hopefully this encourages much more consistent and elegant handling of prefix maps and reduces the number of changes to address the issue here

"""Test converting SSSOM TSV => JSON => SSSOM TSV such that it is reproducible."""
sample_tsv = f"{test_data_dir}/sample1.sssom.tsv"
json_outfile = f"{test_out_dir}/sample1.json"
msdf1 = parse_sssom_table(sample_tsv)
Member

@hrshdhgd other hints: when writing tests, make the things you're testing more explicit. The way this test is written, it doesn't really test anything specific. Why not explicitly check that ORCID is in the places you expect it to be? Otherwise, someone reading this test learns nothing about what's actually supposed to be happening here.

write_json(msdf1, file)

msdf2 = parse_sssom_json(json_outfile)
self.assertIn("ORCID", msdf2.prefix_map)
Member

I added some more explicit tests so you can see where things are actually going wrong, which appears to be in the SSSOM JSON parsing, and has nothing to do with cleaning the prefix map like I suggested in a previous comment

cthoyt added a commit that referenced this pull request Sep 26, 2023
This provides an alternative to #429 that makes more explicit the chaining operations done on the metadata and prefix maps

This is also a good change to carefully document the way that this is handled, since I might not have captured it accurately
@hrshdhgd
Contributor Author

hrshdhgd commented Sep 26, 2023

@cthoyt , thanks for the insight! This branch was just to investigate the issue Damien pointed out (hence the draft). As for the lowercasing, I don't like that solution at all. I was just trying to understand the issue better so I could explain it to someone else via code (you and Nico). Also, I assume these files will change once your PRs are merged so I have no intention to continue this branch. May start afresh or merge main into this branch first IF the problem persists.

cthoyt added a commit that referenced this pull request Oct 2, 2023
Closes #363 (final nail in the coffin)

This provides an alternative to
#429 that makes more
explicit the chaining operations done on the metadata and prefix maps.
This is also a good change to carefully document the way that this is
handled, since I might not have captured it accurately. As it is, the
priority order for combining prefix maps is:

1. Internal prefix map inside the document
2. Prefix map passed through this function inside the ``meta``
3. Prefix map passed through this function to ``prefix_map``
4. Default prefix map (handled with ensure_converter)
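Under the assumption that later sources only fill in prefixes missing from earlier ones, that priority order can be sketched as a simple dict merge where earlier sources win ties (illustrative values, not sssom-py's real defaults):

```python
# Sketch of the priority order described above: merge prefix maps so
# that earlier (higher-priority) sources win ties.
def merge_with_priority(*maps: dict) -> dict:
    merged: dict = {}
    for m in maps:  # first map has highest priority
        for prefix, uri in m.items():
            merged.setdefault(prefix, uri)
    return merged

# Illustrative maps only; these are not the actual defaults.
internal = {"ORCID": "https://orcid.org/"}
meta = {"ORCID": "http://example.org/wrong/", "FBbt": "http://purl.obolibrary.org/obo/FBbt_"}
explicit = {"UBERON": "http://purl.obolibrary.org/obo/UBERON_"}
default = {"skos": "http://www.w3.org/2004/02/skos/core#"}

merged = merge_with_priority(internal, meta, explicit, default)
assert merged["ORCID"] == "https://orcid.org/"  # internal map wins the tie
```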
@cthoyt
Member

cthoyt commented Oct 2, 2023

@hrshdhgd I think this is now solved with #431. Maybe you can rebase then see if you can add some more tests to make sure

Member

@cthoyt cthoyt left a comment

I am now 100% against adding any ad hoc code that works on prefixes or prefix maps into sssom-py.

If people add incorrect prefixes that aren't defined, why don't we give them appropriate warnings/errors?

Besides, this code shouldn't be necessary to address the original issue, which was something to do with incorrectly writing SSSOM files that was fixed in another PR

Collaborator

@matentzn matentzn left a comment

Can you add a failing test here that roundtrips through TSV->JSON->TSV before we do anything else? I think the problem is somewhere deeper, but perhaps the solution is somewhere along the lines of https://github.com/mapping-commons/sssom-py/pull/431/files#diff-aed80c160ecfb8db6a235671a6d4ed0ae74470301705533ff4ec2a7a36dd989fR444

@hrshdhgd hrshdhgd closed this Oct 3, 2023
@hrshdhgd hrshdhgd deleted the issue-321 branch October 3, 2023 19:49
@hrshdhgd
Contributor Author

hrshdhgd commented Oct 3, 2023

Charlie's test validates the issue:

class TestParseExplicit(unittest.TestCase):
    """This test case contains explicit tests for parsing."""

    def test_round_trip(self):
        """Explicitly test round tripping."""
        rows = [
            (
                "DOID:0050601",
                "ADULT syndrome",
                "skos:exactMatch",
                "UMLS:C1863204",
                "ADULT SYNDROME",
                "semapv:ManualMappingCuration",
                "orcid:0000-0003-4423-4370",
            )
        ]
        columns = [
            "subject_id",
            "subject_label",
            "predicate_id",
            "object_id",
            "object_label",
            "mapping_justification",
            "creator_id",
        ]
        df = pd.DataFrame(rows, columns=columns)
        msdf = MappingSetDataFrame(df=df, converter=ensure_converter())
        msdf.clean_prefix_map(strict=True)

        #: This is a set of the prefixes that explicitly are used in this
        #: example. SSSOM-py also adds the remaining builtin prefixes from
        #: :data:`sssom.context.SSSOM_BUILT_IN_PREFIXES`, which is reflected
        #: in the formulation of the test expectation below
        explicit_prefixes = {"DOID", "semapv", "orcid", "skos", "UMLS"}
        self.assertEqual(
            explicit_prefixes.union(SSSOM_BUILT_IN_PREFIXES),
            set(msdf.prefix_map),
        )

        with tempfile.TemporaryDirectory() as directory:
            directory = Path(directory)
            path = directory.joinpath("test.sssom.tsv")
            with path.open("w") as file:
                write_table(msdf, file)
            _, read_metadata = _read_pandas_and_metadata(_open_input(path))
            reconstituted_msdf = parse_sssom_table(path)

            # This tests what's actually in the file after it's written out
            self.assertEqual({CURIE_MAP, "license", "mapping_set_id"}, set(read_metadata))
            self.assertEqual(DEFAULT_LICENSE, read_metadata["license"])
            self.assertTrue(read_metadata["mapping_set_id"].startswith(f"{SSSOM_URI_PREFIX}mappings/"))
            expected_prefix_map = {
                "DOID": "http://purl.obolibrary.org/obo/DOID_",
                "UMLS": "http://linkedlifedata.com/resource/umls/id/",
                "orcid": "https://orcid.org/",
                "owl": "http://www.w3.org/2002/07/owl#",
                "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
                "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
                "semapv": "https://w3id.org/semapv/vocab/",
                "skos": "http://www.w3.org/2004/02/skos/core#",
                "sssom": "https://w3id.org/sssom/",
            }
            self.assertEqual(
                expected_prefix_map,
                read_metadata[CURIE_MAP],
            )
            # This checks that nothing funny gets added unexpectedly
            self.assertEqual(expected_prefix_map, reconstituted_msdf.prefix_map)

@matentzn
Collaborator

matentzn commented Oct 4, 2023

This test does not seem to have a JSON serialise and parse step in it, though. What we need is a test that takes Damien's test case, serialises it to JSON, then parses the JSON, then ensures that the pre-JSON and post-JSON msdf objects are the same.
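At the level of the prefix map alone, the invariant such a test should pin down is that JSON serialisation and parsing are lossless. A minimal stdlib-only sketch of that invariant (using the json module directly, not sssom-py's to_json/parse_sssom_json):

```python
import json

# Minimal illustration of the requested invariant, restricted to the
# prefix map: serialising the "@context" to JSON and parsing it back
# must not lose any entries.
prefix_map = {
    "FBbt": "http://purl.obolibrary.org/obo/FBbt_",
    "ORCID": "https://orcid.org/",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
}
doc = json.dumps({"@context": prefix_map})
roundtripped = json.loads(doc)["@context"]
assert roundtripped == prefix_map  # in particular, ORCID must survive
```

The real test would additionally assert that the full MappingSetDataFrame (mappings, metadata, and converter) is equal before and after the JSON round trip.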

@hrshdhgd hrshdhgd restored the issue-321 branch December 13, 2023 22:01
Successfully merging this pull request may close these issues.

JSON serialisation format is woefully underspecified