
Convert SSSOM TSV => JSON => TSV and confirm the MappingSetDataFrame object remains consistent. #429

Closed
wants to merge 3 commits into from

Conversation

hrshdhgd
Contributor

@hrshdhgd hrshdhgd commented Sep 26, 2023

Fixes mapping-commons/sssom#321

Given a simple SSSOM TSV:

#curie_map:
#  FBbt: "http://purl.obolibrary.org/obo/FBbt_"
#  ORCID: "https://orcid.org/"
#  UBERON: "http://purl.obolibrary.org/obo/UBERON_"
#creator_id:
#  - "ORCID:0000-0002-6095-8718"
subject_id	subject_label	predicate_id	object_id	mapping_justification
FBbt:00000001	organism	semapv:crossSpeciesExactMatch	UBERON:0000468	semapv:ManualMappingCuration
  • I first parse it using parse_sssom_table()
  • msdf1.clean_prefix_map()
  • Convert to JSON using to_json() and write to file
  • Parse again using parse_sssom_json()
  • msdf2.clean_prefix_map()

The prefix maps of both should be the same.
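The round-trip steps above depend on the curie map embedded in the TSV header comment block. As a simplified, self-contained illustration (deliberately naive, not the actual sssom-py parser), the `#curie_map:` block can be read like this:

```python
# Simplified sketch (NOT the sssom-py parser): extract the embedded
# curie map from the commented header of an SSSOM TSV.
tsv_header = """\
#curie_map:
#  FBbt: "http://purl.obolibrary.org/obo/FBbt_"
#  ORCID: "https://orcid.org/"
#  UBERON: "http://purl.obolibrary.org/obo/UBERON_"
"""

def parse_curie_map(header: str) -> dict:
    """Naively parse the '#curie_map:' block into a prefix -> URI dict."""
    curie_map = {}
    in_block = False
    for line in header.splitlines():
        stripped = line.lstrip("#").strip()
        if stripped == "curie_map:":
            in_block = True
            continue
        if in_block and ":" in stripped:
            # Split on the first colon only, so URIs like "https://..." survive
            prefix, _, uri = stripped.partition(":")
            curie_map[prefix.strip()] = uri.strip().strip('"')
        else:
            in_block = False
    return curie_map

print(parse_curie_map(tsv_header))
```

Whatever the real parser does, the invariant is the same: every prefix declared here (including ORCID) must still be present after the TSV => JSON => TSV round trip.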

@hrshdhgd
Contributor Author

hrshdhgd commented Sep 26, 2023

At the moment this fails. Here's the reason why:

Here's the test code:

def test_tsv_to_json_and_back(self):
    """Test converting SSSOM TSV => JSON => SSSOM TSV such that it is reproducible."""
    sample_tsv = f"{test_data_dir}/sample1.sssom.tsv"
    json_outfile = f"{test_out_dir}/sample1.json"
    msdf1 = parse_sssom_table(sample_tsv)
    msdf1.clean_prefix_map()
    json_doc = to_json(msdf1)
    self.assertEqual(msdf1.prefix_map, json_doc["@context"])
    with open(json_outfile, "w") as file:
        write_json(msdf1, file)
    msdf2 = parse_sssom_json(json_outfile)
    msdf2.clean_prefix_map()
    self.assertEqual(msdf1.prefix_map, msdf2.prefix_map)

msdf1.prefix_map looks like this:

{
'owl': 'http://www.w3.org/2002/07/owl#', 
'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#', 
'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
'semapv': 'https://w3id.org/semapv/vocab/', 
'skos': 'http://www.w3.org/2004/02/skos/core#',
'sssom': 'https://w3id.org/sssom/', 
'FBbt': 'http://purl.obolibrary.org/obo/FBbt_', 
'ORCID': 'https://orcid.org/', 
'UBERON': 'http://purl.obolibrary.org/obo/UBERON_'
}

whereas msdf2.prefix_map looks like this

{
'owl': 'http://www.w3.org/2002/07/owl#', 
'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#', 
'rdfs': 'http://www.w3.org/2000/01/rdf-schema#', 
'semapv': 'https://w3id.org/semapv/vocab/', 
'skos': 'http://www.w3.org/2004/02/skos/core#', 
'sssom': 'https://w3id.org/sssom/', 
'FBbt': 'http://purl.obolibrary.org/obo/FBbt_', 
'UBERON': 'http://purl.obolibrary.org/obo/UBERON_'
}

The entry 'ORCID': 'https://orcid.org/' is missing from msdf2.prefix_map.

In the clean_prefix_map() function, the prefixes_in_table variable is the same for both msdf1 and msdf2: {'owl', 'rdfs', 'skos', 'sssom', 'UBERON', 'ORCID', 'semapv', 'rdf', 'FBbt'}. In spite of that, the self.prefix_map generated for each is different.

self.prefix_map = dict(subconverter.bimap)

This line is where the ORCID prefix gets lost:

subconverter = self.converter.get_subconverter(prefixes_in_table)
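As a simplified model of what the subconverter step does (this is an illustration, not the actual curies implementation), filtering by prefixes_in_table can only keep prefixes the converter still knows about. So if ORCID was already dropped during JSON parsing, the filtering step cannot restore it:

```python
# Simplified model (not curies.Converter.get_subconverter itself):
# a "subconverter" keeps only the requested prefixes that the
# converter actually knows about.
def get_submap(prefix_map: dict, prefixes: set) -> dict:
    return {p: uri for p, uri in prefix_map.items() if p in prefixes}

prefixes_in_table = {"FBbt", "ORCID", "UBERON"}

# msdf1's converter still knows ORCID ...
map1 = {
    "FBbt": "http://purl.obolibrary.org/obo/FBbt_",
    "ORCID": "https://orcid.org/",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
}
# ... but if JSON parsing already dropped ORCID from msdf2's converter,
# filtering by prefixes_in_table cannot bring it back:
map2 = {
    "FBbt": "http://purl.obolibrary.org/obo/FBbt_",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
}

assert "ORCID" in get_submap(map1, prefixes_in_table)
assert "ORCID" not in get_submap(map2, prefixes_in_table)
```

This suggests the loss happens upstream of clean_prefix_map(), in how the JSON parser populates the converter.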

@cthoyt , would you have any idea why this would be happening?

cc: @matentzn , @gouttegd

@hrshdhgd hrshdhgd changed the title Convert SSSOM TSV => JSON => TSV and confirm the MappingSetDataFrame remains consistent. Convert SSSOM TSV => JSON => TSV and confirm the MappingSetDataFrame object remains consistent. Sep 26, 2023
@@ -212,21 +212,28 @@ def parse_sssom_table(
 logging.info(f"Externally provided metadata {k}:{v} is added to metadata set.")
 sssom_metadata[k] = v
 meta = sssom_metadata

-if "curie_map" in sssom_metadata:
+if CURIE_MAP in sssom_metadata:
Member

I have rewritten all of this code using curies already and am waiting for #426 (ready) then #401 (needs some more work) to get merged. I think it's better that you reduce effort on re-writing it again, since the curies solution is much cleaner and more elegant.

Member

I split the code for updating this into #431

# This takes priority over default prefix_map in case of a tie.
jsondoc_prefix_map = jsondoc["@context"]

# Convert keys in both maps to lower case for comparison
Member

Is this wise? I think we should not be doing all of this lowercasing. The prefix maps should be correct as they are. Doing all of these ad-hoc operations makes sssom-py insanely hard to work with (I can tell you this from first hand experience since I have tried to understand the whole history of the package)

Member

@cthoyt cthoyt left a comment

Please wait on making progress with this until the PRs that replace a lot of the same code with curies are merged - hopefully this encourages much more consistent and elegant handling of prefix maps and reduces the number of changes to address the issue here

"""Test converting SSSOM TSV => JSON => SSSOM TSV such that it is reproducible."""
sample_tsv = f"{test_data_dir}/sample1.sssom.tsv"
json_outfile = f"{test_out_dir}/sample1.json"
msdf1 = parse_sssom_table(sample_tsv)
Member

@hrshdhgd other hints: when writing tests, make the things you're testing more explicit. The way this test is written, it doesn't really test anything specific. Why not explicitly check that ORCID is in the places you expect it to be? Otherwise, someone reading this test learns nothing about what's actually supposed to be happening here.

write_json(msdf1, file)

msdf2 = parse_sssom_json(json_outfile)
self.assertIn("ORCID", msdf2.prefix_map)
Member

I added some more explicit tests so you can see where things are actually going wrong, which appears to be in the SSSOM JSON parsing, and has nothing to do with cleaning the prefix map like I suggested in a previous comment

cthoyt added a commit that referenced this pull request Sep 26, 2023
This provides an alternative to #429 that makes more explicit the chaining operations done on the metadata and prefix maps

This is also a good change to carefully document the way that this is handled, since I might not have captured it accurately
@hrshdhgd
Contributor Author

hrshdhgd commented Sep 26, 2023

@cthoyt , thanks for the insight! This branch was just to investigate the issue Damien pointed out (hence the draft). As for the lowercasing, I don't like that solution at all. I was just trying to understand the issue better so I could explain it to someone else via code (you and Nico). Also, I assume these files will change once your PRs are merged so I have no intention to continue this branch. May start afresh or merge main into this branch first IF the problem persists.

cthoyt added a commit that referenced this pull request Oct 2, 2023
Closes #363 (final nail in the coffin)

This provides an alternative to
#429 that makes more
explicit the chaining operations done on the metadata and prefix maps.
This is also a good change to carefully document the way that this is
handled, since I might not have captured it accurately. As it is, the
priority order for combining prefix maps is:

1. Internal prefix map inside the document
2. Prefix map passed through this function inside the ``meta``
3. Prefix map passed through this function to ``prefix_map``
4. Default prefix map (handled with ensure_converter)
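Under the assumption that later sources only fill in prefixes missing from earlier ones, that priority order can be sketched as a simple dict merge where earlier sources win ties (illustrative values, not sssom-py's real defaults):

```python
# Sketch of the priority order described above: merge prefix maps so
# that earlier (higher-priority) sources win ties.
def merge_with_priority(*maps: dict) -> dict:
    merged: dict = {}
    for m in maps:  # first map has highest priority
        for prefix, uri in m.items():
            merged.setdefault(prefix, uri)
    return merged

# Illustrative maps only; these are not the actual defaults.
internal = {"ORCID": "https://orcid.org/"}
meta = {"ORCID": "http://example.org/wrong/", "FBbt": "http://purl.obolibrary.org/obo/FBbt_"}
explicit = {"UBERON": "http://purl.obolibrary.org/obo/UBERON_"}
default = {"skos": "http://www.w3.org/2004/02/skos/core#"}

merged = merge_with_priority(internal, meta, explicit, default)
assert merged["ORCID"] == "https://orcid.org/"  # internal map wins the tie
```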
@cthoyt
Member

cthoyt commented Oct 2, 2023

@hrshdhgd I think this is now solved with #431. Maybe you can rebase then see if you can add some more tests to make sure

Member

@cthoyt cthoyt left a comment

I am now 100% against adding any ad hoc code that works on prefixes or prefix maps into sssom-py.

If people add incorrect prefixes that aren't defined, why don't we give them appropriate warnings/errors?

Besides, this code shouldn't be necessary to address the original issue, which was something to do with incorrectly writing SSSOM files that was fixed in another PR

Collaborator

@matentzn matentzn left a comment

Can you add a failing test here that roundtrips through TSV->JSON->TSV before we do anything else? I think the problem is somewhere deeper, but perhaps the solution is somewhere along the lines of https://github.com/mapping-commons/sssom-py/pull/431/files#diff-aed80c160ecfb8db6a235671a6d4ed0ae74470301705533ff4ec2a7a36dd989fR444

@hrshdhgd hrshdhgd closed this Oct 3, 2023
@hrshdhgd hrshdhgd deleted the issue-321 branch October 3, 2023 19:49
@hrshdhgd
Contributor Author

hrshdhgd commented Oct 3, 2023

Charlie's test validates the issue:

class TestParseExplicit(unittest.TestCase):
    """This test case contains explicit tests for parsing."""

    def test_round_trip(self):
        """Explicitly test round tripping."""
        rows = [
            (
                "DOID:0050601",
                "ADULT syndrome",
                "skos:exactMatch",
                "UMLS:C1863204",
                "ADULT SYNDROME",
                "semapv:ManualMappingCuration",
                "orcid:0000-0003-4423-4370",
            )
        ]
        columns = [
            "subject_id",
            "subject_label",
            "predicate_id",
            "object_id",
            "object_label",
            "mapping_justification",
            "creator_id",
        ]
        df = pd.DataFrame(rows, columns=columns)
        msdf = MappingSetDataFrame(df=df, converter=ensure_converter())
        msdf.clean_prefix_map(strict=True)

        #: This is a set of the prefixes that explicitly are used in this
        #: example. SSSOM-py also adds the remaining builtin prefixes from
        #: :data:`sssom.context.SSSOM_BUILT_IN_PREFIXES`, which is reflected
        #: in the formulation of the test expectation below
        explicit_prefixes = {"DOID", "semapv", "orcid", "skos", "UMLS"}
        self.assertEqual(
            explicit_prefixes.union(SSSOM_BUILT_IN_PREFIXES),
            set(msdf.prefix_map),
        )

        with tempfile.TemporaryDirectory() as directory:
            directory = Path(directory)
            path = directory.joinpath("test.sssom.tsv")
            with path.open("w") as file:
                write_table(msdf, file)
            _, read_metadata = _read_pandas_and_metadata(_open_input(path))
            reconstituted_msdf = parse_sssom_table(path)

            # This tests what's actually in the file after it's written out
            self.assertEqual({CURIE_MAP, "license", "mapping_set_id"}, set(read_metadata))
            self.assertEqual(DEFAULT_LICENSE, read_metadata["license"])
            self.assertTrue(read_metadata["mapping_set_id"].startswith(f"{SSSOM_URI_PREFIX}mappings/"))
            expected_prefix_map = {
                "DOID": "http://purl.obolibrary.org/obo/DOID_",
                "UMLS": "http://linkedlifedata.com/resource/umls/id/",
                "orcid": "https://orcid.org/",
                "owl": "http://www.w3.org/2002/07/owl#",
                "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
                "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
                "semapv": "https://w3id.org/semapv/vocab/",
                "skos": "http://www.w3.org/2004/02/skos/core#",
                "sssom": "https://w3id.org/sssom/",
            }
            self.assertEqual(
                expected_prefix_map,
                read_metadata[CURIE_MAP],
            )
            # This checks that nothing funny gets added unexpectedly
            self.assertEqual(expected_prefix_map, reconstituted_msdf.prefix_map)

@matentzn
Collaborator

matentzn commented Oct 4, 2023

This test does not seem to have a JSON serialise and parse step in it, though. What we need is a test that takes Damien's test case, serialises it to JSON, then parses the JSON, then ensures that the pre-JSON and post-JSON msdf objects are the same.
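At the level of the prefix map alone, the invariant such a test should pin down is that JSON serialisation and parsing are lossless. A minimal stdlib-only sketch of that invariant (using the json module directly, not sssom-py's to_json/parse_sssom_json):

```python
import json

# Minimal illustration of the requested invariant, restricted to the
# prefix map: serialising the "@context" to JSON and parsing it back
# must not lose any entries.
prefix_map = {
    "FBbt": "http://purl.obolibrary.org/obo/FBbt_",
    "ORCID": "https://orcid.org/",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
}
doc = json.dumps({"@context": prefix_map})
roundtripped = json.loads(doc)["@context"]
assert roundtripped == prefix_map  # in particular, ORCID must survive
```

The real test would additionally assert that the full MappingSetDataFrame (mappings, metadata, and converter) is equal before and after the JSON round trip.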

@hrshdhgd hrshdhgd restored the issue-321 branch December 13, 2023 22:01
Successfully merging this pull request may close these issues.

JSON serialisation format is woefully underspecified