
Database format 21: add JSON data format #1786

Draft · wants to merge 42 commits into master

Conversation

@dsblank (Member) commented Oct 10, 2024

This PR adds a new column, "json_data", to all primary tables; it contains JSON data (stored as plain TEXT) converted from the pickled blobs.

It doesn't remove blob_data from existing databases, so they can still be opened in earlier versions of Gramps.

However, newly created databases no longer contain pickled data.
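A minimal sketch of the upgrade idea, assuming SQLite and a single person table with handle and blob_data columns (the names here are illustrative; the actual upgrade iterates over every primary table and goes through the object serializers):

```python
import json
import pickle
import sqlite3

def add_json_data(conn: sqlite3.Connection) -> None:
    """Sketch only: copy pickled blob_data into a new json_data TEXT column."""
    cur = conn.cursor()
    # Keep blob_data in place; just add the JSON column alongside it.
    cur.execute("ALTER TABLE person ADD COLUMN json_data TEXT")
    rows = cur.execute("SELECT handle, blob_data FROM person").fetchall()
    for handle, blob in rows:
        data = pickle.loads(blob)  # the old serialized tuple
        cur.execute(
            "UPDATE person SET json_data = ? WHERE handle = ?",
            (json.dumps(data), handle),
        )
    conn.commit()
```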

Work in progress.

@dsblank (Member, Author) commented Oct 10, 2024

One thing I am not sure about: if you need to upgrade a table from version 20 or older, you need to open it with the old schema. How does that work?

@Nick-Hall (Member) commented

If we are moving from BLOBs to JSON then we should really use the new format. See PR #800.

The new format uses the to_json and from_json methods in the serialize module to build the JSON from the underlying classes. It comes with get_schema class methods which provide a JSON Schema, allowing the validation that we already use in our unit tests.

The main benefit of the new format is that it is easier to maintain and debug. Instead of lists we use dictionaries, so, for example, we refer to the field "parent_family_list" instead of field number 9.

Upgrades are no problem. We just read and write the raw data.
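For reference, the round trip in that format looks roughly like this (a sketch; the module paths and methods are as used in PR #800 and may differ in the final version):

```python
from gramps.gen.lib import Person
from gramps.gen.lib.serialize import to_json, from_json

person = Person()
person.set_gramps_id("I0001")

text = to_json(person)        # JSON string with named fields such as "gramps_id"
copy = from_json(text)        # rebuilds the Person from the JSON string

schema = Person.get_schema()  # JSON Schema used for validation in the unit tests
```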

When I have more time I'll update you on the discussions that took place while you were away.

@dsblank (Member, Author) commented Oct 11, 2024

Oh, that sounds like a great idea! I'll take a look at the JSON format and switch to that. It should work even better with SQL's JSON_EXTRACT().
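For example (a sketch assuming SQLite's built-in JSON functions and a person table with the new json_data column), queries could pull individual fields straight out of the stored JSON:

```python
import sqlite3

conn = sqlite3.connect("example.gramps.db")  # hypothetical database path
cur = conn.execute(
    "SELECT json_extract(json_data, '$.gramps_id') FROM person"
)
for (gramps_id,) in cur:
    print(gramps_id)
```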

@Nick-Hall (Member) commented

There are a few places where the new format is used, so we will get some bonus performance improvements.

Feel free to make changes to my existing code if you see a benefit.

You may also want to have a quick look at how we serialize GrampsType. Enough information is stored so that we can recreate the object, but I don't think that I chose to store all fields.
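As a purely hypothetical illustration of that idea (none of these names are taken from the actual GrampsType code), an object that can be rebuilt from its integer value and custom string only needs those two fields in the JSON:

```python
# Hypothetical illustration, not the real Gramps serializer.
class FakeType:
    CUSTOM = 0

    def __init__(self, value=0, string=""):
        self.value = value    # predefined type id
        self.string = string  # custom text, only meaningful when value == CUSTOM

    def to_dict(self):
        # Store only what is needed to recreate the object; any derived or
        # display-only fields are deliberately left out.
        return {"_class": type(self).__name__, "value": self.value, "string": self.string}

    @classmethod
    def from_dict(cls, data):
        return cls(data["value"], data["string"])
```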

@dsblank (Member, Author) commented Oct 12, 2024

Making some progress. It turns out the serialized format has leaked into many other places, probably for speed. Those are probably good candidates for moving into business logic.

@dsblank (Member, Author) commented Oct 13, 2024

I added to_dict() and from_dict() methods based on to_json() and from_json(). I didn't know about the object hooks. Brilliant! That saves so much code.
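For anyone following along, the object hook mentioned here is the object_hook parameter of the standard json module; a minimal sketch of the pattern (not the actual Gramps code):

```python
import json

# Sketch: dicts carrying a "_class" key are turned back into objects while
# json.loads walks the tree, so no hand-written recursive decoder is needed.

class Person:
    def __init__(self, gramps_id=""):
        self.gramps_id = gramps_id

CLASSES = {"Person": Person}

def object_hook(obj):
    cls = CLASSES.get(obj.pop("_class", None))
    if cls is None:
        return obj
    instance = cls()
    instance.__dict__.update(obj)
    return instance

person = json.loads('{"_class": "Person", "gramps_id": "I0001"}', object_hook=object_hook)
print(person.gramps_id)  # I0001
```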

@dsblank (Member, Author) commented Oct 13, 2024

@Nick-Hall, I will probably need your assistance with the complete save/load using the to_json and from_json functions. I looked at your PR, but since it touches 590 files, there is a lot there.

In this PR, I can now upgrade a database and load the People views (except for the name functions, which I still have to figure out).

[screenshot]

@Nick-Hall (Member) commented

@dsblank I have rebased PR #800 on the gramps51 branch. Only 25 files were actually changed.

You can also see the changes suggested by @prculley resulting from his testing and performance benchmarks.

@dsblank (Member, Author) commented Oct 13, 2024

Thanks @Nick-Hall, that was very useful. I think I will cherry-pick some of the changes (like the attribute name changes and the elimination of private attributes).

You'll see that I made many of the same changes you did. One thing I found, though, is that if we want to allow upgrades from previous versions, we need to be able to read in blob_data and write out json_data. I think my version has that covered.

I'll continue to make progress.

@Nick-Hall (Member) commented

@dsblank Why are you removing the properties? The validation in the setters will no longer be called.
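For context, the properties in question look roughly like this (an illustrative sketch with made-up constants, not the exact Gramps code); replacing the property with a plain attribute means assignments skip the check in the setter:

```python
class Person:
    def __init__(self):
        self._gender = 2  # illustrative: 2 = unknown

    @property
    def gender(self):
        return self._gender

    @gender.setter
    def gender(self, value):
        # This validation is silently lost if the property is removed and
        # callers assign to a plain attribute instead.
        if value not in (0, 1, 2):
            raise ValueError("gender must be 0 (female), 1 (male) or 2 (unknown)")
        self._gender = value
```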

@dsblank (Member, Author) commented Oct 14, 2024

@Nick-Hall, I thought that was what @prculley did for optimization, and I thought it was needed. I can put those back :)

@dsblank (Member, Author) commented Oct 28, 2024

Major milestone reached: the database upgrade works, and all of the views can now be rendered using the raw JSON data. Lots of work still to be done, but I wanted to take a minute to give a status update.

[screenshot]

@emyoulation (Contributor) commented

Is there something that could be posted to the Discourse forum to stir up some energy?

For instance, the Isotammi Filter+ gramplet shows a timer for filtering. Is there a filter (or custom filter) that would demonstrate the newly optimized difference when run on 5.2 vs. 5.3?

Posting a capture of the Filter+ timings for the Example.gramps database would encourage people to archive a similar test of their real-world data now in 5.2... and again when 5.3 comes out. That might stir up some interest in optimizing the aging internal rules.

@dsblank (Member, Author) commented Oct 29, 2024

> Is there something that could be posted to the Discourse forum to stir up some energy?

Probably not yet.

> For instance, the Isotammi Filter+ gramplet shows a timer for filtering. Is there a filter (or custom filter) that would demonstrate the newly optimized difference when run on 5.2 vs. 5.3?

I haven't done much in the way of speeding things up yet. This PR merely replaces one representation with another. Tests in #1787 show that the two formats have similar timings.

> Posting a capture of the Filter+ timings for the Example.gramps database would encourage people to archive a similar test of their real-world data now in 5.2... and again when 5.3 comes out. That might stir up some interest in optimizing the aging internal rules.

That will be next! Nothing will change much until we can exploit the database level. I'll continue that work in #1785.

I'm going to start working on getting the tests to pass. Then this should be ready for review.

@Nick-Hall (Member) commented

@dsblank I'll fix the Date in the JSON schema for you tomorrow.

@dsblank (Member, Author) commented Oct 29, 2024

Oh great! I just hit bugs in the tests that I'm working on. Thanks!

@dsblank (Member, Author) commented Oct 30, 2024

Down to two categories of pytest errors and failures:

  1. vCard-related failures
  2. 'Test Case Generator' and 'Check & Repair Database'

The second one seems to be caused by a random number change, but I'm still tracking it down.

@Nick-Hall, I think this is all that is needed for the Date fix: d2c301a#diff-de77e4fb6f5c704f314759c3ec0ceddea525f966ddf056e6ee1dee0ba852c5f3L760-L763

But note that the difference between null and {'format': None, 'calendar': 0, 'modifier': 0, 'quality': 0, 'dateval': [0, 0, 0, False], 'text': '', 'sortval': 0, 'newyear': 0, '_class': 'Date'} makes a big difference in the length of the JSON object.
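One possible way to avoid that overhead (a sketch only; date_to_dict and date_from_dict are hypothetical helpers standing in for whatever the PR settles on) is to write null for a default Date and turn null back into an empty Date when reading:

```python
from gramps.gen.lib import Date

def date_to_jsonable(date):
    """Sketch: emit None (JSON null) instead of the full dict for an empty Date."""
    if date is None or date.is_empty():
        return None
    return date_to_dict(date)  # hypothetical full serializer

def date_from_jsonable(value):
    """Sketch: treat JSON null as an empty Date when deserializing."""
    return Date() if value is None else date_from_dict(value)  # hypothetical
```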

@Nick-Hall (Member) commented

@dsblank We should also update the JSON schema. See PR #1789. There is a unit test that validates the example database against the schema in schema_test.py.

The JSON representation of an empty date is longer than null, as you point out.
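For reference, the check in the unit tests boils down to something like this (a sketch; the real test in schema_test.py validates the example database):

```python
import json

import jsonschema
from gramps.gen.lib import Person
from gramps.gen.lib.serialize import to_json

person = Person()
person.set_gramps_id("I0001")

# Raises jsonschema.ValidationError if the serialized object no longer
# matches the schema published by get_schema().
jsonschema.validate(instance=json.loads(to_json(person)), schema=Person.get_schema())
```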
