Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove pymongo dependency from this project #771

Merged
merged 7 commits into from
Jun 6, 2024
Merged

Remove pymongo dependency from this project #771

merged 7 commits into from
Jun 6, 2024

Conversation

squeaky-pl
Copy link
Contributor

@squeaky-pl squeaky-pl commented May 29, 2024

The original authors of this project chose to use pymongo's bson utils to handle JSON serialization and deserialization into various VARCHAR types in MySQL.

We have an unresolved security issue related to pymongo.

image

I've had questions in the past why are we using pymongo here, it suggests that sync-engine uses MongoDB while it does not. I think it's time to break that dependency because it just brings confusion, and pymongo will continue to receive security patches often as it contains C code which makes it prone to security bugs.

MongoDB's extended JSON syntax can serialize more types than stdlib's json module. The docs.

Pymongo's implementation sits in here https://github.com/mongodb/mongo-python-driver/blob/18328a909545ece6e1cd7e172e28271a59e367d5/bson/json_util.py. As you can see it handles many extra types. I analyzed all the call sites and data in those columns for \$\w references and concluded that we are only using "field": {"$date": <timestmp>} syntax in sync-engine.

Example

{"sync_disabled_by": null, "sync_start_time": {"$date": 1717004873429}, "original_start_time": {"$date": 1522779878389}, "sync_error": null, "sync_disabled_on": null, "sync_end_time": null, "sync_disabled_reason": null}

I've vendored the code and then stripped it to only include the parts that we are using.

Separately I think it makes sense to completely get rid of this $date syntax as I don't find it particularly readable.

I've already tested those changes first on my account in production and also on an entire cluster (nylas-e) and did not see any problems.

My intenral monologue about columns using JSON types
SELECT DISTINCT extra_args FROM (SELECT extra_args from actionlog a LIMIT 500000) As a;
extra_args
{"unread": false}
{"added_labels": [], "removed_labels": ["\Inbox"]}
{"destination": "INBOX.Archive"}
{"destination": "INBOX"}
{"added_labels": ["\Trash"], "removed_labels": ["\Inbox"]}
{"added_labels": ["\Inbox"], "removed_labels": []}
{"added_labels": [], "removed_labels": ["\Inbox", "\Trash"]}
{"added_labels": ["\Important", "\All"], "removed_labels": []}
{"added_labels": ["\Important"], "removed_labels": []}
SELECT account._sync_status FROM account LIMIT 1;
_sync_status
{"sync_disabled_by": null, "sync_start_time": {"$date": 1717004873429}, "original_start_time": {"$date": 1522779878389}, "sync_error": null, "sync_disabled_on": null, "sync_end_time": null, "sync_disabled_reason": null}
SELECT DISTINCT REGEXP_SUBSTR(account._sync_status, "\\$\\w+") FROM account WHERE account._sync_status LIKE "%$%";
REGEXP_SUBSTR(account._sync_status, "\$\w+")
$date
SELECT DISTINCT extra_flags FROM (SELECT extra_flags from imapuid a LIMIT 50000) As a;
extra_flags
[]
["$NotJunk", "NotJunk"]
["$Forwarded", "Forwarded"]
["$Forwarded", "$NotJunk", "NotJunk"]
["$Forwarded"]
["$Phishing"]
["$NotJunk", "JunkRecorded", "NotJunk"]
["$Forwarded", "$NotJunk", "Forwarded", "JunkRecorded", "NotJunk"]
["$MailFlagBit0", "$NotJunk", "JunkRecorded", "NotJunk"]
["$MailFlagBit0", "$MailFlagBit2"]
["$MailFlagBit0"]
["$Forwarded", "$NotJunk", "Forwarded", "NotJunk"]
["$MailFlagBit0", "$MailFlagBit2", "$NotJunk", "JunkRecorded", "NotJunk"]
["$MailFlagBit1", "$NotJunk", "JunkRecorded", "NotJunk"]
["$MailFlagBit1"]
["$MailFlagBit0", "$MailFlagBit2", "$Phishing"]
SELECT DISTINCT g_labels FROM (SELECT g_labels from imapuid a LIMIT 500000) As a;
g_labels
[]
  • message.from_addr message.sender_addr message.reply_to message.to_addr message.cc_addr message.bcc_adr message.in_reply_to
    from_addr = Column(JSON, nullable=False, default=list)
    sender_addr = Column(JSON, nullable=True)
    reply_to = Column(JSON, nullable=True, default=list)
    to_addr = Column(JSON, nullable=False, default=list)
    cc_addr = Column(JSON, nullable=False, default=list)
    bcc_addr = Column(JSON, nullable=False, default=list)
    in_reply_to = Column(JSON, nullable=True)
SELECT from_addr, sender_addr, reply_to, to_addr, cc_addr, bcc_addr, in_reply_to FROM message LIMIT 100;

Looks like json encoded:
from_addr list[tuple[str, str]]
sender_addr list[tuple[str, str]]
reply_to list[tuple[str, str]]
to_addr list[tuple[str, str]]
cc_addr list[tuple[str, str]]
bcc_addr list[truple[str, str]]
in_reply_to str

This table is empty

list[{"status": enum, notes: str, guests: list[?], name: str, email: str}]

@squeaky-pl squeaky-pl changed the title json util Remove pymongo dependency from this project Jun 3, 2024
Comment on lines -11 to -13
# Monkeypatch to not include tz_info in decoded JSON.
# Kind of a ridiculous solution, but works.
json_util.EPOCH_AWARE = EPOCH_NAIVE
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this was indeed ridiculous I just inlined the monkeypatched code into the vendored module.

@squeaky-pl squeaky-pl merged commit 2c3aad1 into master Jun 6, 2024
6 checks passed
@wojcikstefan wojcikstefan deleted the json-util branch August 22, 2024 13:03
@wojcikstefan
Copy link
Member

Thanks for doing this @squeaky-pl! Been long time coming.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants