Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add website-dedicated special JS rewriting rules #422

Open
benoit74 opened this issue Nov 18, 2024 · 3 comments
Open

Add website-dedicated special JS rewriting rules #422

benoit74 opened this issue Nov 18, 2024 · 3 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@benoit74
Copy link
Collaborator

In some cases (e.g. openzim/zim-requests#1162, but I'm pretty sure #402 would need the same), we need to patch website JS so that it does not interfere badly once inside the ZIM.

Exisiting code is capable to perform these additional rewrites thanks to rewriting rules which are already used to prepare things for wombat operation.

I propose that (I'm not sure about these points):

  • we continue to embed these new rules in the codebase
    • pros:
      • it is way simpler to write code than to define a (JSON / YAML) formalism capable to handle all rules and especially all different kind of rewrite operations (replace, add_suffix, add_prefix, ...)
      • rules are available for everyone to use, and we can more easily benefit from community-sourced rules for websites we've not yet explored
      • rules are testable and tested
    • cons:
      • we need to release everytime we need a new rule
      • other users of warc2zim need to create a PR to have their own rules for their own websites
  • in addition to the rule, we define a regexp of JS ZIM paths which must be rewriten with this rule, so that these new rules activate only when needed and do not risk breaking random JS of another random website

WDYT?

@benoit74 benoit74 added enhancement New feature or request question Further information is requested labels Nov 18, 2024
@rgaudin
Copy link
Member

rgaudin commented Nov 18, 2024

What does the existing ones look like?

@benoit74
Copy link
Collaborator Author

Existing ones looks like this:

return [
# rewriting `eval(...)` - invocation
(re.compile(r"(?:^|\s)\beval\s*\("), replace_prefix_from(eval_str, "eval")),
# rewriting `x = eval` - no invocation
(re.compile(r"[=]\s*\beval\b(?![(:.$])"), replace("eval", "self.eval")),
# rewriting `.postMessage` -> `__WB_pmw(self).postMessage`
(re.compile(r"\.postMessage\b\("), add_prefix(".__WB_pmw(self)")),
# rewriting `location = ` to custom expression `(...).href =` assignement
(
re.compile(r"[^$.]?\s?\blocation\b\s*[=]\s*(?![\s\d=])"),
add_suffix_non_prop(check_loc),
),
# rewriting `return this`
(re.compile(r"\breturn\s+this\b\s*(?![\s\w.$])"), replace_this()),
# rewriting `this.` special porperties access on new line, with ; perpended
# if prev chars is `\n`, or if prev is not `.` or `$`, no semi
(
re.compile(
rf"[^$.]\s?\bthis\b(?=(?:\.(?:{'|'.join(GLOBAL_OVERRIDES)})\b))"
),
replace_this_non_prop(),
),
# rewrite `= this` or `, this`
(re.compile(r"[=,]\s*\bthis\b\s*(?![\s\w:.$])"), replace_this()),
# rewrite `})(this_rw)`
(re.compile(r"\}(?:\s*\))?\s*\(this\)"), replace_this()),
# rewrite this in && or || expr
(
re.compile(r"[^|&][|&]{2}\s*this\b\s*(?![|\s&.$](?:[^|&]|$))"),
replace_this(),
),
# ignore `async import`.
# As the rule will match first, it will prevent next rule matching `import` to
# be apply to `async import`.
(re.compile(r"async\s+import\s*\("), m2str(lambda x: x)),
# esm dynamic import, if found, mark as module
(
re.compile(r"[^$.]\bimport\s*\("),
replace_import("import", "____wb_rewrite_import__"),
),
]

And for banrepcultural encyclopedia I need to add something like this:

        (
            re.compile(r"var excludedRelativePaths=\['\/index.php','#'\];"),
            replace(
                "/index.php",
                "index.php",
            ),
        ),

@rgaudin
Copy link
Member

rgaudin commented Nov 18, 2024

And for banrepcultural encyclopedia I need to add something like this:

None of the other are content-specific and only ensure those calls are captured by Wombat.

Content-specific should not be blindly applied in my opinion. Too fragile.

I also think that for now it makes more sense to be code-driven. We need to gain experience with this and code has the flexibility we need. We can move to data-driven later, if that makes sense, once we're clear (from experience) about what needs to be defined and how.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants