Use of URI templates seems problematic #259

Open
annevk opened this issue Mar 3, 2025 · 21 comments
Comments

@annevk
Member

annevk commented Mar 3, 2025

Browsers don't implement a URI parser, let alone a URI template parser. Instead they use https://url.spec.whatwg.org and https://urlpattern.spec.whatwg.org. I think it would be better to build upon those primitives.

@garretrieger
Contributor

Thanks for raising this. For our usage, parsing is limited to finding and replacing the substitution expressions as defined in rfc6570, which should be simpler than general URL parsing.
Beyond that the actual URLs are just treated as opaque strings which are passed on to fetch to resolve.

We definitely want to make sure we're reusing functionality that's already present in browsers where possible though. I reviewed the two linked specs, but it doesn't look to me like they provide template substitution functionality. Is there possibly another spec that's in use in browsers which provides template substitution functionality and would be preferable to rfc6570?

@annevk
Member Author

annevk commented Mar 4, 2025

You are correct. This would require a prose-version of whatwg/urlpattern#73. @jeremyroman @sisidovski do you think you could add that to URLPattern so the IFT specification can build upon URLPattern instead of URI templates?

@sisidovski

I'm still catching up with the IFT. @garretrieger Could you help me understand how the URI templates are used in IFT?

@svgeesus
Contributor

svgeesus commented Mar 4, 2025

They are used to efficiently store the url of font patches, which can be downloaded to extend an already-downloaded font.

@garretrieger
Contributor

garretrieger commented Mar 4, 2025

To expand on that, the idea is this:

  • Inside the font we have data (patch map) which lists a set of patch files that are available and the URLs at which those patches reside.
  • To minimize the cost of storing potentially hundreds of URLs we assign each patch a numeric ID which then gets substituted into a URL template to produce the final URL for each patch.
  • So for example we might have an IFT font with the following mapping table:
URL Template: https://foo.bar/patches/{id}.patch
Patch 1: load when any of unicode code points a through m are present.
...
Patch 123: load when any of unicode code points n through z are present.
  • Then if the codepoint p is present, the client would determine that patch 123 needs to be loaded. Next, it will generate the URL to load by substituting 123 into the url template to produce: https://foo.bar/patches/FC.patch (note: the numeric id 123 gets base32 encoded to 'FC' prior to substitution).

A couple of things to note: the IFT client doesn't parse anything in the URL template other than locating the {id} substitution point. Further parsing/fetching of the URL is delegated to fetch.

@sisidovski

Thank you, that is helpful.

Regarding the format, from the spec, it looks like patch URLs could have multiple substitutions like this.

//foo.bar{/d1,d2,id} 478 Integer //foo.bar/0/F/07F0
//foo.bar{/d1,d2,d3,id} 123 Integer //foo.bar/C/F/_/FC

In URLPattern, substitutions are expressed like //foo.bar/:d1/:d2/:id.

It looks like there are only six substitutions, all of which are fixed names and not configurable? https://w3c.github.io/IFT/Overview.html#uri-templates If so, I wonder if IFT really wants whatwg/urlpattern#73, as the use case is pretty limited.

@annevk
Member Author

annevk commented Mar 7, 2025

Yeah, I had missed that the template is fixed. An alternative here might be that IFT just builds the necessary URL paths itself with a prose-written algorithm. It would have to handle certain percent-encoding details, but it shouldn't be too much work and allows removing this dependency. (At least as a normative reference, I think for illustrative purposes it might still be useful to call out the equivalent URI template (or eventual URLPattern template, once we get there).)

@skef
Contributor

skef commented Mar 7, 2025

I believe the reason we reference the URI template mechanism is to make it clear how the templates described in the specification are intentionally compatible with it. We had a need for a fill-in template and it seemed more sensible to base the one in the spec off an actual standard rather than rolling our own (which is what an earlier prototype did). It was never about requiring the addition of a full parser.

Our current prototype does use https://docs.rs/uri-template-system/latest/uri_template_system/ but that's more a matter of convenience than anything else.

> An alternative here might be that IFT just builds the necessary URL paths itself with a prose-written algorithm. It would have to handle certain percent-encoding details, but it shouldn't be too much work and allows removing this dependency.

Suppose we were to change the language to reference RFC6570 but note that because the six variables are fixed, it is fine to use a custom algorithm that only supports those variables as long as it produces the same result. Would that be sufficient? There are many ways of accomplishing the substitutions, and prescribing an algorithm in the spec would take the other options off the table. (For example, what if there is a URI template implementation available in the client?)

@annevk
Member Author

annevk commented Mar 7, 2025

@skef it depends on whether the URLs that end up being produced would be identical or not. E.g., is the percent-encoding lowercase or uppercase, etc. It's probably easier to accomplish that by just writing something in terms of the URL standard, but if someone verified it, maybe it could work.

@garretrieger
Contributor

garretrieger commented Mar 7, 2025

I'd prefer to stick with rfc6570 versus writing our own algorithm. The standard is mature, has a wide variety of good quality implementations already, an existing test suite, and does pretty much exactly what we need it to for our use case.

It's important to note that an implementation of rfc6570 doesn't require doing general URL parsing. An implementation only needs to search for and replace substitution expressions (identified by {...}) and anything outside of those expressions is just blindly copied or percent encoded if the codepoints aren't url safe.

For context here's a couple of relevant quotes from the spec text: "The syntax is designed to be trivial to parse while at the same time providing enough flexibility to express many common template scenarios." and "The process of URI Template expansion is to scan the template string from beginning to end, copying literal characters and replacing each expression with the result of applying the expression's operator to the value of each variable named in the expression." An expression here refers to anything enclosed in {...}.
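
As a rough illustration of that scan-and-replace process, here is a minimal sketch restricted to RFC 6570 level 1 (simple `{var}` expressions only). The function name is hypothetical, and the sketch simplifies in one respect: literal text is copied verbatim rather than percent-encoded where needed.

```python
from urllib.parse import quote


def expand_level1(template: str, variables: dict[str, str]) -> str:
    """Scan the template left to right, copying literals and replacing
    each '{name}' expression with the percent-encoded variable value."""
    out = []
    pos = 0
    while pos < len(template):
        start = template.find("{", pos)
        if start == -1:
            out.append(template[pos:])  # trailing literal text
            break
        end = template.find("}", start)
        if end == -1:
            raise ValueError("unterminated expression in template")
        out.append(template[pos:start])  # literal text before the expression
        name = template[start + 1:end]
        # Simple expansion keeps only unreserved characters unencoded.
        # quote() emits uppercase hex digits; undefined variables expand to "".
        out.append(quote(variables.get(name, ""), safe=""))
        pos = end + 1
    return "".join(out)


print(expand_level1("https://foo.bar/patches/{id}.patch", {"id": "FC"}))
# https://foo.bar/patches/FC.patch
```

Nothing here inspects the structure of the surrounding URL, which is the point being made: the expansion is string substitution plus percent-encoding of the substituted values.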

In my opinion writing our own template syntax and expansion algorithm would likely introduce a similar level of complexity to rfc6570 and since expanding templates in this fashion doesn't require understanding the structure of the literal URL string surrounding the substitution expressions I don't think we need to frame anything in terms of the URL standard. What we're using rfc6570 for is essentially just string substitution and percent encoding where needed.

If there's a desire to limit some of the more complex substitution types found in the spec, we could look at limiting the level of expression support required for an IFT implementation. The rfc defines 4 levels of substitution expressions ranging from simple (level 1) to complex (level 4) (see: https://datatracker.ietf.org/doc/html/rfc6570#section-1.2). For our use cases it would likely be sufficient to only require level 1 and a subset of the level 3 operators ( /, ?, &, ;). Level 3 operators aren't strictly needed, but do allow for more compact templates for some expected use cases.

@annevk
Member Author

annevk commented Mar 8, 2025

I think, given how IFT uses this, building up a path and calling percent-encode would not be that much complexity.

What I'm worried about is that 6570 does not define for instance whether percent-encoding happens with uppercase or lowercase alpha digits. That is a problem.

The code points that get percent-encoded might also differ from the percent-encode sets defined by the URL standard. That's a potential issue, as I did not try to confirm this one; the prior one seems substantive enough on its own.

@garretrieger
Contributor

garretrieger commented Mar 10, 2025

> I think, given how IFT uses this, building up a path and calling percent-encode would not be that much complexity.

> What I'm worried about is that 6570 does not define for instance whether percent-encoding happens with uppercase or lowercase alpha digits. That is a problem.

Ah, I didn't notice this; that is somewhat of an issue. I dug into this a bit more: the spec references rfc3986 for percent encoding, which does indicate a preference ("should") for uppercase when producing URLs. There are a few examples throughout 6570 that all use uppercase, and the official test cases consistently use uppercase too. Based on that I think it's safe to assume the intention is for the output to use uppercase. For our use in IFT I think it would be reasonable to specifically clarify this issue and explicitly require uppercase in any percent encoding in the produced URLs.
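
To make the case issue concrete, here is a small sketch. The two spellings of an escape decode to the same text but are different strings, so two expanders that disagree on case would produce different URL strings for the same patch. The `uppercase_escapes` helper is hypothetical and only shows one way an implementation could enforce the uppercase form.

```python
import re
from urllib.parse import quote, unquote

upper = quote("é", safe="")  # Python's quote() emits uppercase hex: %C3%A9
lower = upper.lower()        # %c3%a9

print(upper != lower)                    # True: the strings differ
print(unquote(upper) == unquote(lower))  # True: both decode to the same text


def uppercase_escapes(url: str) -> str:
    """Rewrite every %xx escape so that its hex digits are uppercase."""
    return re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), url)


print(uppercase_escapes("/patches/caf%c3%a9.patch"))
# /patches/caf%C3%A9.patch
```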

> The code points that get percent-encoded might also differ from the percent-encode sets defined by the URL standard. That's a potential issue, as I did not try to confirm this one; the prior one seems substantive enough on its own.

I'll need to look into this one a bit more. Would https://url.spec.whatwg.org/ be the appropriate thing to compare to?

@annevk
Member Author

annevk commented Mar 10, 2025

Yes.

@garretrieger
Contributor

Spent some time reviewing URL, rfc6570, and how those intersect with what we're trying to do in IFT and here's what I've concluded:

  • In the IFT spec we currently utilize rfc3986 when referring to URIs and for doing reference resolution (see load patch file). Since URL obsoletes rfc3986 we should rewrite that section in terms of URL. Specifically the output of template expansion will be passed to URL Parsing to convert into a URL and to resolve relative URLs.
  • So then the overall processing in IFT looks like this:
    • The template string and entry id get expanded into an ASCII string.
    • The ASCII string and base font URL get fed into URL parsing to produce a URL.
    • The URL is passed to fetch as part of the request object.
  • With that in mind, we need to ensure that the output of template expansion is compatible with URL Parsing. The input to URL parsing is a UTF-8 encoded string, and since the output from expansion is guaranteed to be ASCII we are good on that front.
    • Note: we don't require that template expansion results in a valid URL string. That's up to the encoder implementation to ensure the template it's using will result in valid URL strings. If invalid URLs are produced they will be rejected during URL parsing.
  • As for percent encoding it looks to me like the percent encoding specified in rfc6570 matches the expectations in URL, both utf8 encode the unicode code point and then percent encode the resulting octets. URL accepts both uppercase and lowercase in percent encodings, but for consistency we should update the IFT spec text to require uppercase during expansion.
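
The processing steps listed above can be sketched as follows. Big caveat: Python's standard library has no WHATWG URL parser, so `urljoin` (which follows RFC 3986 style reference resolution) is used here purely as a stand-in for the URL Parsing step, and its results can differ from a WHATWG parser on edge cases. The font URL and the function name are hypothetical, and the fetch step is omitted.

```python
from urllib.parse import urljoin


def resolve_patch_url(expanded: str, font_url: str) -> str:
    """Resolve an expanded template against the URL of the font itself."""
    # Template expansion is expected to produce ASCII only.
    if not expanded.isascii():
        raise ValueError("expanded template must be ASCII")
    # Stand-in for URL Parsing with a base URL; handles relative,
    # scheme-relative and absolute inputs.
    return urljoin(font_url, expanded)


font_url = "https://example.com/fonts/font.ift"  # hypothetical base font URL

print(resolve_patch_url("patches/FC.patch", font_url))
# https://example.com/fonts/patches/FC.patch
print(resolve_patch_url("//foo.bar/0/F/07F0", font_url))
# https://foo.bar/0/F/07F0
```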

What do you think?

@annevk
Member Author

annevk commented Mar 12, 2025

If you can only produce paths I don't think the URL parser will ever fail (note that it doesn't necessarily fail for invalid input).

Also, for percent-encoding there's a percent-encode set you need to decide on for all the code points in the ASCII range. https://url.spec.whatwg.org/#example-percent-encode-operations has an example that goes over the various caller options.

@sisidovski

I wonder what the substitution expressions will look like. Once we implement and standardize whatwg/urlpattern#73, will the planned substitution expressions be compatible with URLPattern?

@annevk
Member Author

annevk commented Mar 12, 2025

To the best of my understanding it doesn't really matter, as the template is used as a convenient specification-internal way to generate a set of paths. It's not directly exposed. (I also think that directly calling the relevant percent-encode operations in the URL standard and appending the resulting strings to a list which is then used as the path would be more precise and likely have less overall complexity.)

@garretrieger
Contributor

garretrieger commented Mar 12, 2025

> If you can only produce paths I don't think the URL parser will ever fail (note that it doesn't necessarily fail for invalid input).

For our use template expansion is not limited to just paths. The IFT spec allows them to expand to either relative or absolute URLs.

> Also, for percent-encoding there's a percent-encode set you need to decide on for all the code points in the ASCII range. https://url.spec.whatwg.org/#example-percent-encode-operations has an example that goes over the various caller options.

I did some more checking on what differences exist in which codepoints will be percent encoded when expanding the template (looking at literals only, since we know the substitution values will only contain base64/base32 chars). If you intersect the allowed literal set with the set that is required to be percent encoded, you end up with 0x00-0x20 and everything >= 0x7F. That is a subset of all percent-encode sets used during URL parsing, and as a result I believe it's correct that these are always percent encoded.

For everything else (0x21-0x7E) this effectively pushes percent encoding decisions onto the creator of the template. For example "?" should be percent encoded if part of a path segment, but would be un-encoded when delimiting the query string. The template creator would be responsible for ensuring that percent encoding has been correctly applied so that expansions result in valid URLs that point to the intended resources.
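
A minimal sketch of the rule described in the last two paragraphs, applied to literal template text: code points in 0x00-0x20 and at or above 0x7F are UTF-8 encoded and percent-encoded with uppercase hex, while everything in 0x21-0x7E is copied through unchanged, leaving those decisions to the template creator. The function name is hypothetical.

```python
def encode_template_literals(text: str) -> str:
    """Percent-encode only the code points that are always encoded."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0x20 or cp >= 0x7F:
            # UTF-8 encode the code point, then escape each octet in uppercase.
            out.append("".join(f"%{byte:02X}" for byte in ch.encode("utf-8")))
        else:
            # 0x21-0x7E (including '?', '{' and '}') passes through untouched.
            out.append(ch)
    return "".join(out)


print(encode_template_literals("/a b?q=é"))
# /a%20b?q=%C3%A9
```

In this example the '?' is left alone, so whether it delimits the query string or should have been written as %3F inside a path segment is entirely up to whoever authored the template.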

@annevk
Member Author

annevk commented Mar 12, 2025

I misunderstood. I think the problem @sisidovski highlights is real. At least it seems from https://w3c.github.io/IFT/Overview.html#patch-map-format-1 that one of the inputs is a URI template. That seems more problematic to me as that essentially requires a URI template implementation.

garretrieger added a commit that referenced this issue Mar 18, 2025
- For #259 this significantly reduces the complexity of expansion implementation needed by clients.
- Note that percent encoding must only produce upper case letters.
- Add a note that implementations should consider a simple custom implementation of expansion over reusing a general purpose one.
garretrieger added a commit that referenced this issue Mar 20, 2025
- For #259 this significantly reduces the complexity of expansion implementation needed by clients.
- Note that percent encoding must only produce upper case letters.
- Add a note that implementations should consider a simple custom implementation of expansion over reusing a general purpose one.
@garretrieger
Contributor

Small update here, I've changed the IFT spec to utilize whatwg URL instead of rfc3986 and restricted template syntax to only level 1 which makes for a fairly simple expansion implementation (see: #263). I'll continue to watch the progress on whatwg/urlpattern#73 and once that functionality becomes available we can look into switching over to it instead of rfc6570 style templates.

@svgeesus
Contributor

@annevk does that update resolve your comment?
