[parser] Implement full featured CSS parser #2
Comments
@alexander-akait Thanks for raising this parser discussion! This is an exciting topic. So, do you think this issue covers stylelint/stylelint#5586? I think also |
Does PostCSS's author recognize these problems? And can we not add parser improvements upstream to PostCSS? Because Stylelint is currently part of the PostCSS ecosystem, I think it would be best for backward compatibility if PostCSS accepted requests for parser improvements. |
Thank you for bringing this up and for gathering all this info 🙇 I think performance of PostCSS and the CSS tooling ecosystem built around PostCSS is a complicated subject :) On the one hand PostCSS itself is really really fast. But the way it is designed means that there will be a lot of duplicate work on selectors, values, at-rule preludes, ... It is also written in JS, so it's bound by the constraints of JS engines. Imho it isn't realistically possible to create a new parser that solves everything :
If the constraint that is most important is performance, then it makes more sense to me that Rust or something similar is chosen as a starting point and that other aspects are sacrificed. However, what I consider to be the most valuable part of PostCSS is not performance but the community and existing adoption. There is a very large chance that people already have PostCSS as part of their stack, so the barrier to adding a tool based on PostCSS is low. There is also very little friction within the active PostCSS community, a lot of people are open to collaborate towards a common goal. ( like this here :) ) Starting a community from scratch around a new toolset is not something I am personally interested in :) Not that what I am or am not interested in should stop anyone
Yes : postcss/postcss#1145 It's a known issue that it is a waste that each plugin needs to parse values, selectors, at-rule preludes over and over again.
I tried and "succeeded" : postcss/postcss#1812 I wrote more about why I advised not to merge that : postcss/postcss#1145 (comment) TL;DR; the cost of rolling out that change was too high. But it would have made it possible for multiple consumers to parse from an existing token array instead of starting from a string. For a tool that is mostly read heavy like Stylelint it would have meant a serious performance gain. PostCSS as a host/driver for plugins just works really well and the reason that it is successful is also the reason why we have a performance issue. By hiding a lot of complexity and only exposing a limited Object Model it is much easier to create a simple plugin. But it becomes harder to have a performant "tool chain". My current approach in If I want to create a fallback for the We might be able to do similar things in Stylelint? Something that I haven't tried yet, but that I think could work is to cache parsed values.
Each time you take something out of the cache the entry is removed. This would be extremely sensitive to bugs and any bug would be hard to fix. Also best to discuss this in its own issue. My current goal with packages like Because it is unopinionated, follows the CSS specification, and doesn't support non-standard syntax it is also really stable. Either it implements the specification correctly or it has a bug and a bug can always be fixed in a patch release. (we might still do semver major from time to time, but these should be rare) Many things can also be done at the tokenizer level:
On top of the tokenizer there are the parser algorithms. They are currently limited and only implement the basics for component values. Ideally we extend these to cover more of the CSS syntax. These allow you to do more, because structures like blocks and functions are fully parsed. To actually have a useful Object Model another layer is needed: specialized parsers, which are only invoked when relevant. Things like the media query list parser. This has a complete Object Model but that is also what makes it massive. There are so many node types in this sub-syntax alone.
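The cache idea floated earlier in this comment (an entry is removed each time it is taken out) could be sketched roughly as follows. This is only an illustration: `TakeOnceCache`, `takeOrParse` and the toy `parseValue` are hypothetical names, not part of PostCSS, Stylelint or any `@csstools` package.

```javascript
// Hypothetical take-once cache for parsed values; not an existing PostCSS or Stylelint API.
class TakeOnceCache {
  #entries = new Map();

  // Store a parsed result under the raw source string it was parsed from.
  put(source, parsed) {
    this.#entries.set(source, parsed);
  }

  // Remove and return the entry, so the caller owns it exclusively.
  take(source) {
    const parsed = this.#entries.get(source);
    this.#entries.delete(source);
    return parsed;
  }
}

// Stand-in for an expensive value parser (assumption: real parsers do far more).
let parseCount = 0;
function parseValue(source) {
  parseCount += 1;
  return source.split(/\s+/);
}

// Take from the cache when possible, parse otherwise.
function takeOrParse(cache, source) {
  return cache.take(source) ?? parseValue(source);
}

const cache = new TakeOnceCache();
const first = takeOrParse(cache, '1px solid red'); // cache is empty, so this parses
cache.put('1px solid red', first); // value was not mutated, so hand it back
const second = takeOrParse(cache, '1px solid red'); // reuses the entry and empties the cache
const third = takeOrParse(cache, '1px solid red'); // nothing was put back, so this parses again
```

Removing the entry on `take` means a consumer that mutates the parsed value cannot leak that mutation to the next consumer. The fragile part is the hand-back: putting back a value that was mutated is exactly the kind of bug that would be hard to track down.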
I don't personally use non-standard CSS syntax, everything is plain CSS in a file that has a My main reason not to support these is because they do not have a true standards body behind them and that I lack familiarity with these syntaxes. Correctly following one specification is difficult enough. But having said that, all tools I've created are composable and modular. I want people to be able to re-use the complex and hard parts. Some questions :
|
@romainmenke Thanks for sharing postcss/postcss#1145. Now I understand the context very well. 👍🏼 |
@romainmenke I'll try answering your questions as far as I know:
I don't remember completely, but this project may have some blockers due to insufficient parser libraries.
Unfortunately, I don't know. |
I've tried listing the parser libraries used by Stylelint. Some are barely maintained 😓
Script used to create the table
```javascript
import { spawnSync } from 'child_process';

// All dependencies of the Stylelint release being inspected.
const allDeps = JSON.parse(
  spawnSync('npm', ['view', '--json', 'stylelint@15.6.2', 'dependencies']).stdout.toString(),
);

// Parser-related packages to report on.
const parserDeps = [
  'postcss',
  'postcss-media-query-parser',
  'postcss-resolve-nested-selector',
  'postcss-safe-parser',
  'postcss-selector-parser',
  'postcss-value-parser',
  '@csstools/css-parser-algorithms',
  '@csstools/css-tokenizer',
  '@csstools/media-query-list-parser',
  '@csstools/selector-specificity',
  'css-tree',
];

const dateFormat = new Intl.DateTimeFormat('en', { dateStyle: 'medium' });
const sizeFormat = new Intl.NumberFormat('en', { notation: 'compact' });

console.log(`| Name | Version | Last published | Unpacked size |`);
console.log(`|:-----|:--------|:---------------|---------------:|`);

for (const name of parserDeps) {
  const version = allDeps[name];
  if (!version) {
    throw new Error(`${name} is not in dependencies`);
  }

  let dep = JSON.parse(
    spawnSync('npm', ['view', '--json', `${name}@${version}`]).stdout.toString(),
  );
  // A version range can match several releases; use the newest one.
  if (Array.isArray(dep)) {
    dep = dep.at(-1);
  }

  const lastPublished = dateFormat.format(new Date(dep.time[dep.version]));
  const size = dep.dist.unpackedSize ? sizeFormat.format(dep.dist.unpackedSize) + 'B' : 'n/a';
  console.log(
    `| [\`${name}\`](https://www.npmjs.com/package/${name}) | ${dep.version} | ${lastPublished} | ${size} |`,
  );
}
```

EDIT: This list reflects Stylelint 15.6.2. |
Problems with dependent parsers:
|
Of that list only these seem immediately problematic to me :
They have not been updated even when the CSS specifications that are relevant to them have changed years ago.
I really like the syntax checking it offers and it's not trivial to re-create this feature. |
Oh, there are a lot of messages
I fully disagree:
By default the CSS tokenizer is error-resistant (and the CSS parser too), so we don't need to worry much about non-standard CSS, because by spec it will be a ListOfComponentValues if we can't apply a grammar.
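A rough sketch of that fallback behaviour, under loud assumptions: `componentValues` is a toy whitespace splitter standing in for a spec-compliant tokenizer and parser, and `lengthGrammar` is a made-up grammar for a single length. Only the shape of the idea (try the grammar, otherwise keep the raw component values) comes from the comment above.

```javascript
// Toy splitter standing in for a spec-compliant tokenizer (assumption).
function componentValues(source) {
  return source.trim().split(/\s+/).filter(Boolean);
}

// Hypothetical grammar for a single length: a number followed by a unit.
const lengthGrammar = (values) => {
  if (values.length !== 1) return null;
  const match = /^(-?\d*\.?\d+)(px|em|rem|%)$/.exec(values[0]);
  return match ? { type: 'Length', value: Number(match[1]), unit: match[2] } : null;
};

// Apply the grammar; on failure keep the raw component values instead of throwing.
function parseWithFallback(source, grammar) {
  const values = componentValues(source);
  return grammar(values) ?? { type: 'ListOfComponentValues', values };
}

const standard = parseWithFallback('12px', lengthGrammar);
const nonStandard = parseWithFallback('$gap * 2', lengthGrammar);
```

Here `standard` becomes a typed `Length` node, while the Sass-like input is kept as an untyped list that a rule can still inspect or skip.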
I propose that we not appeal to emotion but return to reality: if the tool is not going to solve problems and does not provide an opportunity to solve them, then it's time to change the tool.
Yes and yes, but we just have incredible performance issues and bugs. Now let's get back to being more constructive:
That is why I suggest following these steps:
Some steps can be split into several; I am fine with that. I would also like to add: I've spent quite a bit of time on a lot of tools and parsers in the postcss ecosystem, and I'm honestly tired. Perhaps this is my last attempt to somehow consolidate all this. If it fails again, I will be very upset again, and ultimately this will lead to us simply losing most of our community in the near future. |
@alexander-akait What a big challenge! 👍🏼 👍🏼 👍🏼 I totally agree with the JS solution against Rust since there is a big JS/CSS community here. Additionally, I agree with starting with a CSS tokenizer and value/at-rule/selector/etc parsers. We will be able to try them in the Stylelint codebase easily. |
Yeah, the performance issue is absolutely clear, I know it very well :) LightningCSS for example is (on the surface) a combo of :
Even when being so much faster, people aren't really that interested, they think it is very cool, but very few are switching to it.
I can understand this, and I feel this too, but this is also exactly why I am hesitant. How can we do a project like this sustainably?
The tokenizer is not something we have to start all over right? https://github.com/csstools/postcss-plugins/tree/main/packages/css-tokenizer#readme |
Yes, this is really a headache for us. 😓
Personally, I think |
Would https://github.com/servo/rust-cssparser be suitable to integrate? It's the CSS parser that Firefox uses. Though its docs do indicate it does not parse into selectors or properties, so it's probably only half a parser. |
CSS preprocessors are on their way out with CSS now having variables, nesting and color modification. I see no compelling reason anymore to use them. |
It's interesting. But I believe it may be hard for our community to maintain Rust code.
I think it's important to keep backward compatibility and extendability for CSS-like syntaxes (Sass/Less etc.) because there are big communities already. At least, we should allow anyone to extend and customize our new parser for such syntaxes. |
One way of supporting preprocessors would be to transpile the Sass/Less code with source maps to CSS, lint the CSS, and then report back the errors with the position obtained through the source map. Maybe this is already how it works with the existing |
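To make the reporting half of that proposal concrete, here is a minimal sketch. Everything in it is assumed for illustration: the `mappings` array is a hand-written, already-decoded stand-in for a real source map, and `originalPositionFor` is a local helper, not the API of any source map library.

```javascript
// Hypothetical, already-decoded mappings from compiled CSS positions back to the Sass source.
const mappings = [
  { generated: { line: 1, column: 0 }, original: { line: 3, column: 2 } },
  { generated: { line: 2, column: 2 }, original: { line: 4, column: 4 } },
  { generated: { line: 4, column: 0 }, original: { line: 8, column: 0 } },
];

// Order positions by line first, then by column.
const compare = (a, b) => a.line - b.line || a.column - b.column;

// Find the closest mapping at or before the reported position.
function originalPositionFor(position) {
  let best = null;
  for (const mapping of mappings) {
    if (compare(mapping.generated, position) <= 0) best = mapping;
  }
  return best ? best.original : null;
}

// A warning produced while linting the compiled CSS.
const warning = { line: 2, column: 9, text: 'Unexpected named color "red"' };

// The same warning, re-pointed at the preprocessor source.
const reported = { ...warning, ...originalPositionFor(warning) };
```

Note that the lookup only resolves to the start of the nearest mapped segment, so columns are approximate. It also covers reporting only; carrying an autofix back through the map is the harder direction.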
I personally don't think it's a good idea to rely on using source maps. I think autocorrection breaks syntax in most cases. |
Right, |
Do we have a flamegraph of |
https://github.com/stylelint/stylelint/blob/main/lib/rules/color-named/index.js#L63-L128
It is eagerly parsing with declaration values with It is then walking the value AST and again eagerly parsing with We also have a color value parser built on top of our tokenizer and parser algorithms : The input to this specialized parser is not a string but component values. As much logic as possible can be done first at the token level, then at the component value level, and only when really needed on fully parsed color values. Each step only does the minimal amount of work. |
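The "cheapest level first" idea can be sketched like this. It is a simplification with stated assumptions: `tokenize` is a toy, not `@csstools/css-tokenizer`; `namedColors` is a three-entry sample; `parseColor` merely stands in for an expensive color parser; and the intermediate component value level is skipped to keep the sketch short.

```javascript
// Tiny sample of named colors (assumption: the real list is much longer).
const namedColors = new Set(['red', 'blue', 'rebeccapurple']);

// Toy tokenizer: identifiers become "ident" tokens, everything else is "other".
function tokenize(source) {
  const tokens = [];
  for (const [raw] of source.matchAll(/[a-zA-Z_-][\w-]*|\s+|./g)) {
    tokens.push({ type: /^[a-zA-Z_-]/.test(raw) ? 'ident' : 'other', raw });
  }
  return tokens;
}

// Stand-in for an expensive, fully parsed color value.
let fullParses = 0;
function parseColor(token) {
  fullParses += 1;
  return { type: 'NamedColor', name: token.raw.toLowerCase() };
}

function findNamedColors(value) {
  // Level 1: a cheap scan over tokens; most values stop here.
  const candidates = tokenize(value).filter(
    (token) => token.type === 'ident' && namedColors.has(token.raw.toLowerCase()),
  );
  // Level 2: only the candidates reach the expensive parser.
  return candidates.map(parseColor);
}

const none = findNamedColors('1px solid #ff0000');
const one = findNamedColors('1px solid Red');
```

The first value never reaches `parseColor` at all, which is the point: for a linter, most declarations can be ruled out before any specialized parsing happens.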
I am fine with it. My suggestions are:
Maybe I missed something else but this is not a problem, we can discuss it in the repository if we can all agree |
I don't have ownership, admin or publish permissions for either the github org or the npm org for It might be better to do a clean slate start. I personally prefer to work in a mono repo because that makes it easier to spot regressions. I agree on all points of feedback related to the current tokenizer. |
@alexander-akait @romainmenke If you wish, providing repositories for parsers etc. under the @stylelint/owners Any thoughts? |
This is something to be aware of: historically, Stylelint has had difficulty attracting contributors at various times. It's been quite challenging at times to allow Stylelint to be extended by plugins, to have Stylelint depend on other packages, and to keep this ecosystem maintained. Another consideration is the eslint/rfcs#99
I've not fully thought through all of this, but alongside writing a new tokenizer/parser, having ESLint under the hood to simplify and streamline maintenance of the underlying CLI and API aspects of Stylelint is also worth thinking about, IMHO. |
@ntwb Thanks for the comment. As you mentioned, Stylelint has needed more maintainers. I personally think @alexander-akait's suggestion is great not only for the Stylelint community but also for other JS/CSS communities. However, unfortunately, supporting the challenge under the Stylelint organization may be risky because of that maintainer shortage. 😓 |
Yes, of course, tokenizer/parser/traverser/serializer, these are things related to the parser process, so it would be great to have them all in one place.
It's so funny, because I offered to do this 5 years ago, when we were just starting work, but was refused everywhere; now it's official. I started from a simple idea: we should make a core for any linter. CLI logic, rules logic, configuration(s), ignoring and extending, options for parsers and rules, fixable logic, etc.: we had to duplicate all this. My logic was that we could avoid this, collaborate and combine the work, and now I see how it all came to this. But unfortunately it is a little late; our code has become more complicated and it would now be quite difficult to rewrite all this (yeah, we can just create a rule and run stylelint inside that rule, but that looks like one big monolithic and badly configurable rule). But now we can avoid some mistakes too. JS has https://github.com/estree/estree, so any parsers which follow estree are compatible, and I think we have to do the same. Yes, it takes time and I definitely can't do it alone, BUT if we do this, then we will become independent of the parser and its implementation in the future: Rust/JS/Zig/C++/C, whatever you want. I still think that the idea of rewriting everything in Rust is a utopia at this moment (the future is foggy and we do not know what will happen tomorrow, but we can influence it). Yes, it would be great and it would give us good performance and much more, but if we look at the world realistically, we will understand that, unfortunately, not many people know Rust, and most of our users know only JS (some TS too). But this does not mean that we should not build the right foundation. If we get to this in time, it will be fine, but for now we can just agree on some documents for AST structures and maybe a basic API. |
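As an illustration of what an ESTree-style agreement buys, the sketch below writes a rule against a node shape rather than against a parser. The shape (`type` plus optional `children`) and the node names `StyleSheet`, `Rule` and `Declaration` are hypothetical; no such shared CSS specification is defined in this thread.

```javascript
// Generic traversal that relies only on the agreed shape: { type, children? }.
function walk(node, visit) {
  visit(node);
  for (const child of node.children ?? []) walk(child, visit);
}

// A lint-style check written against the shape, not against any parser.
function countDeclarations(ast) {
  let count = 0;
  walk(ast, (node) => {
    if (node.type === 'Declaration') count += 1;
  });
  return count;
}

// A tree as a JS parser might build it in memory (hypothetical output).
const fromParserA = {
  type: 'StyleSheet',
  children: [
    {
      type: 'Rule',
      children: [
        { type: 'Declaration', property: 'color', value: 'red' },
        { type: 'Declaration', property: 'margin', value: '0' },
      ],
    },
  ],
};

// A tree as a native parser might hand it over as JSON (hypothetical output).
const fromParserB = JSON.parse(
  '{"type":"StyleSheet","children":[{"type":"Rule","children":[{"type":"Declaration","property":"color","value":"red"}]}]}',
);

const countA = countDeclarations(fromParserA);
const countB = countDeclarations(fromParserB);
```

The same `countDeclarations` runs unchanged on both trees, which is the independence from parser implementation that the comment argues for.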
Hey I just want to introduce myself. I'm working on a shared parser/linter/formatter core, and it is my explicit goal (and full-time job) to unify what can be unified across this ecosystem. I believe myself to be several (important) steps ahead of ESLint in this regard, and as they have also shown me nothing but indifference it seems that I am their open competitor. My project is still flying under the radar for the moment, but I plan for that to change in a major way, and soon. |
Might be an interesting read : https://railsatscale.com//2023-06-12-rewriting-the-ruby-parser/ |
Thanks for sharing the article. I read it. We wish "Universal Parser" for CSS, too! |
The best CSS parser ought to be the one that browsers use. I wonder if Blink's CSS parser could be leveraged 😆. |
Yes and no :) They are the best because they are extremely well tested and are used in the wild by billions. But browsers only need to parse CSS for a limited use case. Those parsers also don't have to support non-standard syntax like scss, less, .... LightningCSS for example uses Servo's CSS tokenizer/parser and that is what makes it good and extremely fast. But it's also the source of all the limitations of LightningCSS. LightningCSS can not be used to build a linter because it discards too many tokens. |
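Why discarded tokens matter for a linter can be shown with a small round-trip check. This is a toy under stated assumptions: `tokenizeLossless` is not a spec-compliant tokenizer, and the "lossy" stream is simply the lossless one with comments and whitespace filtered out, to imitate a parser built for a narrower use case.

```javascript
// Toy lossless tokenizer: comments, whitespace and everything else are all kept.
function tokenizeLossless(source) {
  const pattern = /\/\*[\s\S]*?\*\/|\s+|[^\s/]+|\//g;
  return [...source.matchAll(pattern)].map(([raw]) => ({
    type: raw.startsWith('/*') ? 'comment' : raw.trim() === '' ? 'whitespace' : 'other',
    raw,
  }));
}

// Rebuild source text from a token stream.
const serialize = (tokens) => tokens.map((token) => token.raw).join('');

const source = 'a { color: red; /* keep me */ }\n';
const lossless = tokenizeLossless(source);

// Imitation of a parser that drops what it does not need.
const lossy = lossless.filter((token) => token.type === 'other');

const roundTrips = serialize(lossless) === source;
const lossyOutput = serialize(lossy);
```

With the lossless stream, a rule can report exact positions, respect disable comments and apply a fix without touching unrelated formatting. With the lossy stream the comment and all spacing are gone, so none of that is possible.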
This is where I come in! cst-tokens takes the output of an existing parser and uses it to rebuild a tree in which every source character is present in the token stream. Doing this requires defining the syntax of CSS in a cst-tokens parser grammar, but the parser need not be complete: it does not need to know how to resolve ambiguity. The traversal code simply uses the output of the first-pass parser for that purpose. In this way my project's functionality is closely related to that of ungrammar (which you should also look into though I am focused on extensible grammars and they are not). The cst-tokens CST is also a pure superset of the AST it decorates, and is meant to have all the APIs needed to build any kind of parser, formatter, and linter functionality. It allows comment attachment rules for ambiguous comments to be well-defined, while always preserving the ability to see all possible comment attachments for any given node. |
Another reason there's a strong case for a concrete syntax wrapper around an existing AST is that you don't really have to risk breaking anything!! You use the same parser -- you're just adding a new validator and retokenizer layer, so for your users AND your lint rules the language is guaranteed not to have changed at all! The downside is that the technology isn't ready for production usage yet, and won't be for a little while. Serious users will want to see the library hit 1.0.0, a goal which I've ensured that I can reach and am working directly towards. I'm essentially here asking for help doing the work that makes everything I am describing possible. With the right help I could get to 1.0.0 a lot faster! |
I think it's important to find a place for this effort so that we can split this thread. I don't want to engage too much on specifics but I also don't want to appear dismissive of people reaching out like @conartist6 . I think many people care about this issue and want to collaborate. Maybe any new repository is fine? A place where we can align on priorities, goals, ... |
I can provide a new repository in the github.com/stylelint org, which would be a temporary home for our collaboration. It also would work until we would find a more appropriate home (org). For example, how about |
I'm also interested in this project, and as my time allows, I am happy to help with the planning / implementation. Are you planning to create a Discord server or similar communication platform? |
I like the idea of a CST, but unfortunately generic solutions often perform much worse due to overhead (though I would look at the benchmarks). The original CSS tokens (from the syntax spec) already have everything: whitespace, tokens, etc. It is also good to stay aligned with the spec for maintenance purposes. If someone wants to start, that would be great; I'm a little busy right now. And yes, we need to start with the tokenizer anyway, and we already have solutions we can reuse. |
@alexander-akait @romainmenke Please freely use it. Since the repo may be temporary, you don't need to follow the Stylelint organization rules.
@scripthunter7 |
Thank you @ybiquitous, I will try to get the ball rolling in a few issues in the next few weeks. |
I recently saw @keithamus's csslex, maybe it is something to consider using. |
Thank you for sharing this @silverwind I've started a list of tokenizers here : #1 |
@romainmenke You can transfer this issue to stylelint/css-parser if you wish it. Of course, no problem with as-is. 👍🏼 |
I'm still working on my solution. It won't be fast in the way Rust is zoom-zoom close-to-the-metal fast, but it will be incremental, streaming, extensible, and easy to maintain -- properties that should prove highly advantageous to linters. Right now I'm working on defining an XML-based serialization format that allows my disambiguated trees to be easily sent over a wire. It's a fun example to check out because it both defines the syntax and shows how the parser core works to define syntaxes. https://gist.github.com/conartist6/5adbbf28d11497467848f530756c1c2a |
As for the zoom-zoom part, making that method of defining syntax fast is mostly just a matter of doing some code transformation. For example if you have a production like this:

```javascript
export const productions = {
  *Identifier() {
    yield eat(tok`Identifier`);
  },
};
```

There's a bunch of associated cost from evaluating eat(tok`Identifier`) on every call, so it can be hoisted:

```javascript
const hoisted_1 = eat(tok`Identifier`);

export const productions = {
  *Identifier() {
    yield hoisted_1;
  },
};
```

Now you can see that there's actually a pretty small amount of logic necessary to process any given production! |
What you gain for your effort is the ability to process chunked streams. You don't need to have the entire source in a single stream, as many parsers require so that they can store indexes into the string as state. For a linter this means gaining the ability to lint files larger than fit in memory. Memory usage would be driven more by the complexity of language and query rules than by the size of the file being linted. |
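A minimal sketch of chunked processing, assuming a toy grammar in which tokens are just whitespace runs and non-whitespace runs. `tokenizeChunks` is a hypothetical helper, not part of cst-tokens; it only shows that the state carried between chunks can be one unfinished token rather than the whole source.

```javascript
// Toy streaming tokenizer: holds back the last, possibly unfinished token
// of each chunk instead of indexing into one big source string.
function* tokenizeChunks(chunks) {
  let pending = '';
  for (const chunk of chunks) {
    const parts = (pending + chunk).split(/(\s+)/).filter((part) => part !== '');
    // The final part may continue in the next chunk, so keep it for later.
    pending = parts.pop() ?? '';
    yield* parts;
  }
  if (pending !== '') yield pending;
}

const whole = [...tokenizeChunks(['a { color: red }'])];
const chunked = [...tokenizeChunks(['a { co', 'lor: r', 'ed }'])];
const splitWhitespace = [...tokenizeChunks(['a ', ' b'])];
```

Both `whole` and `chunked` yield the same tokens even though `color:` and `red` straddle chunk boundaries, and a whitespace run split across two chunks is still emitted as a single token.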
Also tokens that index into strings tend to perform badly when you want to insert a token. The structure requires invalidating all other tokens because the indexes of all tokens after the change will need to be updated by some offset. |
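The cost described above can be made visible with a small comparison. Both token shapes here are illustrative assumptions (`{ start, end }` offsets versus `{ raw }` text), and `touched` simply counts the token objects that had to be created or rewritten for one insertion.

```javascript
// Tokens that index into the source: inserting means shifting every later token.
function insertIndexed(tokens, at, raw) {
  const start = at === 0 ? 0 : tokens[at - 1].end;
  const inserted = { start, end: start + raw.length };
  const shifted = tokens.slice(at).map((token) => ({
    start: token.start + raw.length,
    end: token.end + raw.length,
  }));
  return { tokens: [...tokens.slice(0, at), inserted, ...shifted], touched: shifted.length + 1 };
}

// Tokens that carry their own text: only the new token is created.
function insertRaw(tokens, at, raw) {
  return { tokens: [...tokens.slice(0, at), { raw }, ...tokens.slice(at)], touched: 1 };
}

// One token per character of a tiny stylesheet, in both representations.
const indexedTokens = [...'a{b:c}'].map((_, i) => ({ start: i, end: i + 1 }));
const rawTokens = [...'a{b:c}'].map((raw) => ({ raw }));

const indexedResult = insertIndexed(indexedTokens, 1, ' ');
const rawResult = insertRaw(rawTokens, 1, ' ');
```

Inserting one space after `a` rewrites every following offset-based token, while the text-carrying tokens after the insertion point are reused as they are.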
related: biomejs/biome#268 |
Just an idea for future discussion: maybe we can join forces and write a full-featured CSS parser plus at-rule/value parsers from scratch. I am afraid we can't rewrite postcss because of some of its specific logic (and it would probably take longer), so uniting around one CSS parser would be great for any JS tooling. We can open an issue for this.
Briefly, the situation:
- postcss, and we have certain problems/issues which, unfortunately, have not been resolved for a long time, like a CSS-compliant tokenizer and parser, and selector, at-rule and value parsers
- the csstree parser, but it is pretty slow to resolve issues
- lightningcss and swc, but unfortunately they are not quite extensible enough to support all syntaxes; this is probably solvable, so it's just a discussion for now
- csstools, with its own value and at-rule parsers
- postcss-value-parser, postcss-values-parser and postcss-selector-parser; all of them have rather serious limitations and are not so actively maintained. They solve almost all current problems, but a new syntax is usually a problem when it appears. Another big problem is the postcss design: we need to reparse selectors, values and at-rules in each rule, which is very bad for performance (very)
- ListOfComponentValues
Feel free to give feedback.
I decided to start the discussion here, as I think this is the most appropriate place; in the future we may move it or break it into more detailed parts.