-
Notifications
You must be signed in to change notification settings - Fork 696
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Meta: Exact-printer Mega-issue #7544
Comments
We have an intermediate "lexed tree" structure in https://github.com/haskell/cabal/blob/1be581d5463adebf8dc8cc8a5375a9ca3051970a/Cabal/src/Distribution/Fields/Field.hs One idea I would throw out is turning that structure into the main structure we get exactprinting working with (making it comment and whitespace preserving). Then transforms on cabal files involve parsing to that, then running the FieldGrammar parser to get the nice fully typed structure, modifying that and emitting it back to Fields, and then calculating the delta to apply the changes thus made to the original exact Fields. |
There's good prior art here: - https://github.com/phadej/cabal-fmt |
See also the old github project about that: https://github.com/haskell/cabal/projects/9 I've updated it with related issues I've found, but then also marked all the issues with a label, so the indexing is available in both forms now. If the project is not going to be used, please remove it. The labeling will remain, preserving the data. |
I combed through cabal-fmt and here, and I must say, I'm not a fan of how the comments would need to be dealt with if we use that approach. Yes, for solving it works great, it cuts down a considerable amount of time, but for an exact-printer it seems... odd (and potentially making it more complicated than it should be). I really do believe that that the right way would be to parse it all at once, as opposed to appending the comments afterwards. |
Consider that the The GHC / cabal format is special, that line comments are always on own line (!!!) so we could intercalate them in BUT, if we want to represent fiield contents as data structure too then we face the same problem as GHC: build-depends:
base
-- minimum is GHC-7.0
>= 4.3
-- and maximum ..
&& <= 4.17 I expect that cabal-exactprint would allow me to change the I don't see how comments can be attached to that. (I suspect that |
Sure @alanz could help us to compare ghc-exactprint with the requirments for cabal |
Indeed, a tad bit more complex AST would be needed to take care of that... shape, but I believe that's about it. And yes, input from @alanz would certainly be appreciated. |
Sorry all, I have been ignoring this issue for a while, my attention being elsewhere. I will take a decent look in the next few days. |
I don't see how the multiple comments thing is an issue. Your annotation functor would of course contain a list of (maybe) comments, in most cases it would be empty, in others it would be a singleton, and in the most complicated case it would be comments, and skips. Of course you can do the same sort of thing in GHC's case, but it becomes complicated enough there to remove the point I think. Even with such information it'll be complicated to get correct - does a comment associate to the line above or below isn't apparent. For example if you delete data, does a comment go with that data or not? My annotation would be something like [Maybe (WhiteSpace, Comment)] (illustratory only of course). |
First quick comment
Prior to GHC 9.2 this is true, from GHC 9.2 they are in the AST, as part of the exact print annotations. But that is just mechanics of data capture. In ghc-exactprint we work with an item having some form of "anchor", which represents the top-left point from which it gets rendered, much like in a normal pretty printer. This means that whatever leading indentation exists at that point gets preserved as we print items contained in the one being printed. We capture comments as being contained within the span of the AST item being printed, or coming after it for top level items. This is good enough for exact-printing an unchanged AST. If the AST needs to be modified prior to printing, we do a comment re-balancing process, where we try to keep comments associated with the AST item they logically belong to, which means a preceding document comment will be kept with the item following it, and comments between things associate to the thing without a gap, so a trailing comment on the same line, or without a line break will associate with the prior one. This is all very heuristic driven, and up for optimization. To be able to handle that case, as the exact-printer finds comments it throws them into a queue, and then prints any occuring between the current print head position and the next output item. (Logically, the actual mechanics are a bit more complex because everything is deltas). And some day I need to capture all of this in a proper document. Perhaps someone should offer to co-author something with me, to hold my feet to the fire of actually documenting it. |
That's the idea why |
Oki, I think we should keep things as simple as possible, as long as it covers our needs, taking into account the inputs from this issue, I think one acceptable shape for our AST would be something along the lines of: data ExactPrint ann
= Field ann String [ExactPrint ann] -- your field (or section) with its children
| Value ann String (Maybe (ExactPrint ann)) -- the fields' values
| Cont ann String (ExactPrint ann) -- this would mold a single value across multiple lines
| Comment ann String -- comment
| Empty ann -- empty lines, and also to stop the recursion in Cont It would allow enough flexibility to depict a cabal-like format in its entirety while being quite expressive and not overly complicated. Thoughts? |
I really really would like to have a As I said, nodes for Because I think that current |
Do we have any other case where a specific treatment would be desired or is it something exclusive to build-depends? |
I think we've let perfect be the enemy of the good far too long. A nice special representation for build-depends (or any other multiline field) can be built on top of any of the proposed structures, and that should be fine. Even without that, we have 90% of the use cases that we really really want this feature for unblocked. ptkato: go for it! |
Is this supposed to be a separate parser or a replacement for the existing parser in cabal now? |
Separate parser with a different goal compared to |
It's a separate parser because As the resulting parser simply parses anything that has a format of |
Note that there is I think a perfectly cromulent and not too difficult proposal above that would retrofit cabal's existing parser with sufficient exact-parsing capabilities. See this message and the discussion prior: I am aware people are interested in pursuing different independent projects and we welcome that. However, this is a good proposal and I have not seen any objection to it other than that it requires some work to get up to speed enough with the cabal parser, which is complex, to be capable of doing this. So I would encourage people to seriously consider it. |
I don't know how well that representation matches reality, but the annotationless one I have isn't all that long either. My variant-- | Context-dependent whitespace.
newtype Offset = Offset Int
deriving newtype Show
-- | Context-independent whitespace.
newtype Whitespace = Whitespace Int
deriving newtype Show
-- | Anything that follows two consecutive hyphens. Lasts until the end of the line.
data Comment = Comment
Whitespace -- ^ Before double hyphens
Whitespace -- ^ Between double hyphens and text
Text
deriving Show
-- | Any Unicode characters, excluding C0 control codes, '{', '}', ':'.
newtype Heading = Heading Text
deriving newtype Show
-- | Any Unicode characters, excluding C0 control codes, '{', '}', ':' and spaces.
newtype Name = Name Text
deriving newtype Show
-- | Field contents at the same line as the declaration.
data Inline = Inline
Text -- ^ Includes preceding whitespace
| NewlineI
deriving Show
-- | Field contents at the lines following the declaration.
data Line = Line
Text -- ^ Includes preceding whitespace
| CommentL Comment
| NewlineL
deriving Show
-- | Curly bracket syntax, together with the preceding empty space
data Curlies a = CommentC Comment (Curlies a)
| NewlineC (Curlies a)
| Curlies
Whitespace -- ^ Before left bracket
a
Whitespace -- ^ Before right bracket
deriving Show
-- | List with an alternative unparsed representation.
data Gradual u a = Entry a (Gradual u a)
| More u
| End
deriving Show
-- | Section contents with the curly bracket alternative.
data Section = CurlS (Curlies (Gradual [Line] Node))
| NormalS
(Maybe Comment) -- ^ Inline comment
(Gradual [Line] Node)
deriving Show
-- | Field contents.
data Contents = Contents Inline [Line]
deriving Show
-- | Field contents with the curly backet alternative.
data Field = CurlF (Curlies Contents)
| NormalF Contents
deriving Show
data Node = Section
Offset
Heading
Section
| Field
Offset
Name
Whitespace -- ^ Between field name and colon
Field
| CommentT Comment
| NewlineT
deriving Show
newtype Layout = Layout (Gradual Lazy.Text Node)
deriving newtype Show The big problem in my case isn't the definition, it's parsing:
All three are solved with lookahead, so the parser gets quite convoluted. |
(random thoughts and snippets) One can parse and annotate each field with its source like this: readFieldsAnnotatedWithSource :: ByteString -> Maybe [Field ByteString]
readFieldsAnnotatedWithSource bs =
either (const Nothing) (Just . snd . L.mapAccumR annotateField maxBoundPos) (readFields bs)
where
annotateField :: Position -> Field Position -> (Position, Field ByteString)
annotateField finishPos = \case
Field (Name pos name) fls -> (pos, Field (Name (getSrcBetween pos finishPos') name) fls')
where
(finishPos', fls') = L.mapAccumR annotateFieldLine finishPos fls
Section (Name pos name) args fs -> (pos, Section (Name (getSrcBetween pos finishPos'') name) args' fs')
where
(finishPos', fs') = L.mapAccumR annotateField finishPos fs
(finishPos'', args') = L.mapAccumR annotateSectionArg finishPos' args
annotateFieldLine :: Position -> FieldLine Position -> (Position, FieldLine ByteString)
annotateFieldLine finishPos (FieldLine pos xs) = (pos, FieldLine (getSrcBetween pos finishPos) xs)
annotateSectionArg :: Position -> SectionArg Position -> (Position, SectionArg ByteString)
annotateSectionArg finishPos = \case
SecArgName pos xs -> (pos, SecArgName (getSrcBetween pos finishPos) xs)
SecArgStr pos xs -> (pos, SecArgStr (getSrcBetween pos finishPos) xs)
SecArgOther pos xs -> (pos, SecArgOther (getSrcBetween pos finishPos) xs)
getSrcBetween :: Position -> Position -> ByteString
getSrcBetween from to = snd $ splitAtPosition from $ fst $ splitAtPosition to bs
maxBoundPos :: Position
maxBoundPos = Position maxBound maxBound Then manipulate
|
The point of the proposed representation I linked is not brevity or not but rather that it "forgets down" to the existing "Field" represntation, suitable for use downstream in the existing parser pipeline. This means that it can be added as an extension to the existing parser, while a whole new AST would likely not be so simple. The other point of that proposed representation is that it parses efficiently an extension to the existing efficient parser. |
Then you (the maintainers) will have to agree on what the scope of this task is and how you see Cabal's parsing working in the future. In my view the current bidirectional all-versions-in-one double-specialization parser is extremely unwieldy (e.g. buildInfoFieldGrammar). I suspect this is, at least in part, the reason for why format changes over the versions have been quite anemic, only adding and deprecating specific fields, never cleaning up. If everyone is satisfied with the status quo and a mere extension of the lexer is deemed the easiest approach, then I'm not the person to solve this as I have no interest in tag soup parsers. |
I think an independent library that can parse, modify, and exactprint cabal files would be quite nice. The concern would be that it might not keep pace with the spec, while doing so in an integrated way would. If your intent remains, as stated above, to write an independent library that does not replace what currently exists, then please go for it! All sorts of independent tools can be written on top of it and integrated with cabal using the external command system, and this could well be a good approach. That said, I do not see a path to replacing the entire existing cabal parser, with its needs for efficiency, back-compat, etc. At this point what we can do is only evolve it. |
The spec doesn't change so very often, I thought. |
While it is a lot of work due to the sheer number of field formats and caveats, I don't see either efficiency or backwards compatibility as issues:
On top of it all of this should be completely pure: downloading all of Hackage, and then checking exactness of the exact parser and errors emitted by the representation one, should be good enough tests for both of them. |
Similarly to what @gbaz says, I welcome and support anybody to work on an independent parsing library. BUT I also doubt of the long term benefits of such efforts, since Please, do dive in the code, try your ideas! IMHO the only way out of the tarpit is through iterative improvements. I know the parsing part could be daunting (from memory I count 4 "levels", the lexer, the parser from tokens to Try to take them a part, try to replace one, why do they exist? what do the do? can we evolve it? which bit first? and how? Many would prefer wqorking on a green field project but Cabal has many years of history and many constraints to satisfy. Here lies the challenge. [1] which is not formally specification elsewhere, perhaps one could start from this first. But we would need a way to determine if such specification matches the current implementation. Any idea is welcome. |
But there's short-term benefits, no? Projects like HLS are waiting for a good exact parser of any kind, and while long-term, it might be nice to retrofit the parser in |
Yes, there are short term benefits. I cannot disagree with that. Exploring the design space is also something that can become a long term benift; but there is also the usual trap of getting 80% of the design right and never finish the remaining 20%. |
It changes slightly (in terms of additions of new fields) with almost every Cabal major release: https://cabal.readthedocs.io/en/stable/file-format-changelog.html |
I would love to collaborate on a project with this type of specification. Something that could start small - I like the idea of just learning a bit about the answers to these questions. An initial aim could be just to get a "rough guide" to cabal syntax, and the library, that could maybe help the next generation of intrepid explorers. I don't know too much about the code base, but I have parsed cabal files using the lbrary and some (flatparse) logic. |
I get the impression this is somewhat demotivating to potential new contributors, so I wish to suggest a path to replacing cabal parser:
Once the maintainers feel confident the exact print parser is good enough in terms of correctness and performance, |
Having thought about it for a bit more, curly bracket syntax forces parsing to be depth-first, as any deeper level may introduce a breaking offset change: foo
bar {
baz
this {
that
}
end } As such streaming is limited to one top-level element at a time, and to prove the element ends at a correct point its entire skeleton must be parsed. For any future format-inventors: I think the sane approach would be to instead use a special literal at the start of the line, only interpreting it as an offset break if its offset is smaller than of the element its inside of:
Same rule would make sense for comments inside fields:
|
I understand how @gbaz's comment might have come across as demotivating. But "evolve it" and "replace it" are not necessarily in opposition. Just like Theseus's ship, we can evolve it by replacing one part at the time (or should I say replace it by evolving one part at the time? 🤔).
I appreciate you have been spending time one this and I will comment on some technical aspects of it; but what I want to understand first is: are you willing to do the work?
This is done, here is a freshly baked new parser written by @VeryMilkyJoe. I also notice that @liamzee has already contributed to it. I might disagree with some of the technical choices but let's just roll with the To have a chance of being able to use it Cabal-syntax, it has to depend only on boot packages so you have to replace megaparsec with parsec.
You are welcome to do this.
Which version you are talking about here? The
At the cost of impoving anything. If Cabal operated on Also, you will realise that The actual fields definition is in the various
Define "output":
(while considering myself only a contributor) Yes please! You can also use the test-suite
Again which legacy AST you are referring to here?
In terms of benchmarks I believe we have only As far as I understand:
Once the problems above have been resolved, I believe nobody will stand in the way. I apologise if I have come across a bit rude. By looking at the codebase, I think these are the problems we have to face. I am eager to see them discussed or better still solved. Of course, I would be also happy to be called out if someone thinks that I am too entrenched in the current architecture and everything can be easily simplied. Show me! |
No the intention was to hire @LemonjamesD to do this, but she has to know what to do.
Yes so I get the impression the best approach is to modify types such as
Any of those would work depending on what you're mapping into,
I perhaps mistakenly assumed the current way of parsing needed replacing.
It's fine you provided a goldmine of information, at least I can now make a sorta scoped contract rather then "oh go do this thing". New approach based on (1):
This approach makes the work easier to split as well because you can just make test by test pass. Where a test is a cabal file being roundtripped. |
Seems like the last idiosyncrasy on my list is section name parsing, allowing
No manifests on Hackage utilize this as a format escape, only ds-kanren/metric patches do. |
What is this issue?
A central place to discuss the issue of a Cabal exact-printer so that we can push forward the work all in one place. Currently, The discussion stretches back across many issues over the past 6 years.
What is it?
An exact-printer, as inspired by
ghc-exactprint
, is a byte-for-byte bidirectional parser and pretty printer, which gaurantees the following constructs exist and are principled:.cabal
file source tree representation, which is in bijection with the package description for a.cabal
file for a given spec version. This representation should be parametrized with all offset and encoding information along with file data..cabal
file entries. I should able to update and transform individual components of data deep in the representation without modifying the broader structure.cabal-testsuite
andcabal install
's test suite for the individual commands.Why do we need it?
This work is tied very closely with the
format
,init
,gen-bounds
and other work as shown in historical discussions:cabal init
. #6187This would free us up for several important quality of life improvements to the
cabal install
ecosystem, as well as general consumers of theCabal
library API. We would like to have this in byCabal-3.8.0.0
. The important bits this enables include:cabal gen-bounds
- will be able to updatebuild-depends
stanzas in a principled manner, so that bounds management will be automatic in the future if the user so desires.cabal format
- we hope that a.cabal
format can have a canonical shape, so we can limit the amount of deviations from the norm, or ad-hoc formatting choices present in the ecosystem.cabal init
becomes much easier to manage, and we no longer have to dedicate redundant types, formatting, and pretty printing to this effort.Who's in charge of this?
I am currently overseeing @ptkato who has been tasked with taking this on. Please consider him and myself a point of contact for this work, and join us in
libera.chat#hackage
to discuss.@ptkato @gbaz @davean
The text was updated successfully, but these errors were encountered: