Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make --phar reproducible #114

Open
sebastianbergmann opened this issue Dec 7, 2023 · 34 comments
Open

Make --phar reproducible #114

sebastianbergmann opened this issue Dec 7, 2023 · 34 comments

Comments

@sebastianbergmann
Copy link
Contributor

The only thing that was missing for the build of PHPUnit's PHAR to become reproducible was to set the timestamps of the files in the PHAR to a consistent date.

I have implemented that in sebastianbergmann/phpunit@4627215 using seld/phar-utils.

However, I do not like that the PHAR is now changed after it has been created using PHPAB.

It would be nice if PHPAB would do this while it creates the PHAR. What do you think, Arne?

@sebastianbergmann
Copy link
Contributor Author

CC @drupol

@theseer
Copy link
Owner

theseer commented Dec 7, 2023

Interesting idea. I'll have a look at it.

I was hoping to not having to touch phpab anymore ;-)

@drupol
Copy link

drupol commented Dec 7, 2023

Once again, my fault ... Sorry 😖

@theseer
Copy link
Owner

theseer commented Dec 7, 2023

No worries. It's a very reasonable request :)

@theseer
Copy link
Owner

theseer commented Dec 8, 2023

I just tried to implement this - and failed.

The current implementation uses iterators to add the files into the phar. Should be fairly straight forward to hook into this.

The docs on phar::buildFromIterator state an SplFileInfo should be returned. So I implemented my own wrapping IteratorIterator that is (verifiably) constructing my custom SplFileInfo for each entry.

But none of my overwritten methods are ever called.

I do not currently see any means to control the meta data in the phar at build time. The cli tool mentioned does a binary patch of the raw data. It's not actually using the phar API.

I'm not sure what to do here atm..

@theseer
Copy link
Owner

theseer commented Dec 8, 2023

Okay, reading the php-src (https://github.com/php/php-src/blob/master/ext/phar/phar_object.c) helped shedding some light.

If I understand the inline docs for ::buildFromIterator correctly, ext/phar is always, no matter what, using the info from the actual file being added. I have no idea why there is support for SplFileInfo at all, as nothing is actually used from it.

There are basically two ways - despite the many different APIs - to add things into the phar: By having ext/phar get pretty much everything from the actual source file, or by using addFromString - which does not provide any means to set timestamps.

So, looks like for the time being, there is nothing I can do here - at least not by using the ext/phar-API: The time stamp has to be either "correct" before the file gets added into the phar or we have to patch the created phar file.

I could, in theory, include Jordi's library and introduce this as a post-processing step. But that's not much different from what you do now already.

@sebastianbergmann
Copy link
Contributor Author

Thank you for looking into this.

@sebastianbergmann
Copy link
Contributor Author

Maybe @nielsdos, who has done a bit of work on ext/phar recently, can help? It would be great if starting with PHP 8.4 it would be possible to set a timestamp for all files that are added to a PHAR using ext/phar.

@nielsdos
Copy link

@sebastianbergmann how should the API look like? Something like setTimestamp(string $filename, int|DateTimeInterface $dateTime) ? That seems doable.

@sebastianbergmann
Copy link
Contributor Author

Thank you for your reply, Niels. To be honest, I have not really thought about the API for this yet. The first thing that came to mind was an optional DateTimeImmutable argument for Phar::buildFromIterator() to set the timestamp for all files that are added to that timestamp. Or a Phar::setTimestampForAllFiles(DateTimeImmuable) method.

What do you think, Arne?

CC @Seldaek as this could, eventually, hopefully, at some point in the future, make Seld\PharUtils\Timestamps superfluous.

@theseer
Copy link
Owner

theseer commented Apr 30, 2024

I'm not yet convinced that having a specific API for timestamps on ext/phar would be the best way to approach this, as that's rather limiting. Technically, there can be a lot of properties on a file or directory additionally to "only" the timestamp (which one actually? mtime? atime? ...?).

The most obvious thing for me to control meta data was to implement my own SplFileInfo using my own Iterator. It's rather unexpected that none of the methods of my object instance are called. Maybe that could be changed?

If we want to enhance the ext/phar API, we could consider adding an (optional) SplFileInfo to addFromString to control meta data as there currently is no way at all to specify properties for the respective entry.

@drupol
Copy link

drupol commented Apr 30, 2024

Hello,

Glad to see this conversation happening, as the outcome of this is a definitive improvement for reproducible PHARs.

Adding @theofidry which might be interested to participate.

@nielsdos
Copy link

What metadata besides timestamps do you want to control that you can't control now? Uid and gid come to my mind.

The most obvious thing for me to control meta data was to implement my own SplFileInfo using my own Iterator. It's rather unexpected that none of the methods of my object instance are called. Maybe that could be changed?

I think we probably should make this work. If you still have the reproduction test case please share it. Also, this is imo orthogonal to improving the API for Phar to allow overriding metadata.

If we want to enhance the ext/phar API, we could consider adding an (optional) SplFileInfo to addFromString to control meta data as there currently is no way at all to specify properties for the respective entry.

Not a fan of this particular idea. SplFileInfo is tied to an actual file normally, so that seems like abusing the class for the purpose of holding metadata.

@theofidry
Copy link

theofidry commented Apr 30, 2024

Hi!

What metadata besides timestamps do you want to control that you can't control now? Uid and gid come to my mind.

I do not know if there is other metadata that needs to be manipulated.

From a reproducibility perspective the only thing I'm missing is rather clarification of what can cause a PHAR give different signatures.

We had the case with PHPStan on several occasions that the content the PHAR is built looked identical but for the signature. We tried to look deeper into it but besides the signature, the content of the PHAR looked identical. The only difference I could note was when looking at the hexes, the order of the files looked different.

So to me it would be great to know exactly what defines the signature and, provided built the PHAR from the exact same source and the timestamp and algorithm are set, what could still make it that the built PHARs are not identical.

I assume the system filesystem could be a factor, or LC_COLLATE, or the compression algorithm that may not be deterministic? Would be great to have more information there.

A bit unrelated so I don't want to expend to much on it here, but I'm trying to compile some improvements I would like to see in the PHAR extension in box-project/box#1154 (it is entirely unpolished and probably missing some stuff still).

Back to the topic though, I would be more than content with Phar::setTimestamps(DateTimeImmutable): void, but it needs to respect/account for the signing algorithm. The problem is already solved in user-land with https://github.com/Seldaek/phar-utils (which has already been mentioned). The entire code in Box for it currently looks like this:

// src/Box.php

public function sign(
    SigningAlgorithm $signingAlgorithm,
    ?DateTimeImmutable $timestamp = null,
): void {
    if (null === $timestamp) {
        $this->phar->setSignatureAlgorithm($signingAlgorithm->value);

        return;
    }

    // Ensure all the changes done to the PHAR are "committed" before we manipulate the file
    $phar = $this->phar;
    $phar->__destruct();
    unset($this->phar);

    $util = new \Seld\PharUtils\Timestamps($this->pharFilePath);
    $util->updateTimestamps($timestamp);
    $util->save(
        $this->pharFilePath,
        $signingAlgorithm->value,
    );

    $this->phar = new Phar($this->pharFilePath);
}

@drupol
Copy link

drupol commented May 1, 2024

I do not know if there is other metadata that needs to be manipulated.

File and directory permissions can be manipulated.

We had the case with PHPStan on several occasions that the content the PHAR is built looked identical but for the signature. We tried to look deeper into it but besides the signature, the content of the PHAR looked identical. The only difference I could note was when looking at the hexes, the order of the files looked different.

That would be also amazing if we could fix that at the same time.

@theseer
Copy link
Owner

theseer commented May 1, 2024

What metadata besides timestamps do you want to control that you can't control now? Uid and gid come to my mind.

I actually don't remember exactly what the phar file adding preserves in terms of metadata. In theory, it could be pretty much everything that belongs to a file.

The most obvious thing for me to control meta data was to implement my own SplFileInfo using my own Iterator. It's rather unexpected that none of the methods of my object instance are called. Maybe that could be changed?

I think we probably should make this work. If you still have the reproduction test case please share it. Also, this is imo orthogonal to improving the API for Phar to allow overriding metadata.

I might still have it as a PoC but I can also create a new one. It wasn't not that much code. :)

If we want to enhance the ext/phar API, we could consider adding an (optional) SplFileInfo to addFromString to control meta data as there currently is no way at all to specify properties for the respective entry.

Not a fan of this particular idea. SplFileInfo is tied to an actual file normally, so that seems like abusing the class for the purpose of holding metadata.

Point taken - but only if you want to add something that is not resembling a file within the phar. Otherwise it would make perfect sense to me. The reason I was suggesting this is we then have a symmetric API. It feels a bit odd to have SplFileInfo when using iterators and something else when adding by string. But looking at the phar API, there are more methods that would need touching to accomplish this. So, maybe, it indeed is not a good idea ;)

The reason I'm not happy with the "generic" setTimestamp approach is that it would only cater the single use case of reproducibility - making the assumption that every entry in the archive should be having an identical timestamp and we want to explicitly ignore any other meta data that may be set within the phar for a file or directory.

So maybe, we need to spent some more time on thinking about what we actually want to accomplish first - not limiting us to the simplest solution that would "merely" make reproducibility work. And then what the best API is. And what to do with regards to BC. (Maybe the answer to the later turns out to be a wrapper "ReproduciblePhar" ;) )

@nielsdos
Copy link

nielsdos commented May 1, 2024

@theofidry

From a reproducibility perspective the only thing I'm missing is rather clarification of what can cause a PHAR give different signatures.

We had the case with PHPStan on several occasions that the content the PHAR is built looked identical but for the signature. We tried to look deeper into it but besides the signature, the content of the PHAR looked identical. The only difference I could note was when looking at the hexes, the order of the files looked different.

Filesystem file order is indeed non-deterministic. Could be fixed probably by using a sorted iterator.

A bit unrelated so I don't want to expend to much on it here, but I'm trying to compile some improvements I would like to see in the PHAR extension in box-project/box#1154 (it is entirely unpolished and probably missing some stuff still).

Is this an RFC you're going to pursue yourself?
And are you able to do the implementation yourself or do you need help?


@theseer

I might still have it as a PoC but I can also create a new one. It wasn't not that much code. :)

Well, feel free to share something and I'll take a look.

So maybe, we need to spent some more time on thinking about what we actually want to accomplish first - not limiting us to the simplest solution that would "merely" make reproducibility work. And then what the best API is. And what to do with regards to BC. (Maybe the answer to the later turns out to be a wrapper "ReproduciblePhar" ;) )

Right, if someone can come up with a proposal I can do the implementation.

@theofidry
Copy link

Filesystem file order is indeed non-deterministic. Could be fixed probably by using a sorted iterator.

Just to clarify: this is not possible at the moment right?

IIRC I tried to use addFile() or addFromString() at some point but either it was not enough or it was much slower than using buildFromDirectory().

Is this an RFC you're going to pursue yourself?
And are you able to do the implementation yourself or do you need help?

0 chance I can implement any of them 😅 At the moment those are just very rough notes on things that came to my mind so they need to be fleshed out a bit more and would also be good to review what would be really helpful with the PHAR building process or PHAR manipulation. Happy to spend more time on it and create an RFC for it if you or someone can actually implement them.

@Seldaek
Copy link

Seldaek commented May 2, 2024

Filesystem file order is indeed non-deterministic. Could be fixed probably by using a sorted iterator.

Having deterministic ordering by default would for sure be nice. In Composer we do build the phar using a sorted iterator, and we do add everything via addFromString so yes it would also be nice to have a way to specify metadata in there (I guess full access to file modes, mtime, or whatever ends up actually written in the archive), then we would not have to use phar-utils to overwrite timestamps after the fact.

I think my ideal API would be something like new params for addFromString (and possibly addFile too to override file properties), like addFromString($path, $contents, $mtime = null, $mode = null). That'd be easy enough to use.

If you're using buildFromDirectory or from iterator, then it's a bit more complex to override a specific file's info, but I think forcing someone to use the addFile/addFromString to do that would be reasonable.

@nielsdos
Copy link

nielsdos commented May 3, 2024

Filesystem file order is indeed non-deterministic. Could be fixed probably by using a sorted iterator.

Just to clarify: this is not possible at the moment right?

You can do this with your own code.

0 chance I can implement any of them 😅 At the moment those are just very rough notes on things that came to my mind so they need to be fleshed out a bit more and would also be good to review what would be really helpful with the PHAR building process or PHAR manipulation. Happy to spend more time on it and create an RFC for it if you or someone can actually implement them.

I can help with the implementation.

Having deterministic ordering by default would for sure be nice.

Operating systems generally don't make a guarantee about filesystem order. You'll have to do this yourself.

I think my ideal API would be something like new params for addFromString (and possibly addFile too to override file properties), like addFromString($path, $contents, $mtime = null, $mode = null). That'd be easy enough to use.

I disagree. If you have to add those parameters to different APIs, it's better to just provide separate APIs to override that metadata where the doesn't care about how a file was added.

@Seldaek
Copy link

Seldaek commented May 3, 2024

I disagree. If you have to add those parameters to different APIs, it's better to just provide separate APIs to override that metadata where the doesn't care about how a file was added.

Good point, it's such a rarely used api tho i don't really mind how it looks i have to say. As long as it works and it's documented anything will be great :)

Operating systems generally don't make a guarantee about filesystem order. You'll have to do this yourself.

I'm aware i just think the phar could sort files at least when creating from iterator and maybe even sort whatever is added by filename but i can see that's probably too opinionated so I can also live with the current situation for sure.

@theofidry
Copy link

You can do this with your own code.

@nielsdos I'll have to double check, but IIRC using this over using buildFromDir() was like x10 slower. I'll bench it once I get back home

@nielsdos
Copy link

nielsdos commented May 4, 2024

@theofidry I have done performance improvements to different extensions of PHP before, if you give me code I can check what the bottleneck is and try to improve it.

@theofidry
Copy link

Cool, will try to check it ASAP

@theseer
Copy link
Owner

theseer commented May 4, 2024

@nielsdos Here's a reproducer for the "no method called on SplFileInfo":

<?php declare(strict_types=1);

class MySplFileInfo extends SplFileInfo {

    public function getPathname(): string {
        echo "[Pathname]";
        return '';
    }

    public function getMTime(): int {
        echo "[MTime]";
        return 0;
    }

    public function getCTime(): int {
        echo "[CTime]";
        return 0;
    }

    public function getATime(): int {
        echo "[ATime]";
        return 0;
    }

}

class MyIterator extends RecursiveDirectoryIterator {

    public function current(): SplFileInfo {
        echo "[ Found: " . parent::current()->getPathname() . " ]\n";
        return new MySplFileInfo( parent::current()->getPathname());
    }
}

$workdir = uniqid('/tmp/');
mkdir($workdir . '/content', recursive: true);
file_put_contents($workdir . '/content/hello.txt', "Hello world.");

$phar = new \Phar($workdir . '/test.phar');
$phar->startBuffering();
$phar->buildFromIterator(
    new RecursiveIteratorIterator(
        new MyIterator($workdir . '/content', FilesystemIterator::SKIP_DOTS)
    ),
    $workdir
);
$phar->stopBuffering();


$result = new \Phar($workdir . '/test.phar', 0, 'test.phar');
var_dump($result['content/hello.txt']);

This produces the following output on my system (PHP 8.3.6):

$ php iterator.php 
[ Found: /tmp/6636b783648e1/content/hello.txt ]
object(PharFileInfo)#3 (2) {
  ["pathName":"SplFileInfo":private]=>
  string(53) "phar:///tmp/6636b783648e1/test.phar/content/hello.txt"
  ["fileName":"SplFileInfo":private]=>
  string(9) "hello.txt"
}

nielsdos added a commit to nielsdos/php-src that referenced this issue May 5, 2024
While it is possible to return a custom SplFileInfo object in the
iterator used by buildFromIterator(), the data is not actually used from
that object, instead the data from the underlying internal structure is
used. This makes it impossible to override some metadata such as the
path name and modification time.

The main motivation comes from two reasons:
- Consistency. We expect our custom methods to be called when having a
  custom object.
- Support reproducibility. This is the original use case as requested in
  [1].

Add support for this by calling the getMTime() and getPathname() methods
if they're overriden by a user class.

[1] theseer/Autoload#114.
@nielsdos
Copy link

nielsdos commented May 5, 2024

@theseer I've made a draft PR that fixes your problem here: php/php-src#14143
It adds support for having custom SplFileInfo objects. Because Phars only have one timestamp, and it represents the modification time, only getMTime() will be called in your example and that will be the timestamp Phar uses for the file. It also supports overriding getPathname() which is likely only useful for directory iterators.
If you can, please try out the PR and check if I have forgotten anything. But I think buildFromIterator() is the only API that actually handles SplFileInfo objects as input unless I'm mistaken.

@theofidry
Copy link

Created the benchmark here. Will publish the results tomorrow off to bed.

I however realise I cannot simply switch to addFromString() in Box as the content changes (e.g. I execute composer dump-autoload) so using this method would require to do some additional FS calls. I guess I would have to use buildFromIterator() instead (for deterministic results).

@theofidry
Copy link

Here are the results:

with PHP version 8.3.1, xdebug ❌, opcache ❌

\PharBuildBench\BuildFromStringBench

    bench...................................I9 - Mo20.496s (±3.04%)

\PharBuildBench\BuildFromDirBench

    bench...................................I9 - Mo128.825ms (±8.87%)

\PharBuildBench\BuildFromStringWithBufferingBench

    bench...................................I9 - Mo18.971s (±0.88%)

@Seldaek
Copy link

Seldaek commented May 7, 2024

Yes composer phar build is slow too due to addFromString but imo that's normal and even stated in the docs:

Note: Phar::addFile(), Phar::addFromString() and Phar::offsetSet() save a new phar archive each time they are called. If performance is a concern, Phar::buildFromDirectory() or Phar::buildFromIterator() should be used instead.

The only way to speed this up would be a new api which buffers and requires explicit save() to commit changes to disk. But imo phar creation isn't a super performance critical path so not sure if it is worth the hassle.

@theofidry
Copy link

theofidry commented May 7, 2024

The only way to speed this up would be a new api which buffers and requires explicit save() to commit changes to disk. But imo phar creation isn't a super performance critical path so not sure if it is worth the hassle.

I thought that was what the start buffering and stop buffering was about...

But imo phar creation isn't a super performance critical path so not sure if it is worth the hassle.

I agree to a degree, feels like a jump from 120ms to 20s is... meh. Not critical, neither high priority for sure though yes.

@nielsdos
Copy link

nielsdos commented May 7, 2024

It's indeed the fact that the phar is flushed that makes it slow. But the surprising detail is that it's not the writing of the file itself, it's the computation of the signature. The signature computation takes 95% of runtime according to perf profiling...
Indeed it is strange though that addFromString does not respect buffering. Maybe it should, but there are probably people that depend on the current behaviour either knowingly or unknowingly, so even though it would be a simple change to make it respect buffering, I am reluctant to do so.
Not sure what to do here.

@Seldaek
Copy link

Seldaek commented May 8, 2024

Ok interesting.. is it not respecting buffering at all? In a quick test here it seems to me like it is not writing to the file until the end when buffering, but it is perhaps updating the signature on every file add? Perhaps that needs fixing so signature is only updated if getSignature is called, or if stopBuffering/__destruct is called?

Maybe it should, but there are probably people that depend on the current behaviour either knowingly or unknowingly

This sounds like really something that should be able to be fixed if it is really broken, sure it might break some odd use case but it's clearly not working as advertised and the users of phar-writing aren't all that many IMO.

@nielsdos
Copy link

nielsdos commented May 8, 2024

Ok interesting.. is it not respecting buffering at all?

Nope, it writes the Phar file, just not to the file path you provided.

In a quick test here it seems to me like it is not writing to the file until the end when buffering

This is actually wrong, a Phar file with signature and contents is written in the /tmp directory when buffering is enabled and using addFromString. Why? I'm not sure why phar_flush still does this but I think it's because some operations internally expect to work on a file. And I think the reason addFromString does not respect buffering is simply an oversight.

Maybe it should, but there are probably people that depend on the current behaviour either knowingly or unknowingly

This sounds like really something that should be able to be fixed if it is really broken, sure it might break some odd use > case but it's clearly not working as advertised and the users of phar-writing aren't all that many IMO.

Perhaps, this would need discussion on the ML and in the worst case an RFC.
It's actually a trivial code change to fix this, it's just adding a check to see if donotflush is set. EDIT: debatable
The problem is also that, in typical PHP fashion, the number of helpful comments in ext/phar can be counted on one hand and the test coverage is not amazing. Therefore, understanding why something was made the way it was, is often hard to understand.

The donotflush was implemented in php/php-src@8de7bd6 and at that time it did actually have the behaviour of not writing a phar at all.

@nielsdos
Copy link

nielsdos commented May 8, 2024

Actually, it turns out almost any operation can create the temporary file. The fact the temporary file is written is actually related to handling compressed files within the phar. This entire performance issue is related to php/php-src#13055

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants