
rework the cache so that it also provides virtual folders like core, extra, community etc. #11

Closed
RubenKelevra opened this issue Feb 7, 2020 · 13 comments

Comments

@RubenKelevra
Owner

This allows a script to cache updates:
#4

@RubenKelevra
Owner Author

fixed in 4cd1151

@guysv

guysv commented Feb 20, 2020

Hmm, I think there is a missing piece here. The db files should be linked to pkg.pacman.store/arch/x86_64/default/$repo/$repo.db

If I set pkg.pacman.store/arch/x86_64/default/$repo/ as my package mirror and try to perform an upgrade, pacman will try to download the repo db from pkg.pacman.store/arch/x86_64/default/$repo/$repo.db, which now 404s.
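
A minimal sketch of the linking this suggests, assuming the cluster's folder is built in the import node's MFS (all paths here are hypothetical, the real import scripts may differ):

# hypothetical sketch: link each repo db into a per-repo folder in MFS
ipfs files mkdir -p /arch/x86_64/default/core
ipfs files cp /ipfs/<core-db-CID> /arch/x86_64/default/core/core.db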

@guysv
Copy link

guysv commented Feb 20, 2020

I know the recommended usage is via FUSE, but I'd rather sync via HTTP (and pay double in disk space) than mount the internet on my filesystem.

@RubenKelevra
Owner Author

> Hmm, I think there is a missing piece here. The db files should be linked to pkg.pacman.store/arch/x86_64/default/$repo/$repo.db
>
> If I set pkg.pacman.store/arch/x86_64/default/$repo/ as my package mirror and try to perform an upgrade, pacman will try to download the repo db from pkg.pacman.store/arch/x86_64/default/$repo/$repo.db, which now 404s.

Yes, indeed, I'm intentionally breaking the conventions of a regular mirror in this case.

The HTTP gateways are IMHO not meant for downloading update binaries from IPFS.

I just want to avoid people using the cluster through a public IPFS HTTP gateway like ipfs.io to update their servers, and then publicizing that as the optimal way to avoid censorship or do distributed updates, you know?

Additionally, there need to be checks that the db file is not stuck in time, for example if the import got stuck, the project died, or the server importing into the cluster has an outage.

Not checking when the update was last refreshed, while having IPFS configured as the first server, has serious security implications: you would silently stop receiving security updates.

The dbs are meant to be copied from IPFS to the local filesystem with something like ipfs cat and a locally running IPFS client, after checking the lastupdate file (currently not yet in the folder).
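
A rough sketch of that flow; the lastupdate path, the db path, and the 24 h freshness limit are assumptions (Arch's lastupdate files contain a Unix timestamp):

# hypothetical sketch: only copy the db if the snapshot was refreshed recently
last=$(ipfs cat /ipns/pkg.pacman.store/arch/x86_64/default/lastupdate)
if [ $(( $(date +%s) - last )) -lt 86400 ]; then
    ipfs cat /ipns/pkg.pacman.store/arch/x86_64/default/core/core.db > /var/lib/pacman/sync/core.db
fi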

The packages, on the other hand, can be downloaded by browsing. If you need a specific one, like an older version, you can also download the corresponding database, which holds the signature for this package - that's the reason I keep the databases in the snapshots.

The cache folder is meant to be mounted as a read-only cache for pacman, which avoids - as you pointed out - having to spend storage on packages twice.

If you'd like to keep local copies of the packages you've installed, you can always pin them in IPFS with a script.

You just need to fetch the package list with versions and match it against the files in ipfs://pkg.pacman.store/[..]/cache/.
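
A rough sketch of such a script, assuming pacman's <name>-<version> file-name prefix matches the cache entries (the listing path is taken from the commands below):

# hypothetical sketch: pin the cache entries that match the installed packages
cid=$(ipfs name resolve pkg.pacman.store)
ipfs ls "$cid/arch/x86_64/default/cache/" > /tmp/cache.lst    # lines: "<CID> <size> <name>"
pacman -Q | while read -r name ver; do
    grep -F " $name-$ver-" /tmp/cache.lst | awk '{ print $1 }'
done | xargs -r -n1 ipfs pin add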

To get a better overview of the files you've pinned, I recommend also copying them into an MFS structure, like:

ipfs name resolve pkg.pacman.store

With the CID:

ipfs ls /ipfs/<CID>/arch/x86_64/default/cache/

And then copy the CID of the package you have installed to a local folder:

ipfs files mkdir /pacman-cache-pkg
ipfs files cp /ipfs/<CID> /pacman-cache-pkg

Note that files added with ipfs files cp are lazily 'mounted' and not necessarily fully stored locally.

So if you run ipfs files cp on something really large, it will still return very quickly; the data is only fetched when you access it.

You can circumvent this behavior by additionally pinning it, and unpinning it afterwards (otherwise the data will still be pinned when you remove it from the MFS).
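
For example (a sketch; <CID> and <pkgname> are placeholders):

# pinning forces the full data to be fetched and kept locally
ipfs pin add /ipfs/<CID>
ipfs files cp /ipfs/<CID> /pacman-cache-pkg/<pkgname>
# drop the explicit pin again - the MFS reference still protects the blocks from GC
ipfs pin rm /ipfs/<CID>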

If you feel that having a history of installed packages stored locally in MFS is a viable solution, and that more people than you might want to use it, I'm fine with adding this as an option to the scripting around pacman.

The idea would be to hold either all versions, the installed versions, or just the last n versions of a package locally (in an MFS folder) and automatically remount it after changes have been made, so you can always just cd into it.

@RubenKelevra
Owner Author

RubenKelevra commented Feb 21, 2020

> If you feel that having a history of installed packages stored locally in MFS is a viable solution, and that more people than you might want to use it, I'm fine with adding this as an option to the scripting around pacman.
>
> The idea would be to hold either all versions, the installed versions, or just the last n versions of a package locally (in an MFS folder) and automatically remount it after changes have been made, so you can always just cd into it.

Addition:

In that case, please create a new ticket where you explain exactly what you need and why :)

> I know the recommended usage is via FUSE, but I'd rather sync via HTTP (and pay double in disk space) than mount the internet on my filesystem.

As already explained here, the integrated FUSE mounting isn't stable, so I won't use it. It's also rather slow.

I don't see any advantage in downloading via HTTP except the 'having a bunch of older versions locally' aspect.

If we can instead migrate those files into an MFS that is conveniently mounted all the time, we avoid copying data back and forth on the same filesystem, and you can also share the files with the network... this helps the cluster with traffic, compared to a non-local HTTP gateway.

@guysv

guysv commented Feb 21, 2020

> The HTTP gateways are IMHO not meant for downloading update binaries from IPFS.

Wait, why not?

> I just want to avoid people using the cluster through a public IPFS HTTP gateway like ipfs.io to update their servers, and then publicizing that as the optimal way to avoid censorship or do distributed updates, you know?

Well, this seems inevitable. Right now any third party can replicate the repo tree, add those db files where needed and perform upgrades from the new tree. Our cluster will still share resources, as the package tarball CIDs are still the same. The only way to counter such "abuse" is to start a private, exclusive IPFS network.

> Additionally, there need to be checks that the db file is not stuck in time, for example if the import got stuck, the project died, or the server importing into the cluster has an outage.

I don't think this will be a problem if you update where the IPNS name points only after a finished refresh. That way, partial updates are not possible. If the project dies, its users should become aware of that by subscribing to Arch's security advisories and noticing that patches are not being installed.

@guysv

guysv commented Feb 21, 2020

I'd also like to comment that while updating via HTTP is less space-efficient, it's simpler than tricking pacman into downloading packages from the cache by mounting pacman's cache directory onto the repo tree. Using IPFS as a mirror also makes it easy to fall back to normal mirrors if the IPFS one underperforms.
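
For illustration, a mirrorlist along these lines - a sketch; the local gateway address and the fallback mirror URL are assumptions:

# /etc/pacman.d/mirrorlist - hypothetical ordering: IPFS gateway first, plain mirror as fallback
Server = http://127.0.0.1:8080/ipns/pkg.pacman.store/arch/x86_64/default/$repo
Server = https://mirror.example.org/archlinux/$repo/os/$arch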

Also this is how victorb's arch-on-ipfs project used to work.

@RubenKelevra
Owner Author

> > The HTTP gateways are IMHO not meant for downloading update binaries from IPFS.
>
> Wait, why not?

It's a convenient feature for users, but the gateway operators offer this service for free while paying for the traffic. Causing a huge amount of traffic for them just to replace traditional static HTTP servers, which already do exactly this job, isn't fair use IMHO. I will never actively support this type of usage.

The gateway is also just a convenience for users: it lets them browse the archive structure via automatic HTTP redirects and manually download specific files and databases from snapshots very easily.

> > I just want to avoid people using the cluster through a public IPFS HTTP gateway like ipfs.io to update their servers, and then publicizing that as the optimal way to avoid censorship or do distributed updates, you know?
>
> Well, this seems inevitable. Right now any third party can replicate the repo tree, add those db files where needed and perform upgrades from the new tree. Our cluster will still share resources, as the package tarball CIDs are still the same. The only way to counter such "abuse" is to start a private, exclusive IPFS network.

In that case, the cluster data will probably get blacklisted in the long term, because it's no longer considered fair use if thousands of machines download their updates purely through this web service. And then the convenience of easily browsing this repo in a web browser is gone, too.

It's like browsing the Wikipedia mirror via the gateway versus scraping it completely with an HTTP downloader - fetching half a petabyte from the gateway instead of just using IPFS to pin it directly.

As I stated above, I won't help in any way with usage that goes beyond what I consider fair use of a freely available service.

> > Additionally, there need to be checks that the db file is not stuck in time, for example if the import got stuck, the project died, or the server importing into the cluster has an outage.
>
> I don't think this will be a problem if you update where the IPNS name points only after a finished refresh.

The IPNS records are constantly republished as long as the server is running (this is a necessity, like reproviding the content you offer), but republishing doesn't change what they point to.
They have a lifetime of 96 h, so if the server has an outage, the records become invalid for IPFS after that time. Note that the cache time is MUCH shorter, just 5 minutes: after 5 minutes the record has to be re-fetched, and if a new version has been published, the new version will be used.
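
In go-ipfs terms that corresponds roughly to publishing like this (a sketch; the CID is a placeholder):

# publish a record valid for 96h, with a 5-minute cache hint
ipfs name publish --lifetime 96h --ttl 5m /ipfs/<new-snapshot-CID>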

> That way, partial updates are not possible.

True, they are not. I only update the record to a new version after a sync has completed successfully. And the old version (by CID) will still be available in the cluster for at least 2 months.

> If the project dies, its users should become aware of that by subscribing to Arch's security advisories and noticing that patches are not being installed.

That doesn't change the fact that using this service has security implications. :)

> I'd also like to comment that while updating via HTTP is less space-efficient, it's simpler than tricking pacman into downloading packages from the cache by mounting pacman's cache directory onto the repo tree. Using IPFS as a mirror also makes it easy to fall back to normal mirrors if the IPFS one underperforms.

Yes, that's completely true. But the alternative is to just use IPFS with its local HTTP gateway to do exactly the same thing. Feel free to write a how-to for this. :)

> Also this is how victorb's arch-on-ipfs project used to work.

I think @victorb was just offering a proof of concept. Maybe this simply wasn't part of his considerations.

@RubenKelevra
Owner Author

I'd also like to point out that my approach is not to replace one centralized service with a different one.

Sure, it's CURRENTLY centralized, since the import into the cluster happens in one place. But it's not meant to stay that way. I hope what I said in the original discussion comes true in the long term:

> Every maintainer (trusted peer) can alter the pinset, while the mirrors spread the content reliably with guaranteed redundancy - without having to hold the whole repositories on each server.

ipfs/notes#84 (comment)

@guysv

guysv commented Feb 22, 2020

I think I didn't explain my rationale correctly. I will open a new ticket explaining my situation top-to-bottom.

@hsanjuan

> > The HTTP gateways are IMHO not meant for downloading update binaries from IPFS.
>
> Wait, why not?

I haven't read the whole thread, but you can just use the local gateway provided by the IPFS daemon on localhost:8080. That would be the workaround for mounting the whole thing locally.

@RubenKelevra
Owner Author

@hsanjuan true.

He was looking into running an IPFS Cluster follower in his network, while configuring the ipfs.io gateway for his local Arch Linux machines.

Regardless of his actual setup details, I will avoid the ipfs.io gateway in my setup recommendations, since replacing hundreds of update servers with the ipfs.io gateway doesn't work out - neither for the IPFS project nor for receiving updates quickly - and it isn't decentralized at all.

Getting the current CID of the IPNS name and mounting it is IMHO still the best solution, since it avoids downloading the data into the IPFS cache, reading it from disk again, pushing it locally through an HTTP connection, and writing it back to the same disk in the pacman cache.
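
A minimal sketch of that approach, assuming the (admittedly still experimental) FUSE mount of go-ipfs exposes /ipfs; the bind-mount target is an assumption:

# resolve the IPNS name to the current snapshot and bind-mount its cache
# folder over pacman's cache directory (requires 'ipfs mount' / FUSE)
path=$(ipfs name resolve pkg.pacman.store)    # e.g. /ipfs/<CID>
mount --bind "$path/arch/x86_64/default/cache" /var/cache/pacman/pkg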

@guysv

guysv commented Feb 27, 2020

What, no - I was looking at using my local follower's gateway, not ipfs.io's.
