Brainstorm ways to shrink RPM metadata #399

Open
dralley opened this issue Nov 13, 2023 · 5 comments

@dralley
Contributor

dralley commented Nov 13, 2023

#395 and other recent PRs have brought up the topic of shrinking RPM metadata once again.

I'm not thrilled with such approaches (I can live with them, but it's yak-shaving over just a few percent).

Therefore I'd like to have a discussion about potentially more meaningful approaches.

This ancient wiki page suggests excluding icon and documentation entries (e.g. /usr/share/doc, /usr/share/icons) from filelists.xml, given that they make up a huge proportion of the entries there and in practice should likely never be used as dependencies.
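To make the suggestion concrete, here is a minimal Python sketch of prefix-based exclusion. It is not how createrepo_c works today; the `EXCLUDED_PREFIXES` tuple and the `filter_filelist` helper are hypothetical names, and the prefix list is only the two directories named above.

```python
# Hypothetical sketch: drop file entries under directories that should
# never be used as dependencies. The prefix list is an assumption.
EXCLUDED_PREFIXES = ("/usr/share/doc/", "/usr/share/icons/")


def filter_filelist(paths, excluded=EXCLUDED_PREFIXES):
    """Return the paths that are not under any excluded directory."""
    # str.startswith accepts a tuple of prefixes.
    return [p for p in paths if not p.startswith(excluded)]


if __name__ == "__main__":
    sample = [
        "/usr/bin/foo",
        "/usr/share/doc/foo/README",
        "/usr/share/icons/hicolor/foo.png",
        "/usr/share/fonts/foo.ttf",
    ]
    print(filter_filelist(sample))
```

The trailing slash on each prefix keeps a path such as /usr/share/docbook from being excluded by accident.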

The data is compelling (though it is from 2010, so recomputing it would be useful):

- 2.4 million files total in pkgs in rawhide
- 2.3 million of those are in /usr
- 1.8 million of those are /usr/share
- Top 3 dirs by file count under /usr/share:
  - 533046 /usr/share/doc
  - 120555 /usr/share/javadoc
  - 105591 /usr/share/icons
- 45 file-requires requiring something in /usr/share
- none of those file-requires are in the top 3 /usr/share dirs
  - most of them are fonts.
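Recomputing these numbers could look roughly like the sketch below. It assumes the file paths have already been extracted from filelists.xml into a plain list; the `top_dirs` helper is a hypothetical name, and parsing the metadata itself is left out.

```python
from collections import Counter


def top_dirs(paths, root="/usr/share", n=3):
    """Count files per first-level directory under `root`; return the top n."""
    prefix = root.rstrip("/") + "/"
    counts = Counter()
    for path in paths:
        if not path.startswith(prefix):
            continue
        rest = path[len(prefix):]
        # Only count entries that sit inside a subdirectory of root,
        # not files placed directly in root itself.
        if "/" in rest:
            counts[prefix + rest.split("/", 1)[0]] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    sample = [
        "/usr/share/doc/a/README",
        "/usr/share/doc/b/README",
        "/usr/share/icons/x.png",
        "/usr/bin/ls",
    ]
    for directory, count in top_dirs(sample):
        print(count, directory)
```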

This 6-year-old discussion brings up the same point:

AIUI @james-antill did some analysis versus Debian and he concluded that the "file dependencies" were a major part of the wire size. And yes holy cow, I just looked at a filelists.xml. I think my vote there would be to only do file entries for "entrypoints" like /usr/bin - there's really no sane scenario where an RPM package should Require: /usr/share/doc/GeographicLib-doc/html/C/annotated.html or whatever.

It also makes a second suggestion:

One idea I had is to "presolve" - a lot of this data is completely redundant dependencies. Take this chunk from the very first package I looked at, 0ad:

<rpm:requires>
  <rpm:entry name="libstdc++.so.6()(64bit)"/>
  <rpm:entry name="libstdc++.so.6(CXXABI_1.3)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(CXXABI_1.3.5)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(CXXABI_1.3.8)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(CXXABI_1.3.9)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.11)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.14)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.15)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.18)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.19)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.20)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.21)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.9)(64bit)"/>

...

But those are all provides of the libstdc++ package - and I don't think we're ever going to have different symbol versions provided by separate packages.

So doing a pass where we just drop redundant requires would probably make a notable difference.
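One way such a "presolve" pass could be sketched, assuming a provides index built from a single repository snapshot: requires that resolve to exactly the same set of providers are collapsed to one representative. The `presolve` function and the `providers` mapping are hypothetical, and this is only a sketch of the idea, not an existing createrepo_c feature. Note that the collapsed metadata is only equivalent relative to that snapshot; against a repo carrying an older libstdc++, the dropped symbol-version requires would no longer be redundant.

```python
def presolve(requires, providers):
    """Drop requires whose provider set duplicates an earlier require.

    requires:  iterable of capability strings, in metadata order.
    providers: dict mapping capability -> set of package names (assumed
               to be computed from one repository snapshot).
    """
    kept = []
    seen_provider_sets = set()
    for cap in requires:
        provs = frozenset(providers.get(cap, ()))
        if not provs:
            # Unresolvable here: keep it so a solver can still report it.
            kept.append(cap)
            continue
        if provs in seen_provider_sets:
            continue  # same candidates as a require we already kept
        seen_provider_sets.add(provs)
        kept.append(cap)
    return kept


if __name__ == "__main__":
    reqs = [
        "libstdc++.so.6()(64bit)",
        "libstdc++.so.6(CXXABI_1.3)(64bit)",
        "libstdc++.so.6(GLIBCXX_3.4)(64bit)",
    ]
    index = {cap: {"libstdc++"} for cap in reqs}
    print(presolve(reqs, index))
```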

@dralley
Contributor Author

dralley commented Nov 13, 2023

@Conan-Kudo I know you had strong feelings on this a few years ago, what are your thoughts?

I know that lazy filelists downloading makes the subject less relevant for Fedora 40+, but if there's an obvious win here we should still take it.

@m-blaha
Member

m-blaha commented Nov 13, 2023

According to the Fedora packaging guidelines, files outside of /usr/bin and /etc should not be used as requirements anyway, and files from /usr/bin and /etc are already part of the primary metadata. I'd be really happy if depsolving never needed filelists. Actually, there are currently very few packages (in Fedora; I'm not sure how the situation is in third-party repos) that depend on such files, and issues have lately been filed for them to drop such dependencies.
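As a rough illustration of that split, the sketch below checks which file requires could be satisfied from primary metadata alone. The rule in `is_primary_file` is a simplification taken from the description above (/usr/bin and /etc only); the actual rule used by createrepo_c may be broader, and both function names are hypothetical.

```python
def is_primary_file(path):
    """Simplified assumption: only /etc and /usr/bin entries are in primary."""
    return path.startswith(("/etc/", "/usr/bin/"))


def requires_needing_filelists(file_requires):
    """Return the file requires that cannot be resolved from primary alone."""
    return [p for p in file_requires if not is_primary_file(p)]


if __name__ == "__main__":
    reqs = ["/usr/bin/python3", "/etc/passwd", "/usr/share/fonts/foo.ttf"]
    print(requires_needing_filelists(reqs))
```

If that list is empty for every package in a repository, depsolving against it should not need filelists at all under this simplified rule.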

My only occasional use-case for filelists is "Which package provides this file?" (dnf provides /this/file/i/need), and for this reason I would prefer filelists.xml contained all the files.
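That lookup is essentially a reverse index from file path to package, which is why it only works if every file is listed. A minimal sketch, assuming the filelists have been loaded into a dict of package name to paths (the helper names are hypothetical and this is not how dnf implements it):

```python
from collections import defaultdict


def build_file_index(filelists):
    """Build a path -> set of package names index from {package: [paths]}."""
    index = defaultdict(set)
    for package, paths in filelists.items():
        for path in paths:
            index[path].add(package)
    return index


def what_provides(index, path):
    """Return the sorted package names that ship `path` (empty if none)."""
    return sorted(index.get(path, ()))


if __name__ == "__main__":
    idx = build_file_index({
        "bash": ["/usr/bin/bash"],
        "bash-doc": ["/usr/share/doc/bash/README"],
    })
    print(what_provides(idx, "/usr/share/doc/bash/README"))
```

If documentation entries were dropped from filelists.xml, the second lookup above would come back empty.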

@Conan-Kudo
Member

Third party repos tend to rely on file dependencies more because RPM distributions do not agree on packaging conventions. Fedora packaging guidelines should be ignored from an upstream RPM stack perspective (createrepo_c, dnf, etc.).

@Conan-Kudo
Member

> I know that lazy filelists downloading makes the subject less relevant for Fedora 40+, but if there's an obvious win here we should still take it.

Nobody has yet implemented lazy downloading. I've asked for this to be considered and provided a conceptual path to doing so, but nobody has responded to my comments about it.

@j-mracek
Contributor

> > I know that lazy filelists downloading makes the subject less relevant for Fedora 40+, but if there's an obvious win here we should still take it.
>
> Nobody has yet implemented lazy downloading. I've asked for this to be considered and provided a conceptual path to doing so, but nobody has responded to my comments about it.

I've summarized information about implementation of lazy loading of filelists in rpm-software-management/dnf5#1053.
