
Threaded FS IO probably should be async now? #4159

Open
rami3l opened this issue Jan 17, 2025 · 3 comments

@rami3l
Member

rami3l commented Jan 17, 2025

Being able to use async across the codebase makes it possible to use async constructs where relevant. rustup has many places where that can be useful: downloading channel metadata, downloading distribution content, and unpacking that content to disk are all work that could usefully proceed in parallel - and async provides a good abstraction for that.

We already have a sophisticated disk IO layer that accommodates various latency-inducing OS behaviours, and adapting that to async without any regressions could be very interesting too - but for now, it coexists nicely with an async core.
#3367

We are still using an async-unaware threadpool here, and it's known to produce problems (#3125):

/// Threaded IO model: A pool of threads is used so that syscall latencies
/// due to (nonexhaustive list) Network file systems, virus scanners, and
/// operating system design, do not cause rustup to be significantly slower
/// than desired. In particular the docs workload with 20K files requires
/// very low latency per file, which even a few ms per syscall per file
/// will cause minutes of wall clock time.

Anyway, using RUSTUP_IO_THREADS=1 to limit concurrency feels a bit off. Will migrating to async help?
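For illustration only, here is a hedged sketch of how an async model could bound concurrency in place of RUSTUP_IO_THREADS, using tokio and the futures crate; `unpack_one` and the limit of 8 are made-up stand-ins, not rustup's actual API:

```rust
use futures::stream::{self, StreamExt};

// Hypothetical stand-in for unpacking a single item; not rustup's actual API.
async fn unpack_one(path: String) -> std::io::Result<()> {
    tokio::fs::write(&path, b"contents").await
}

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let entries: Vec<String> = vec!["a.txt".into(), "b.txt".into(), "c.txt".into()];
    // buffer_unordered(8) bounds how many unpack operations are in flight at once,
    // playing the role RUSTUP_IO_THREADS plays for the thread pool today.
    let mut inflight = stream::iter(entries).map(unpack_one).buffer_unordered(8);
    while let Some(result) = inflight.next().await {
        result?;
    }
    Ok(())
}
```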

@djc
Contributor

djc commented Jan 17, 2025

IMO the code we have for this looks very complicated/scary. Rewriting it looks like a project, and it would probably be hard to know whether we've brought over all the desired properties. async could be helpful, but it won't make this easy by itself IMO.

@rami3l
Member Author

rami3l commented Jan 17, 2025

Ah, I do remember the days when installing Rust took a long time to finish: #1876

(cc the original author, if I'm not mistaken, @rbtcollins)

@rami3l rami3l added this to the On Deck milestone Jan 18, 2025
@rbtcollins
Contributor

rbtcollins commented Jan 20, 2025

Yah, so the underlying thing here is that rustup does tasks that look very much like offensive action to heuristic detectors that are wired into the syscall path on Windows; and it also does a large number of operations that have data dependencies. And we have to run in very resource-constrained environments, but also unpack very large files.

For the first case, more details:

  • we write executable files, which includes HTML for docs
  • these get scanned at CloseHandle time by a Defender filter driver except when the non-user-settable scan-on-open setting is set. Microsoft have opted rustup into that setting, though it might pay to check it's still active - I don't install Rust on Windows every day :).
  • even when inline scanning isn't happening, CloseHandle latency is much higher than close latency because of the differing IO models in Windows (process owns dirty pages) vs Linux / Un*x (pagecache owns dirty pages).

So for this case what we do is quite easy: we make sure that CloseHandle is in a thread of its own and continue with other IO while that takes place. tokio's io threads would be entirely suitable for this.
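As a hedged sketch (not rustup's current code), pushing the final close onto tokio's blocking pool so a slow CloseHandle doesn't stall the rest of the pipeline might look like this; the helper name is made up:

```rust
use std::io::Write;

// Hedged sketch: perform the write and the final close on tokio's blocking pool,
// so a slow CloseHandle (e.g. an on-close antivirus scan) doesn't stall the
// async pipeline driving the rest of the unpack.
async fn write_and_close(path: std::path::PathBuf, contents: Vec<u8>) -> std::io::Result<()> {
    tokio::task::spawn_blocking(move || {
        let mut file = std::fs::File::create(&path)?;
        file.write_all(&contents)?;
        drop(file); // CloseHandle happens here, on a blocking thread
        Ok(())
    })
    .await
    .expect("blocking close task panicked")
}
```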

Second case:

Creating directories is not free, particularly when RUSTUP_HOME is on NFS (note that that bug was closed with a workaround, but we continued to improve the IO code so that it is genuinely fixed today, even with docs being installed).
Rust distributions have hundreds or thousands of directories (docs alone is 20K files with not many per directory, back in 2019 or so). Files can't be written to directories until the directory exists, which creates a data dependency.

Most files are very small.

To unpack efficiently when mkdir is not instant, we want to decompress as many files as will fit in RAM, and write them to disk as soon as the directory they belong in is created. We don't want to create or assert directory existence more than once per directory. Just waiting for the directory to be created leads to stalls where rustup does nothing and excessive wall clock times.
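Structurally, that dispatch logic might be sketched like this, with hypothetical types and names rather than rustup's actual implementation:

```rust
use std::collections::{HashMap, HashSet};
use std::path::{Path, PathBuf};

// Structural sketch with made-up types: decompressed file contents are queued
// under their parent directory, each directory is created/asserted at most once,
// and queued writes are flushed as soon as their directory is known to exist.
#[derive(Default)]
struct WriteDispatcher {
    ready_dirs: HashSet<PathBuf>,                        // directories known to exist
    pending: HashMap<PathBuf, Vec<(PathBuf, Vec<u8>)>>,  // writes waiting on a mkdir
}

impl WriteDispatcher {
    fn submit(&mut self, path: PathBuf, contents: Vec<u8>) -> std::io::Result<()> {
        let dir = path.parent().unwrap_or(Path::new(".")).to_path_buf();
        if self.ready_dirs.contains(&dir) {
            std::fs::write(&path, &contents) // directory already exists: write now
        } else {
            // Park the decompressed bytes until mkdir_done(dir) is reported.
            self.pending.entry(dir).or_default().push((path, contents));
            Ok(())
        }
    }

    fn mkdir_done(&mut self, dir: PathBuf) -> std::io::Result<()> {
        for (path, contents) in self.pending.remove(&dir).unwrap_or_default() {
            std::fs::write(&path, &contents)?;
        }
        self.ready_dirs.insert(dir);
        Ok(())
    }
}
```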

tar files are not required to be topologically ordered: there is no guarantee that the file path 'a/b' will be preceded by a directory 'a'. Well-behaved compressors will do this of course, but malicious tars can also mess with things by violating this expectation, and if I recall correctly old Rust artifacts actually fail this too. Most extractors - and rustup does this - end up taking the pragmatic approach of just implicitly creating directories when e.g. a/b is encountered. We need to guard against malicious tars in case a privileged user runs rustup. For this we use the *at style of syscalls (not yet on Windows, from memory - that would be a nice improvement and would shave some kernel path-processing time).
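A hedged sketch of what *at-style, root-relative extraction can look like, using the cap-std crate purely for illustration (rustup's actual code does its own checking; the paths here are made up):

```rust
use std::io;
use std::path::Path;
use cap_std::ambient_authority;
use cap_std::fs::Dir;

// Every path is resolved relative to the unpack root's directory handle,
// component by component, so a malicious entry such as `../../etc/passwd`
// cannot escape the root.
fn unpack_entry(root: &Dir, rel_path: &str, contents: &[u8]) -> io::Result<()> {
    // Implicitly create parent directories, mirroring the pragmatic handling
    // of tars that are not topologically ordered.
    if let Some(parent) = Path::new(rel_path).parent() {
        if !parent.as_os_str().is_empty() {
            root.create_dir_all(parent)?;
        }
    }
    root.write(rel_path, contents)
}

fn main() -> io::Result<()> {
    // The unpack root here is purely illustrative and must already exist.
    let root = Dir::open_ambient_dir("unpack-root", ambient_authority())?;
    unpack_entry(&root, "share/doc/rust/html/index.html", b"<html></html>")
}
```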

Third case:

Raspberry Pi machines have less memory than the LLVM lib that Rust ships, which is 500+ MB in size. We can't process that in memory as one chunk; instead we would need to stream in the compressed archive and stream out the output, which works completely fine - but runs into the aforementioned performance issues. However, Raspberry Pi machines don't run Windows (case 1), and are typically local developer setups, so not (case 2). Thus on Raspberry Pi, accepting the pipeline stalls and not using the code that addresses case 1 or case 2 makes a lot of sense.
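A hedged sketch of that single-threaded, non-pipelined fallback, using the xz2 and tar crates purely for illustration (not rustup's actual code path; the paths are made up):

```rust
use std::fs::File;
use std::io;

// Stream the compressed archive straight through decompression and tar
// extraction so only small fixed buffers are ever held in memory.
fn unpack_streaming(archive: &str, dest: &str) -> io::Result<()> {
    let compressed = File::open(archive)?;
    let decompressed = xz2::read::XzDecoder::new(compressed);
    let mut tar = tar::Archive::new(decompressed);
    tar.unpack(dest) // extracts entry by entry, never buffering file content ahead of the writes
}
```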

We do have a memory-capped buffer: we have a few different sizes of IO buffer, and store content to be written in an IO buffer for later dispatch. When there are no buffers available (e.g. because the sum of IO buffers exceeds the supplied or inferred size limit), we stop decompressing the tar and wait for an IO buffer to become available.
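A minimal sketch of such a memory-capped scheme (hypothetical names and structure, not rustup's actual pool): the decompressor asks for a buffer and blocks until the byte budget allows it, and the IO side returns the bytes once a write has been dispatched.

```rust
use std::sync::{Condvar, Mutex};

struct BufferPool {
    budget: Mutex<usize>, // bytes still available under the cap
    freed: Condvar,
}

impl BufferPool {
    fn new(limit: usize) -> Self {
        Self { budget: Mutex::new(limit), freed: Condvar::new() }
    }

    /// Block until `size` bytes fit under the cap, then hand out a buffer.
    fn acquire(&self, size: usize) -> Vec<u8> {
        let mut budget = self.budget.lock().unwrap();
        while *budget < size {
            budget = self.freed.wait(budget).unwrap();
        }
        *budget -= size;
        vec![0u8; size]
    }

    /// Return a buffer's bytes to the budget once its write has completed.
    fn release(&self, buf: Vec<u8>) {
        *self.budget.lock().unwrap() += buf.len();
        self.freed.notify_all();
    }
}
```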

Memory pressure:

Our minimum footprint is then:

  • the rustup binary
  • various cached metadata like the configuration
  • the list of work we want to do (updates/removals etc)
  • the allocated IO buffers
  • the work queue of IO pending dispatch once a directory is created or a file is opened
  • the cache of created directories used to avoid double-handling directories
  • the decompression buffer for the archive being installed. Note that this can be immense: the larger the window the compressor looked back over when compressing, the larger the look-back buffer the decompressor needs. See "Memory usage" in https://linux.die.net/man/1/xz.

The only one of those that can be tuned is the total IO buffer size. The heuristic we use is based on actual physical memory - a guess that we took.
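A hedged sketch of that heuristic; the total physical memory is passed in rather than queried, and the fraction and floor are illustrative guesses, not rustup's actual numbers:

```rust
// The IO-buffer cap is derived from physical RAM unless the user supplies a limit.
fn io_buffer_budget(user_limit: Option<usize>, total_physical_memory: usize) -> usize {
    user_limit.unwrap_or_else(|| {
        // Leave headroom for the binary, metadata, work queues and the
        // decompressor's look-back window; only part of RAM goes to IO buffers.
        (total_physical_memory / 2).max(16 * 1024 * 1024)
    })
}
```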

I think it's entirely likely that decompression memory is why we keep seeing issues here: using a minimal-compression-window archive and including the maximum memory footprint of the decompressor in the memory accounting layer is the most realistic way to make rPi installation trouble-free and fast. Alternatively, defaulting to single-threaded, non-pipelined logic for rPi would also mitigate most of the problem by never buffering file content.

Much of this is also discussed in this talk I gave.
