Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[path-] fix undercounted progress for multibyte chars #2323

Open
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

midichef
Copy link
Contributor

@midichef midichef commented Feb 19, 2024

For text files encoded with more than one byte per character, FileProgress undercounts loading progress.

To demonstrate, you can use a UTF-32 file, where every character takes 4 bytes:

seq 1000001 | iconv -t UTF-32 >! progress.utf32.tsv
vd --encoding=utf-32 progress.utf32.tsv

The progress only goes up to 25%, not 100%.

That's because read() progress is counting the characters, but the goal is measured in bytes. The file is around 7 million characters long, but when encoded in UTF-32, it is 28 million bytes, so even at the end, 7 million/28 million becomes 25%.

This PR changes FileProgress to track progress as bytes.

visidata/path.py Outdated Show resolved Hide resolved
@midichef midichef force-pushed the progress_undercount branch from 2a45fd5 to 612fea8 Compare March 4, 2024 02:57
@midichef midichef force-pushed the progress_undercount branch from 612fea8 to ecc8628 Compare March 4, 2024 03:30
@midichef
Copy link
Contributor Author

midichef commented Mar 4, 2024

ecc8628 estimates character length and uses that to estimate progress. It fixes the progress for UTF-16/UTF-32 as discussed. It also fixes it for UTF-8. For example, with a sample UTF-8 dataset of mostly Thai characters, loading progress was too slow by a factor of 2.5. Progress would max out at 40% instead of 100%.

The progress estimator samples characters every so often, to estimate the average bytes per character. Right now it samples more characters early in the file. That's because it needs more samples to come up with a decent estimate of byte length early.

It doesn't slow down code much. For a UTF-8 tsv file with 10 million short lines (seq 10000000), it's within 1% of the running time of the develop branch. For a 2.7GB UTF-32 tsv file with 6 million rows, this change increases mean loading time by approximately 4%. That's higher than I expect. I would like to bring it down to below 1%. I will look into it more, but not soon. So this is ready for review.

@midichef
Copy link
Contributor Author

midichef commented Mar 4, 2024

A related bug in v3.0.2. Progress on compressed textfiles was overcounted. It would pass 100% and go to 200-600%.
338b250 fixes it.
To reproduce: seq 10000000 |gzip -c >! repro_gz_progress.gz; vd repro_gz_progress.gz
The progress meter will pass 100% around the 5,000,000th row.

@saulpw
Copy link
Owner

saulpw commented May 16, 2024

Thanks for doing the work on this, @midichef! I put together #2407 which might be a simpler way to do a few things. I'm not super-excited about some things in my PR, but between the two let's find the common ground.

and let's take 338b250 regardless.

anjakefala pushed a commit that referenced this pull request May 18, 2024
anjakefala pushed a commit that referenced this pull request May 18, 2024
@anjakefala
Copy link
Collaborator

Hi @midichef! Could you review the most recent commit? We removed the batching to simplify the code a bit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants