Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Code Quality / Tiding #390

Merged
merged 16 commits into from
Jan 30, 2024
Merged

Improve Code Quality / Tiding #390

merged 16 commits into from
Jan 30, 2024

Conversation

iamcarbon
Copy link
Collaborator

  • Simplifies string.Join calls (eliminates various array allocations)
  • Uses new GetString polyfill
  • Removes Empty class (using empty collection expression instead)
  • Eliminates various byte allocation
  • Updates ShouldAcceptList to use utf-8 bytes
  • Updates StartsWithJpegPreamble to accept ReadOnlySpan
  • Switches on integer values to avoid a string allocation

@iamcarbon
Copy link
Collaborator Author

@drewnoakes Ready for review. This is the last of the changes proposed for 2.9.

Copy link
Owner

@drewnoakes drewnoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Thanks for fixing all these. Left some comments. See what you think and we'll get this merged.

@@ -89,17 +91,24 @@ public static class Iso2022Converter

foreach (var encoding in encodings)
{
char[] charBuffer = ArrayPool<char>.Shared.Rent(encoding.GetMaxCharCount(bytes.Length));
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. There are some other places we could use ArrayPool too actually, like in the reader classes. I'll investigate that separately for 2.9.0, unless you'd like to.

Copy link
Collaborator Author

@iamcarbon iamcarbon Jan 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also some wins introducing a few new specialized readers that can operate directly over ReadOnlySpan.

One callout, would be replacing SequentialByteArrayReader with an optimized ref struct {LittleEndian/BigEndian}BufferReader(ReadOnlySpan buffer) -- and spanifying the outer method.

This would allow us to operate directly over a span, and eliminate the reader allocation.

There's another big win eliminating all the temporary array allocations when we make (int tagName, params string[] descriptions) calls.

Copy link
Owner

@drewnoakes drewnoakes Jan 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more use of ArrayPool in #392.

There's definitely room for improvement in the reader classes. Check out the PRs from @kwhopper for some more ideas.

My vague idea here is to build on the new span types and @kwhopper's investigations, and actually avoid doing as much parsing work during the extraction phase. Many directories we store could actually be backed by a byte[] (or Memory<byte>) that could be inspected when enumerating through tags. This would be quite a big change, and requires some research before it could be pursued.

There's another big win eliminating all the temporary array allocations when we make (int tagName, params string[] descriptions) calls.

This sounds promising. We'd need to verify that the compiler doesn't allocate an array behind the scenes.

EDIT: It seems to do the right thing on modern .NET: https://sharplab.io/#v2:EYLgtghglgdgPgAQEwEYCwAoBBmABAlANnyVwGFcBvTXW/PA4hAFlwFkUAKAJQFMIAJgHkYAGwCeAZQAOEGAB4CABgB8+FEoDOAShp1qGOrgC+e2mfrqmrNkk67D+i0Y6cA2gCIAgh4A0uDwAhPwCyDwBdbQBuC1MMYyA===

...but not on .NET Framework: https://sharplab.io/#v2:EYLgHgbALAPgAgJgIwFgBQcDMACOSK4LYDC2A3utlbjngXFNgLJIAUASgKYCGAJgPIA7ADYBPAMoAHboIA8eAAwA+XEgUBnAJSVqFNNWwBfHVRM1V9RkwStt+3WYMtWAbQBEAQTcAabG4BCPn7EbgC6mgDcZsZohkA==

This is a compiler feature, so we'd need to test netstandard2.1 csc output, which sharplab doesn't support afaik.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's definitely room for improvement in the reader classes. Check out the PRs from @kwhopper for some more ideas.

Thanks. RandomAccessStream+ReaderInfo is an experiment somewhat similar to this Span conversion. It goes a bit further by abstracting away all the buffering (RandomAccessStream) and "span-ifying" (ReaderInfo with byte arrays) entirely; callers then only have to worry about one kind of reader. Side effects are the ability to know your exact physical offset at any time, and support for streamed content.

If you can reach those same goals in this process, that's a great addition and should allow new things in the future. I can also check through the code in those old PR's to see if they could use Spans like you're doing here, if that has some value.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kwhopper the recent activity has been small incremental improvements. I see your PR as the kind of thing we want for 3.0. It'll be a lot of work to integrate throughout the code though, so we should noodle out the details with sketches and discussion before starting the work of integration. We only want to do that integration work once.

A direction I think would be good is to divide the parsing into two stages:

  1. Reading the file in coarse chunks. E.g. for JPEG, this could be just pulling out and labelling the segments we need. These would be allocated in contiguous chunks of memory, with no further processing. I think this phase could mostly be done sequentially. The chunks would remember their offsets relative to the start of the file too, along with whatever metadata is needed for later steps.
  2. As the consumer walks through the metadata, we process the chunks of data to produce the tags.

Currently we do both 1 and 2 during the read phase. I'm thinking that, with this, we'd just do step 1 during that phase, and step 2 during the enumeration. This will mean a lot less work and fewer allocations during the first phase, and when that work's done during the second phase, any allocations would be shorter-lived and therefore more likely to be GC'd quickly in gen0. It'd also allow consumers to skip decoding bits they don't actually care about.

I'm hoping to write this up a bit more comprehensively and would really appreciate your input.

Comment on lines +224 to 225
return Encoding.UTF8.GetString(values)
.Trim('\0', ' ', '\r', '\n', '\t');
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if we could trim a span rather than a string, but I don't think it's safe to trim these bytes in all UTF8 strings, and that trimming characters is better. It would be possible to use the Encoding to populate a Span<char> and trim that, but I don't think it's worth it.

My plan is to look at some traces and take a data-led approach to the next wave of optimisations. There's too much code to go through.

Copy link
Collaborator Author

@iamcarbon iamcarbon Jan 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also a good case for the ArrayPool, where we trim the char[] buffer, before materializing the string. Agree that we need to be careful operating on bytes when the string might have multi-byte codepoints.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also try to eliminate some of these silently allocated arrays when using functions that accept a params T[] array (as is the case above).

Copy link
Owner

@drewnoakes drewnoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing, cheers!

@drewnoakes drewnoakes merged commit 06b0bd0 into drewnoakes:main Jan 30, 2024
2 checks passed
@@ -651,15 +651,15 @@ public sealed class OlympusCameraSettingsMakernoteDescriptor(OlympusCameraSettin
if (Directory.GetObject(OlympusCameraSettingsMakernoteDirectory.TagGradation) is not short[] values || values.Length < 3)
return null;

var join = $"{values[0]} {values[1]} {values[2]}";
var ret = join switch
var ret = (values[0], values[1], values[3]) switch
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, this should have been a 2 instead of a 3. I'll push a fix shortly.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines -12 to +16
var sb = new StringBuilder(4);
sb.Append((char)reader.GetByte());
sb.Append((char)reader.GetByte());
sb.Append((char)reader.GetByte());
sb.Append((char)reader.GetByte());
return sb.ToString();
Span<byte> bytes = stackalloc byte[4];

reader.GetBytes(bytes);

return Encoding.ASCII.GetString(bytes);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This causes a behaviour change for some inputs, though I've only seen it on fuzzed files that contain very weird data:

image

The ASCII encoding replaces some characters with ? which potentially loses information. According to https://en.wikipedia.org/wiki/FourCC non-printable characters are valid. I'm not sure it's a problem in practice, but I think I'll make a change here to restore the old behaviour.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants