Bit packing #162

Open: wants to merge 8 commits into master
Conversation

@mclearn commented Feb 14, 2021

This PR primarily covers permitting the input data to be bit-packed. We often find that vendors provide us with byte-oriented data, single-bit-packed data, and sometimes nibble-packed (4-bit) data. It seems that bit-packed data should be a valid input format rather than forcing all data (especially multi-bit-oriented data) to be transformed prior to input.

Several changes were made to help this along, and the majority of them are in shared/utils.h.
Of note, the following larger changes:

  1. In shared/utils.h, the read_file and read_file_subset functions were refactored into a single function. They were doing the same thing and therefore duplicating logic; the only difference was the subset length and index passed to one versus the other.
  2. As part of the read_file_subset refactor, more complete error handling was added in the typical goto-cleanup fashion, making error handling and resource recovery less error-prone (a generic sketch of this pattern follows this list).
  3. The semantics of rawsymbols, symbols and bsymbols are more thoroughly documented in the function. This is important because originally the function assumed the input byte data was the same as rawsymbols (which is not the case for bit-packed data).
  4. In shared/utils.h, a change was made to the seed() function to permit deterministic seeds instead of relying on /dev/urandom thereby aiding in deterministic regression testing.
  5. In the transpose_main.cpp code, bit-packed data is transposed by exploiting the bsymbols data.
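
For readers unfamiliar with the pattern referenced in item 2, here is a minimal, generic sketch of goto-style cleanup (illustrative only, assuming a simplified whole-file read; it is not the PR's actual read_file_subset code):

```cpp
#include <cstdio>
#include <cstdlib>

/* Illustrative only: read an entire file into a heap buffer. Every early
 * exit jumps to a single cleanup path, so whatever was acquired so far
 * (file handle, buffer) is always released exactly once. */
static unsigned char *read_whole_file(const char *path, size_t *out_len) {
    FILE *fp = NULL;
    unsigned char *buf = NULL;
    long len;

    fp = fopen(path, "rb");
    if (fp == NULL) goto err;
    if (fseek(fp, 0, SEEK_END) != 0) goto err;
    len = ftell(fp);
    if (len < 0) goto err;
    if (fseek(fp, 0, SEEK_SET) != 0) goto err;

    buf = (unsigned char *)malloc((size_t)len);
    if (buf == NULL) goto err;
    if (fread(buf, 1, (size_t)len, fp) != (size_t)len) goto err;

    fclose(fp);
    *out_len = (size_t)len;
    return buf;

err:
    if (fp != NULL) fclose(fp);
    free(buf);          /* free(NULL) is a no-op */
    return NULL;
}
```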

Technically, the bit-unpacking code in read_file_subset() will operate correctly on any sample bit width, but it was deliberately restricted to 1, 2, 4, or 8 bits because of the ambiguity in sample alignment for other widths.
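
As a rough sketch of what that restriction buys (illustrative only, not the PR's actual read_file_subset() code): for widths of 1, 2, 4, or 8 bits, samples never straddle a byte boundary, so expanding packed input into one symbol per byte reduces to a simple per-byte loop. Msb-first ordering within each byte is one possible convention and is assumed here.

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch: expand packed bytes into one symbol per output byte.
// bits_per_symbol must be 1, 2, 4, or 8; symbols are taken msb-first within
// each input byte.
static std::vector<uint8_t> unpack_symbols(const std::vector<uint8_t> &packed,
                                           unsigned bits_per_symbol) {
    std::vector<uint8_t> out;
    const unsigned per_byte = 8 / bits_per_symbol;
    const uint8_t mask = (uint8_t)((1u << bits_per_symbol) - 1u);

    out.reserve(packed.size() * per_byte);
    for (uint8_t b : packed) {
        for (unsigned i = 0; i < per_byte; i++) {
            unsigned shift = 8 - bits_per_symbol * (i + 1); // msb-first
            out.push_back((uint8_t)((b >> shift) & mask));
        }
    }
    return out;
}
```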

A test harness was built to help ensure that bit-oriented input data behaves the same as byte-oriented (masked) data.

Even if the bit-packing portion of this PR is considered superfluous, some of these fixes should be considered since they simplify the code and make unit testing a little easier.

(Note that there were a number of whitespace changes due to use of space-tab expansions.)

@joshuaehill (Contributor) commented:

Some general comments (I'll try to do a proper review some time tomorrow):

  1. These changes would all be dramatically simpler if your code only actually supported block lengths of 1, 2, 4, and 8. There's a lot of error-prone state-management code that just doesn't need to be there (as you point out!). I would suggest dumping all the complexity that enables that (disabled) feature, and just making a simple for loop that works on a per-byte basis. (This alternate approach is discussed and rejected in a comment, but there isn't really any explanation why.)
  2. It would seem that you hardcode an endianness convention (data is packed from msb to lsb). The nice thing about conventions is that there are so many to choose from, and data using a different ordering convention still has to be converted to the one-symbol-per-byte ordering. In my experience, this is both important and easy to let slip through the cracks in an evaluation, so it's useful to force someone to make an explicit decision when moving between packed bytes and smaller symbols. I'm concerned that supporting exactly one convention will be interpreted as "it's going to magically do the right thing", which is very much not true in some situations. One could note that the conversion from wide symbols to their binary string representation (bsymbols) also makes this assumption; this is true, but these are rather different contexts. The conversion of a wide symbol to a string of binary symbols doesn't need to retain any specific ordering convention, so one is free in that circumstance to make a convention choice. In the packed data case, the ordering is important. (A short sketch of the two conventions follows this list.)
  3. The file format convention was already obnoxious for restart data (and the code is correspondingly opaque) before this change. Using packed data (particularly when implemented as generally as it is here) makes this considerably more complicated, and makes it considerably harder for the vendor to produce this data correctly.
  4. Your help messages suggest that using packed byte data is "Currently not compatible with -l parameter", and using these together results in an error message. I'm not sure why this functionality was left out (indeed, it is fairly straightforward, as you have already limited the possible packed symbol size!)
  5. It would seem better to just make a command line option to force a deterministic seed, rather than introducing an entirely different interface that isn't used elsewhere in this tool.
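
To make point 2 concrete, here is a small, hypothetical example (not code from either proposal) showing that the same packed byte decodes to different 2-bit symbol sequences depending on whether symbols are taken msb-first or lsb-first, which is why the ordering decision has to be explicit:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical example for 2-bit symbols: the byte 0x1B (binary 00011011)
// decodes to different sequences under the two packing conventions.
int main() {
    const uint8_t b = 0x1B;

    // msb-first: bits 7..6, 5..4, 3..2, 1..0  ->  0 1 2 3
    for (int i = 3; i >= 0; i--)
        printf("%u ", (b >> (2 * i)) & 0x3u);
    printf("(msb-first)\n");

    // lsb-first: bits 1..0, 3..2, 5..4, 7..6  ->  3 2 1 0
    for (int i = 0; i < 4; i++)
        printf("%u ", (b >> (2 * i)) & 0x3u);
    printf("(lsb-first)\n");
    return 0;
}
```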

@mclearn (Author) commented Feb 16, 2021

Hi Joshua. Thanks for the initial review. Let me rework some of this. The original intention behind the bit-packing format was to be just as opinionated as the original input format requirements, to avoid trying to be everything for everyone. I had originally thought of something a bit more flexible to deal with the wide range of potential input formats, but decided it was better to keep it slightly simpler. Allow me to present a rough outline of that idea.

Using a reworked -p which accepts a required value, we can declare the bit layout. For example: ... -p 11223344 ... input.bin. The idea is that we define a 2-bit sample size, and the -p parameter defines how the bits are packed; zero is reserved for padding/ignoring. Therefore sample one is bits 6 and 7, and sample four is bits 0 and 1. If I wanted a four-bit sample in big-endian format, I could write -p 11112222; in little-endian, -p 22221111. If you wanted to replicate the existing input behaviour (upper bits are ignored for sub-8-bit samples), then -p 00001111 would read one 4-bit sample per input byte. Given this, the positional bits_per_sample argument would be unnecessary when the -p parameter is used. The bit layout within a sample isn't defined, which isn't any different from the existing input format requirements (and, from what you mention, doesn't seem to matter so much in this case as long as the samples are consistently arranged).
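
A minimal sketch of how such a layout string could be interpreted, assuming the -p semantics proposed above (the flag, the function name, and the sample-number limit of 1..9 are all illustrative, not an implemented feature):

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch of the proposed -p layout string (e.g. "11223344"):
// each character assigns one bit of the input byte (msb first) to a sample
// number, with '0' meaning the bit is ignored. Returns the samples extracted
// from a single packed byte, in sample-number order.
static std::vector<uint8_t> extract_samples(uint8_t byte, const char *layout) {
    uint8_t value[10] = {0};   // accumulated value for sample numbers 1..9
    bool seen[10] = {false};

    // layout[0] describes bit 7, layout[7] describes bit 0 (msb-first).
    for (int pos = 0; pos < 8 && layout[pos] != '\0'; pos++) {
        int s = layout[pos] - '0';
        if (s < 1 || s > 9) continue;                 // '0' = padding/ignored bit
        int bit = (byte >> (7 - pos)) & 1;
        value[s] = (uint8_t)((value[s] << 1) | bit);  // append bit to that sample
        seen[s] = true;
    }

    std::vector<uint8_t> samples;
    for (int s = 1; s <= 9; s++)
        if (seen[s]) samples.push_back(value[s]);
    return samples;
}

// Example: with layout "11223344" and the byte 0x1B (binary 00011011),
// this yields the samples 0, 1, 2, 3.
```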
