CPU and Memory Optimizations #56

yuzawa-san · 2024-07-23T00:55:03Z

I found that CPU and memory usage in the decode hot path is very high, so I did some light refactoring to alleviate this.

Added a BitString and BitStringBuilder for efficient bitstring operations.
Made the substring operation zero-copy (BitString slices the underlying data).
BitString also removes the need to validate 0/1 using Pattern.
Optimized base64 decoding by assembling a reverse-dictionary lookup table from char to BitString (previously it was a HashMap from boxed Character to Integer, whose values then had to be converted to a bit string).
Reduced the number of substring operations, e.g. FixedIntegerEncoder.decode(bitString, fromIndex, length).
Made FixedBitfieldEncoder return a BitString directly, which fulfills List<Boolean>. This is a lot smaller than the ArrayList<Boolean>.
Used StringBuilder in more places.
Presized things whose exact or approximate size is known.
Used more constants. NOTE: I used String constants in the + usages since that is optimized (in JDK 8) to a StringBuilder. With a char constant, it appears those chars have to be converted to Strings each time; in later JDKs this is not needed.
NOTE: the encode flow could technically use the BitString too, but I held off on that for now.
Added an IntegerCache which is large enough to contain all of the vendor IDs in the global vendor list. This cut down on a lot of allocations.
Added JMH microbenchmarking.
Made constants final.
Did more presizing in segment initializeData.
Made GppModel more DRY and used switch statements instead of if/elses.
Converted containsKey + get calls to a single get call with a null check. This avoids double Map reads.
Made initializeSegments more memory efficient with Arrays.asList or Collections.singletonList.
Used CharSequence for zero-copy string splits.
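
To illustrate the zero-copy slicing and the decode(bitString, fromIndex, length) idea from the list above, here is a minimal sketch. It is not the BitString class from this PR: the name BitStringSketch, the boolean[] backing store, and the methods of, substring, and decodeInt are all assumptions made for illustration.

```java
import java.util.AbstractList;

// Hypothetical sketch only; the PR's real BitString may differ in every detail.
final class BitStringSketch extends AbstractList<Boolean> {
    private final boolean[] bits; // backing array, shared by all slices
    private final int offset;     // start of this view within the array
    private final int length;     // number of bits visible through this view

    private BitStringSketch(boolean[] bits, int offset, int length) {
        this.bits = bits;
        this.offset = offset;
        this.length = length;
    }

    // Parses '0'/'1' characters, validating each one inline instead of with a regex.
    static BitStringSketch of(String s) {
        boolean[] parsed = new boolean[s.length()];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '1') {
                parsed[i] = true;
            } else if (c != '0') {
                throw new IllegalArgumentException("Invalid bit '" + c + "' at index " + i);
            }
        }
        return new BitStringSketch(parsed, 0, parsed.length);
    }

    @Override
    public Boolean get(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return bits[offset + index]; // autoboxes to the cached Boolean.TRUE/FALSE
    }

    @Override
    public int size() {
        return length;
    }

    // Zero-copy slice: the new view shares the backing array and only adjusts offset/length.
    BitStringSketch substring(int from, int to) {
        if (from < 0 || to > length || from > to) {
            throw new IndexOutOfBoundsException("range " + from + ".." + to);
        }
        return new BitStringSketch(bits, offset + from, to - from);
    }

    // Decodes an unsigned big-endian integer straight from a range, with no intermediate slice.
    int decodeInt(int fromIndex, int len) {
        if (len < 0 || len > 31 || fromIndex < 0 || fromIndex + len > length) {
            throw new IndexOutOfBoundsException("range " + fromIndex + "+" + len);
        }
        int value = 0;
        for (int i = 0; i < len; i++) {
            value = (value << 1) | (bits[offset + fromIndex + i] ? 1 : 0);
        }
        return value;
    }
}
```

Because the sketch extends AbstractList<Boolean>, a slice can be handed out wherever a List<Boolean> is expected without allocating an ArrayList<Boolean>, and the mutating List methods throw UnsupportedOperationException by default.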

microbenchmark results:

before
Benchmark                                  Mode  Cnt        Score      Error   Units
MyBenchmark.decodeGpp                     thrpt   25     3425.619 ±  103.495   ops/s
MyBenchmark.decodeGpp:gc.alloc.rate       thrpt   25     6099.516 ±  186.780  MB/sec
MyBenchmark.decodeGpp:gc.alloc.rate.norm  thrpt   25  1867059.215 ± 4110.174    B/op
MyBenchmark.decodeGpp:gc.count            thrpt   25     2632.000             counts
MyBenchmark.decodeGpp:gc.time             thrpt   25     2700.000                 ms

after
Benchmark                                  Mode  Cnt      Score     Error   Units
MyBenchmark.decodeGpp                     thrpt   25  19205.372 ± 485.226   ops/s
MyBenchmark.decodeGpp:gc.alloc.rate       thrpt   25   1037.076 ±  26.204  MB/sec
MyBenchmark.decodeGpp:gc.alloc.rate.norm  thrpt   25  56624.003 ±   0.001    B/op
MyBenchmark.decodeGpp:gc.count            thrpt   25    866.000            counts
MyBenchmark.decodeGpp:gc.time             thrpt   25    842.000                ms

Seems to be around 5.6x faster than the last released version, and it allocates about 97% less memory per operation.
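
The ratios can be re-derived from the two tables above. The snippet below is just a worked calculation using the decodeGpp throughput and gc.alloc.rate.norm scores; the class and method names are made up for illustration.

```java
// Hypothetical helper for re-deriving the summary figures from the JMH tables.
final class BenchmarkRatios {
    // Throughput ratio: how many times more operations per second after the change.
    static double speedup(double beforeOpsPerSec, double afterOpsPerSec) {
        return afterOpsPerSec / beforeOpsPerSec;
    }

    // Relative reduction, in percent, of a "lower is better" metric such as bytes per op.
    static double reductionPercent(double before, double after) {
        return (1.0 - after / before) * 100.0;
    }

    public static void main(String[] args) {
        // Scores copied from the before/after tables (ops/s and B/op).
        System.out.printf("throughput: %.2fx%n", speedup(3425.619, 19205.372));
        System.out.printf("allocation per op: %.1f%% lower%n",
                reductionPercent(1867059.215, 56624.003));
    }
}
```

With those inputs the throughput ratio comes out at roughly 5.6 and the per-operation allocation drops by roughly 97%.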

ad-hoc benchmark code against https://github.com/InteractiveAdvertisingBureau/iabtcf-java (partially ported into JMH)

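import com.iab.gpp.encoder.field.TcfEuV2Field;
import com.iab.gpp.encoder.section.TcfEuV2;
import com.iabtcf.decoder.TCString;

public class TcfBench {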
	
	private static final String in = "CQCDewAQCDewAPoABABGA9EMAP-AAB4AAIAAKVtV_G__bXlv-X736ftkeY1f9_h77sQxBhfJs-4FzLvW_JwX32EzNE36tqYKmRIAu3bBIQNtHJjUTVChaogVrzDsak2coTtKJ-BkiHMRe2dYCF5vmwtj-QKZ5vr_93d52R_t_dr-3dzyz5Vnv3a9_-b1WJidK5-tH_v_bROb-_I-9_5-_4v8_N_rE2_eT1t_tevt739-8tv_9___9____7______3_-ClbVfxv_215b_l-9-n7ZHmNX_f4e-7EMQYXybPuBcy71vycF99hMzRN-ramCpkSALt2wSEDbRyY1E1QoWqIFa8w7GpNnKE7SifgZIhzEXtnWAheb5sLY_kCmeb6__d3edkf7f3a_t3c8s-VZ792vf_m9ViYnSufrR_7_20Tm_vyPvf-fv-L_Pzf6xNv3k9bf7Xr7e9_fvLb__f___f___-______9__gAAAAA.QKVtV_G__bXlv-X736ftkeY1f9_h77sQxBhfJs-4FzLvW_JwX32EzNE36tqYKmRIAu3bBIQNtHJjUTVChaogVrzDsak2coTtKJ-BkiHMRe2dYCF5vmwtj-QKZ5vr_93d52R_t_dr-3dzyz5Vnv3a9_-b1WJidK5-tH_v_bROb-_I-9_5-_4v8_N_rE2_eT1t_tevt739-8tv_9___9____7______3_-.IKVtV_G__bXlv-X736ftkeY1f9_h77sQxBhfJs-4FzLvW_JwX32EzNE36tqYKmRIAu3bBIQNtHJjUTVChaogVrzDsak2coTtKJ-BkiHMRe2dYCF5vmwtj-QKZ5vr_93d52R_t_dr-3dzyz5Vnv3a9_-b1WJidK5-tH_v_bROb-_I-9_5-_4v8_N_rE2_eT1t_tevt739-8tv_9___9____7______3_-";


	public static void main(String[] args) {
		// Loop forever so the profiler can sample a steady-state workload.
		while (true) {
			// iabtcf-java: decode the TC string and read the main fields.
			TCString old = TCString.decode(in);
			old.getPubPurposesConsent();
			old.getPurposesConsent();
			old.getVendorConsent();
			old.getPurposesLITransparency();
			old.getVendorLegitimateInterest();
			old.getSpecialFeatureOptIns();
			old.getCmpId();
			old.getPublisherRestrictions();
			// iab-gpp: decode the same string, read the equivalent fields, then re-encode.
			TcfEuV2 nu = new TcfEuV2(in);
			nu.getPublisherConsents();
			nu.getPurposeConsents();
			nu.getVendorConsents();
			nu.getPurposeLegitimateInterests();
			nu.getVendorLegitimateInterests();
			nu.getSpecialFeatureOptins();
			nu.getCmpId();
			nu.getPublisherRestrictions();
			nu.setFieldValue(TcfEuV2Field.CMP_ID, 14);
			nu.encode();
		}
	}

}

Profiled using async-profiler: asprof -d 20 -e cpu,alloc -f ~/Desktop/dump16.jfr TcfBench

Flame Charts:

memory before:

memory after (note how the TCString parse was a small teal sliver in the before graph, but the iab-gpp portion has shrunk so much that the TCString parse is now a larger percentage of the icicle chart):

cpu before:

cpu after:

Future Ideas (not in PR). I'll likely open issues to discuss:

List<Integer> is still a little bulky in the charts above. I was thinking of maybe making a specialty class backed by int[] like https://github.com/InteractiveAdvertisingBureau/iabtcf-java/blob/master/iabtcf-decoder/src/main/java/com/iabtcf/utils/IntIterable.java, but that would likely break API stability.
Should the fields map have enum keys instead of String keys? An EnumMap would be a lot more lightweight.
Is too much public that should not be public?
Should List be replaced with Set? I feel Set contains is something worth optimizing (i.e. is this vendor, or are these vendors, present in the set?).
Are the defensive copies in some getValue implementations necessary? Could we achieve the same by returning read-only versions (Collections.unmodifiableList)?
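
As a starting point for that discussion, here is a rough sketch of what an int[]-backed, read-only List<Integer> could look like. Everything in it is hypothetical: the SortedIntList name, the choice to sort once at construction so that contains can use binary search, and the containsInt helper are assumptions, not part of this PR or of iabtcf-java.

```java
import java.util.AbstractList;
import java.util.Arrays;
import java.util.RandomAccess;

// Hypothetical sketch: a compact, read-only list of ints kept in ascending order.
final class SortedIntList extends AbstractList<Integer> implements RandomAccess {
    private final int[] values; // private sorted copy, so callers cannot mutate it

    SortedIntList(int[] values) {
        this.values = values.clone(); // one copy at construction, none on reads
        Arrays.sort(this.values);
    }

    @Override
    public Integer get(int index) {
        return values[index]; // boxed only when an element is actually read
    }

    @Override
    public int size() {
        return values.length;
    }

    // Set-like membership check in O(log n) instead of a linear scan.
    @Override
    public boolean contains(Object o) {
        return o instanceof Integer && containsInt((Integer) o);
    }

    // Primitive variant that avoids boxing the argument.
    boolean containsInt(int value) {
        return Arrays.binarySearch(values, value) >= 0;
    }
}
```

Since AbstractList leaves add, set, and remove throwing UnsupportedOperationException, a getter could return such an instance directly instead of making a defensive copy. The trade-off is that elements are always presented in ascending order, which may or may not match what existing callers expect.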

fixes #25

supersedes #45

ChristopherWirt · 2024-07-25T10:05:36Z

This looks great 👍

yuzawa-san · 2024-08-21T17:53:57Z

@iabmayank @chuff can you please take a look at this?

yuzawa-san · 2024-11-06T00:28:25Z

@iabmayank @chuff i have rebased off of the most recent master

yuzawa-san mentioned this pull request Jul 23, 2024

Reduce the number of object allocations on the decode/encode hotpath #45

Open

yuzawa-san force-pushed the cpu-memory-optimizations branch from 5e4b971 to 02f53c4 Compare July 25, 2024 16:34

yuzawa-san mentioned this pull request Jul 25, 2024

High CPU Consumption #25

Open

This was referenced Oct 15, 2024

Fail fast #55

Open

Tx fl or mt #57

Open

AntoxaAntoxic approved these changes Oct 15, 2024

yuzawa-san added 3 commits November 5, 2024 18:49

cpu and memory optimizations

c990104

zero copy split

8afa986

optimize fl,mt,or,tx

c30373e

yuzawa-san force-pushed the cpu-memory-optimizations branch from 8bac901 to c30373e Compare November 6, 2024 00:27
