Skip to content

Latest commit

 

History

History
110 lines (95 loc) · 5 KB

README.md

File metadata and controls

110 lines (95 loc) · 5 KB

XOR vs std::swap

There is an old bit twiddling trick that uses XOR to swap two integer values. Theoretically, this optimization saves one register, and on older compilers with worse register allocators, a whole stack allocation.

void xor_swap(int& x, int& y) {
  x = y ^ x;
  y = x ^ y;
  x = y ^ x;
}

However, is this trick still relevant today? Modern processors make use of Tomasulo's algorithm to rename registers on the fly and avoid stalls from write after read or write after write hazards.

To test the performance of the xor trick, we use a swap-heavy algorithm like bubble-sort. Unfortunately, clang is too smart on my machine and compiles both variants of the code to efficiently use the stp (store pair of registers) instruction. The solution is to write inline assembly for this functionality.

#define xor_swap(x, y) \
  { asm("eor %0, %1, %0; eor %1, %0, %1; eor %0, %1, %0;" : "+Kr" (x), "+Kr" (y)); }

Gross. For the uninitiated, %0 and %1 refer to the output operands that are specified after the colon. +Kr (x) means we want a register (r) that we can read and write to (+), be used with 32-bit logical instructions (K), and is where the variable x is stored.

Anyways, we can now run the algorithm on different length random number arrays to get a result. For random number generation, we read from /dev/urandom. This may be slower than reading a small seed for a pseudo random number generator (PRNG), but it works.

Below are the results. std::swap gets the win by a significant margin even on the relatively small input of 50,000 4-byte integers.

We can disassemble our programs to double check the inline assembly is not messing up a compiler optimization:

$ clang -S -DSWAP_FUNCTION=std::swap -O2 main.cpp -o std::swap.s
$ clang -S -DSWAP_FUNCTION=xor_swap -O2 main.cpp -o xor_swap.s
$ git diff std::swap.s xor_swap.s

The "-S" flag makes clang emit assembly instead of machine code. The relevant diff is below. The lines starting with "+" are lines present in xor_swap, but not in std::swap.s and vice versa for minus.

...
 	bl	_clock_gettime_nsec_np
@@ -73,35 +73,37 @@ Lloh1:
 	b	LBB0_9
 LBB0_8:                                 ;   in Loop: Header=BB0_9 Depth=1
 	sub	x8, x8, #1
-	tbz	w11, #0, LBB0_15
+	tbz	w12, #0, LBB0_14
 LBB0_9:                                 ; =>This Loop Header: Depth=1
-                                        ;     Child Loop BB0_13 Depth 2
+                                        ;     Child Loop BB0_12 Depth 2
 	cmp	x8, #2
-	csel	x12, x8, x10, hi
+	csel	x11, x8, x10, hi
 	tst	x8, #0xfffffffe
-	b.eq	LBB0_15
+	b.eq	LBB0_14
 ; %bb.10:                               ;   in Loop: Header=BB0_9 Depth=1
-	mov	w11, #0
-	sub	x12, x12, #1
-	ldr	w13, [x19]
-	mov	x14, x9
-	b	LBB0_13
-LBB0_11:                                ;   in Loop: Header=BB0_13 Depth=2
-	stp	w15, w13, [x14, #-4]
-	mov	w11, #1
-LBB0_12:                                ;   in Loop: Header=BB0_13 Depth=2
-	add	x14, x14, #4
-	subs	x12, x12, #1
+	mov	w12, #0
+	sub	x11, x11, #1
+	ldr	w15, [x19]
+	mov	x13, x9
+	b	LBB0_12
+LBB0_11:                                ;   in Loop: Header=BB0_12 Depth=2
+	add	x13, x13, #4
+	mov	x15, x14
+	subs	x11, x11, #1
 	b.eq	LBB0_8
-LBB0_13:                                ;   Parent Loop BB0_9 Depth=1
+LBB0_12:                                ;   Parent Loop BB0_9 Depth=1
                                         ; =>  This Inner Loop Header: Depth=2
-	ldr	w15, [x14]
-	cmp	w13, w15
-	b.gt	LBB0_11
-; %bb.14:                               ;   in Loop: Header=BB0_13 Depth=2
-	mov	x13, x15
-	b	LBB0_12
-LBB0_15:
+	ldr	w14, [x13]
+	cmp	w15, w14
+	b.le	LBB0_11
+; %bb.13:                               ;   in Loop: Header=BB0_12 Depth=2
+	; InlineAsm Start
+	eor	x15, x14, x15	; eor x14, x15, x14; eor x15, x14, x15;
+	; InlineAsm End
+	stp	w15, w14, [x13, #-4]
+	mov	w12, #1
+	b	LBB0_11
+LBB0_14:
 	mov	w0, #12
 	bl	_clock_gettime_nsec_np
...

If you look closely, you will see that there are some renamed registers and labels, but they are almost identical programs with one having some extra xor instructions. In fact, the same stp instruction is used, just with a different order of operands.

The conclusion is that you should write the simplest code possible and not waste time including these bit twiddling tricks that only serve to confuse the reader. Modern optimizing compilers don't even emit the right assembly for these snippets. Even embedded processors running code compiled with the lowest optimization level use scoreboarding or other methods to abstract over registers, so I doubt these benchmark results would differ much by any architecture or processor made in the last decade.