Scalar evaluation is essentially done via binary matmuls. This is a bit hard to see, but it is effectively what happens after the vmap transformation.
Generally, kernels are not optimized for binary matmul, so one could consider using floating-point matmul instead.
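A rough sketch of the floating-point idea in JAX, for discussion only. It assumes "binary matmul" here means OR over AND (for XOR over AND one would take the counts modulo 2 instead of thresholding). The function names and the small test matrices are hypothetical and not taken from the codebase.

```python
import jax
import jax.numpy as jnp

def bool_matmul_reference(a, b):
    # out[i, j] = OR over k of (a[i, k] AND b[k, j]), written out directly.
    return jnp.any(a[:, :, None] & b[None, :, :], axis=1)

@jax.jit
def bool_matmul_via_float(a, b):
    # Cast the 0/1 entries to float32 and use an ordinary matmul.
    # Each output entry then counts the k for which both inputs are 1;
    # thresholding at zero recovers the OR. Counts stay exact in float32
    # as long as the contraction dimension is below 2**24.
    counts = a.astype(jnp.float32) @ b.astype(jnp.float32)
    return counts > 0

# Hypothetical example inputs.
a = jnp.array([[1, 0, 1],
               [0, 0, 0]], dtype=bool)
b = jnp.array([[0, 1],
               [1, 1],
               [0, 0]], dtype=bool)

ref = bool_matmul_reference(a, b)
out = bool_matmul_via_float(a, b)
```

Whether this is actually faster depends on the backend and the shapes involved, so it would need to be measured rather than assumed.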
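An additional strategy is bit-packing, which would also save significant memory: if each binary value currently occupies a full byte, packing eight values per byte cuts the footprint by up to 8x.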
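A minimal sketch of what bit-packing could look like in JAX, again assuming OR over AND semantics. It packs both operands along the contraction axis with `jnp.packbits` and replaces the matmul by a bytewise AND followed by a nonzero check. All names and shapes below are hypothetical, and this is not tuned for performance.

```python
import jax
import jax.numpy as jnp

def pack_lhs(a):
    # a: bool (m, k) -> uint8 (m, ceil(k / 8)); trailing bits are zero-padded.
    return jnp.packbits(a, axis=1)

def pack_rhs(b):
    # b: bool (k, n) -> uint8 (ceil(k / 8), n), packed along the same k axis.
    return jnp.packbits(b, axis=0)

def unpack_lhs(pa, k):
    # Inverse of pack_lhs; drops the zero padding beyond the first k bits.
    return jnp.unpackbits(pa, axis=1)[:, :k].astype(bool)

def packed_bool_matmul(pa, pb):
    # A shared set bit in any byte means a[i, k] AND b[k, j] holds for some k.
    return jnp.any((pa[:, :, None] & pb[None, :, :]) != 0, axis=1)

# Hypothetical example inputs; k = 10 is deliberately not a multiple of 8.
key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
a = jax.random.bernoulli(key_a, 0.3, (2, 10))
b = jax.random.bernoulli(key_b, 0.3, (10, 3))

pa, pb = pack_lhs(a), pack_rhs(b)
out_packed = packed_bool_matmul(pa, pb)
out_ref = jnp.any(a[:, :, None] & b[None, :, :], axis=1)
```

Here the packed operands use 2 bytes per row or column along the contraction axis instead of 10. The broadcasted intermediate in `packed_bool_matmul` is itself large, so a real implementation would likely need a tiled or fused kernel.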
Both options should definitely be profiled as part of this ticket before committing to either.