Currently, we save and restore the FPU state in the following places:
- `linux.cc` - syscall
- `core/elf.cc` - elf_resolve_pltgot
- `arch/x64/exceptions.cc` - interrupt and general protection
- `arch/x64/mmu.cc` - page fault
- `arch/x64/signal.cc` - when calling signal handler
Typically this is done by calling `xsave` and `xrstor`, which are fairly expensive instructions. Their total cost can be measured indirectly by commenting out the FPU lock code in the `linux.cc` syscall function and running the `misc-syscall-perf` test with and without that change. The difference between the two runs is on average 70 ns.
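To make the "measure indirectly" idea concrete, here is a minimal, generic sketch of the arithmetic: time the same call path with and without the save/restore and subtract the per-call averages. This is not how `misc-syscall-perf` is implemented; `average_ns` and `per_call_overhead_ns` are hypothetical helper names made up for illustration.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: average wall-clock cost of one call to fn, in nanoseconds.
template <typename Fn>
double average_ns(Fn&& fn, std::uint64_t iterations) {
    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        fn();
    }
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> elapsed = end - start;
    return elapsed.count() / static_cast<double>(iterations);
}

// Hypothetical helper: the cost attributed to the extra work is the difference
// between the two averages (with FPU save/restore vs. without it).
template <typename With, typename Without>
double per_call_overhead_ns(With&& with, Without&& without, std::uint64_t iterations) {
    return average_ns(with, iterations) - average_ns(without, iterations);
}
```

The result is only as good as the noise floor of the two runs, so a difference in the tens of nanoseconds needs many iterations and a quiet machine to be meaningful.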
It would be nice to figure out how we can take advantage of `xsaveopt` and other instructions to speed this up. For details, please see this excellent article.
Unfortunately, it is not clear exactly how we would need to change the FPU save/restore code to take advantage of these instructions. As far as I understand, the FPU state needs to be saved to the same memory location each time for `xsaveopt` to work correctly. But how exactly would this work across multiple threads?
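One possible direction, sketched under assumptions: as I read the Intel SDM, the processor tracks the address of the area most recently used by `xrstor`, and `xsaveopt` may skip components that were not modified since that restore when saving back to the same address. If that reading is right, one save area per thread (rather than one global) should still be eligible. The sketch below is not OSv code; `fpu_save_area`, `tls_area`, the 4096-byte size, the x87/SSE/AVX-only mask, and all function names are hypothetical. The area is first filled by a plain `xsave` so it is fully valid before any `xrstor`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>
#define HAVE_XSAVE_SKETCH 1
#endif

// Hypothetical save area. XSAVE needs 64-byte alignment; 4096 bytes is an
// assumed bound for x87+SSE+AVX only (real code would size it from CPUID leaf 0xD).
struct alignas(64) fpu_save_area {
    unsigned char bytes[4096] = {};
};

constexpr std::uint32_t kMaskLo = 0x7;  // x87 | SSE | AVX (assumption: enough for the demo)
constexpr std::uint32_t kMaskHi = 0;

// One area per thread, so each thread always saves to and restores from its own address.
thread_local fpu_save_area tls_area;

bool cpu_has_xsaveopt() {
#ifdef HAVE_XSAVE_SKETCH
    unsigned a = 0, b = 0, c = 0, d = 0;
    if (!__get_cpuid(1, &a, &b, &c, &d)) return false;
    if (!(c & (1u << 27))) return false;               // OSXSAVE: OS enabled XSAVE
    if (!__get_cpuid_count(0xd, 1, &a, &b, &c, &d)) return false;
    return (a & 1u) != 0;                              // CPUID.(0xD,1):EAX[0] = XSAVEOPT
#else
    return false;
#endif
}

#ifdef HAVE_XSAVE_SKETCH
// noinline keeps each instruction behind a call boundary, where no vector
// registers are live according to the SysV calling convention.
__attribute__((noinline)) void fpu_prime(fpu_save_area& a) {
    asm volatile("xsave64 %0" : "+m"(a) : "a"(kMaskLo), "d"(kMaskHi) : "memory");
}
__attribute__((noinline)) void fpu_save(fpu_save_area& a) {
    asm volatile("xsaveopt64 %0" : "+m"(a) : "a"(kMaskLo), "d"(kMaskHi) : "memory");
}
__attribute__((noinline)) void fpu_restore(const fpu_save_area& a) {
    asm volatile("xrstor64 %0" : : "m"(a), "a"(kMaskLo), "d"(kMaskHi) : "memory");
}
#endif

// XSTATE_BV is the first 8 bytes of the XSAVE header, at offset 512.
std::uint64_t xstate_bv(const fpu_save_area& a) {
    std::uint64_t v = 0;
    std::memcpy(&v, a.bytes + 512, sizeof v);
    return v;
}

// Returns true if the save/restore round trip ran on this thread's own area.
bool round_trip_on_this_thread() {
#ifdef HAVE_XSAVE_SKETCH
    if (!cpu_has_xsaveopt()) return false;
    assert(reinterpret_cast<std::uintptr_t>(&tls_area) % 64 == 0);
    fpu_prime(tls_area);    // full xsave: area is valid for xrstor
    fpu_restore(tls_area);  // last-restored address is now this thread's area
    fpu_save(tls_area);     // xsaveopt may skip components unchanged since the restore
    fpu_restore(tls_area);
    return (xstate_bv(tls_area) & ~std::uint64_t{kMaskLo}) == 0;
#else
    return false;
#endif
}
```

Whether the optimization actually kicks in after a context switch is a separate question, since another thread's `xrstor` from a different address would presumably reset the tracking; that would need measuring rather than assuming.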
Also, I am not sure how much performance gain we would see in practical terms. I did hack the FPU save/restore code in the syscall function in `linux.cc` to use a global FPU state variable, and I could see the 70 ns drop to ~50 ns.
Some other relevant notes: