Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

performance: (~3x) optimize field multiplier and EC-adder for improved MSM and ECNTT #693

Merged
merged 2 commits into from
Dec 12, 2024

Conversation

yshekel
Copy link
Collaborator

@yshekel yshekel commented Dec 12, 2024

Describe the changes

  1. fix: ECNTT 5<logn<14 was single threaded. Now multi threaded.

  2. This PR is using inlining and loop unrolling to improve CPU for:

  • scalar multiplication
  • scalar * ECpoint
  • EC addition
  • and consequently ECNTT and and MSM.

ECNTT (MULTI-THREAD) {size=2^14, curve=BN254}:

  • For i9-13900k ubuntu22, clang-14: ~1200ms -> ~450ms (~2.6x)
  • For m1-pro macos-15, clang-16: ~3500ms -> ~1200ms (~2.9x)

MSM (SINGLE-THREAD!) BLS12-377 logn=20:

  • For i9-13900k ubuntu22, clang-14: ~32s -> ~10s (~3.2x)
  • For m1-pro macos-15, clang-16: ~36.5S -> 13.3s (~2.7x)

MSM (MULTI-THREAD!) BLS12-377 logn=20:

  • For i9-13900k ubuntu22, clang-14: ~1.9s -> ~1.9s (~1x)
  • For m1-pro macos-15, clang-16: ~5500ms -> ~3100ms (~1.8x)

CONCLUSION: ICICLE msm is bound by the manager thread. This is why for single thread we see 3X but for multi-thread mac (8-10 cores) improves 1.8x but i9 (32 cores) show no improvement. @LeonHibnik @mickeyasa @Koren-Brand

@yshekel yshekel requested a review from ShanieWinitz December 12, 2024 14:35
Copy link
Contributor

@LeonHibnik LeonHibnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!
Left a small comment

icicle/include/icicle/fields/field.h Show resolved Hide resolved
} else {
uint32_t scalar_size = sizeof(S);
// for small scalars, the threshold for when it is faster to use parallel NTT is higher
if ((scalar_size >= 32 && (logn + log_batch_size) <= 13) || (scalar_size < 32 && (logn + log_batch_size) <= 16)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question: does this depend on cpu type?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yshekel yshekel changed the title fix bug in CPU backend ECNTT for logn<14 improve field multiplier and EC adder for improved MSM and ECNTT Dec 12, 2024
@yshekel yshekel changed the title improve field multiplier and EC adder for improved MSM and ECNTT performance: (~3x) optimize field multiplier and EC-adder for improved MSM and ECNTT Dec 12, 2024
@yshekel yshekel merged commit bef4371 into main Dec 12, 2024
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants