
Create a valid Neoverse N1 target. #623

Open · wants to merge 1 commit into master
Conversation

everton1984 (Author)

This PR adds a valid Arm Neoverse N1 compilation target using Armv8 kernels. It creates the appropriate registry information and can autodetect an N1 CPU.
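For reference, BLIS sub-configurations are registered in the top-level config_registry file, which maps a configuration name to the kernel set(s) it builds against. The entry implied by this description would presumably look something like the sketch below, analogous to the existing thunderx2 entry; the exact name and kernel mapping used in the PR may differ:

# illustrative config_registry entry, not copied from the PR
neoversen1:  neoversen1/armv8a

Here the neoversen1 sub-configuration would carry the N1-specific blocksize registration while reusing the existing Armv8 kernel set, matching the description above.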


// Initialize level-3 blocksize objects with architecture-specific values.
//                                               s      d      c      z
bli_blksz_init_easy( &blkszs[ BLIS_MR ],         8,     6,    -1,    -1 );
egaudry
Hi, that's great to see Neoverse N1 tuning. Can I ask how you came up with these blocksize values?

everton1984 (Author)

Hi! To be honest, I just wanted the compiler to generate tuned neoverse-n1 code with this patch, so the blocksize values were taken from thunderx2. If BLIS has a standard procedure for generating those values, I am all for it; please just let me know.

egaudry

I value what you did; however, I don't have the answer to this.
@devinamatthews, any pointers you could share?

everton1984 (Author)

@egaudry Do you think the fine-tuning is essential to merge?

egaudry

Having a clear interface and arch detection makes sense indeed; however, without proper tuning, mergers/reviewers might not see this as a priority.
Just guessing.

jeffhammond (Member)

Jeff Diamond has better tuning parameters for N1.

everton1984 (Author)

@jeffhammond Thanks for commenting. Could you please point me to Jeff Diamond so I can ask whether he is able to share his parameters?

devinamatthews (Member)

"The establishment" here. @everton1984 thanks for your work but @egaudry is pretty much right; it is best to have specifically-tuned block sizes and/or kernels with performance numbers before creating a new sub-configuration. Otherwise it is just easier to use the thunderx2 subconfig directly. I'll ask Jeff Diamond on the status of the tuned N1 parameters since that code may still be in the clutches of Oracle's lawyers.

everton1984 (Author)

@devinamatthews Thanks for answering. No problem, it makes sense. I can generate the parameters; I just wanted to know, before trying something ad hoc, whether there is a defined procedure for obtaining them.

devinamatthews (Member) commented Apr 7, 2022

The block sizes can, to some extent, be determined analytically; see https://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf. A basic non-analytical strategy is:

  1. Run a series of problems with m=MR, n=NR, and increasing k (a rough timing sweep for this step is sketched below). Note that you will need to use a row- or column-major C matrix as preferred by the microkernel. Plot performance vs. k; the optimal KC should be:
    a. The peak of the plot if the curve is sharply peaked.
    b. The smallest value at which good performance is achieved if the plot has a large plateau.
  2. Run a series of problems with n=NR, k=KC, and increasing m (you might want to try different transpose options for A as well). As before, the optimal MC is either the peak or the smallest value that gives good performance.
  3. The value of NC doesn't usually affect performance much, but you can also try a procedure similar to the one for KC and MC. Note that NC should in general be fairly large compared to MC.
  4. Confirm performance for large square matrices and tweak as necessary. Finding the best threading parameters is another challenge, which I can perhaps describe separately if you're interested.

Final note: the block sizes must satisfy MC % MR == 0 and NC % NR == 0. If possible, it doesn't hurt to have all three cache block sizes be multiples of both MR and NR, unless this choice is too restrictive. It may also help to avoid large powers of 2.
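To make step 1 concrete, here is a minimal timing-sweep sketch (not part of the PR, and only one assumed way to drive the measurement): it fixes m = MR and n = NR, increases k, and reports GFLOP/s. It assumes BLIS was built with CBLAS enabled; the MR and NR values are placeholders to be replaced with the register blocking of the microkernel being tuned, and C is kept column-major here, so switch to row-major if that is what the microkernel prefers.

/* Sketch only (not part of this PR): sweep k at m = MR, n = NR and report
 * GFLOP/s for dgemm, per step 1 above. Assumes BLIS was configured with
 * --enable-cblas; the MR/NR values below are placeholders and should match
 * the register blocking of the microkernel being tuned.                  */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "cblas.h"

#define MR 8   /* placeholder m register blocking */
#define NR 6   /* placeholder n register blocking */

static double now_sec( void )
{
    struct timespec ts;
    clock_gettime( CLOCK_MONOTONIC, &ts );
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main( void )
{
    const int nrep = 1000;   /* repetitions per problem size */

    for ( int k = 64; k <= 4096; k += 64 )
    {
        /* Column-major A (MR x k), B (k x NR), C (MR x NR). */
        double* a = malloc( sizeof(double) * MR * k );
        double* b = malloc( sizeof(double) * k  * NR );
        double* c = calloc( MR * NR, sizeof(double) );
        for ( int i = 0; i < MR * k; i++ ) a[ i ] = (double)rand() / RAND_MAX;
        for ( int i = 0; i < k * NR; i++ ) b[ i ] = (double)rand() / RAND_MAX;

        double t0 = now_sec();
        for ( int r = 0; r < nrep; r++ )
            cblas_dgemm( CblasColMajor, CblasNoTrans, CblasNoTrans,
                         MR, NR, k, 1.0, a, MR, b, k, 0.0, c, MR );
        double elapsed = now_sec() - t0;

        /* 2*m*n*k flops per dgemm call, nrep calls in the timed region. */
        printf( "k = %5d  %7.2f GFLOP/s\n",
                k, 2.0 * MR * NR * k * nrep / elapsed / 1e9 );

        free( a ); free( b ); free( c );
    }
    return 0;
}

Compiling against the BLIS installation (for example, cc sweep_kc.c -lblis, adding -lpthread or -lm if the linker asks for them) and plotting GFLOP/s against k should show the peak or plateau described in step 1; the same skeleton, with k fixed at the chosen KC and m increasing, can be reused for the MC sweep in step 2.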

everton1984 (Author)

@devinamatthews Thanks a lot! Let me find the correct parameters then.

devinamatthews self-assigned this Nov 3, 2024