-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
exploiting symmetries #16
Comments
OK I agree with the work flow, but then we should first merge the ma00aa branch with master (I will do some final testing) and then create a new branch called something like "fastspec" and work on these matters from there. I will get this set tomorrow and let you know. |
@SRHudson wrote:
I performed a study with the l2stell.sp testcase, increasing Fourier resolution (Mpol=Ntor=4,6,8), measuring the total computation time, and comparing the two SPEC versions ("fast" is from ma00aa branch, "slow" is from master branch). Find attached the result of this exercise. l2stell_cputime_versus_fourier.pdf For example, at m=n=8, the slow code takes ~ 1558 sec and the fast code takes ~645 sec, a factor of ~2.5 speedup, which seems to remain constant with increasing fourier resolution. |
More over, along these runs I did profiling, and found that the ma00aa routine increases percentage of total cpu time as Fourier resolution increases. This means that code speedup relies on ma00aa-optimization even more at high Fourier resolution. To illustrate this point, here are some numbers: "slow spec" "fast spec" |
Comment One) What you are calling "slow spec" is actually non-stellarator-symmetric spec, so comparisons between "fast spec", which exploits stellarator symmetry and slow spec are not particularly meaningful. Recall that the only speed improvements that I made to ma00aa (beyond the re-looping and eliminating divisions as suggested by Sam) was to not compute the non-stellarator-symmetric terms, which were nowhere being used. Comment Two) It is interesting that at higher M, N that ma00aa takes a higher percentage of the total time. I will continue to work on this, as I have described earlier (by removing the calculation of DToocc, DToocs, DToosc and DTooss from ma00aa to preset). |
@SRHudson not sure I understand your Comment One,
Does this mean that you modified ma00aa.h in such a way that IF the system is stellarator symmetric THEN those terms are not calculated, but that in general these are still calculated and thus no much speedup is expected? |
Ultimately, it would be nice to determine a scaling law, e.g., It is very interesting that matrix seems to be taking a lot of time. There is NO computation performed in matrix. It is just loops and assigning matrix elements. I guess that by re-looping that the time of matrix can be reduced significantly. |
In As of last week, all the terms were being calculated regardless of Istellsym. All I did was to not compute the non-stellarator-symmetric terms if Istellsym = 1. For non-stellarator-symmetric equilibria, as you can see from |
@SRHudson OK I understand, although somehow I do not see any red terms when I open |
How are you doing the timing? |
So, do the results from the different timings agree? |
No the times do not agree. This is how I do it, you can try:
|
Ok, I made with CC=intel_prof, ran xspec, gprofed xspec and got this: My timing shows that mp00ac was taking all the time, and that the code took about 83sec. I guess that most of you will regard the standard gnu profiling extension as being the more reliable, but my timing agrees with my watch!! Perhaps the difference is because I am using MPI_WTIME() as to provide the time. What is more important: the "timing" provided by the so-called standard gnu profiling extension, or the time that the user has to wait to get the result. |
I am still curious about the timing. Can the gnu profiling handle handle the calls to the NAG library? Also, is there a way to "profile" the memory requirements? I have absolutely no clue how much memory SPEC uses, but I am sure that the memory can be reduced significantly. |
@SRHudson , no it can't tell you time spent in those subroutine because we don't compile those libraries, we only link to them. However, while gprof is nice, there are better tools. And the discrepancy between gprof and wall clock timing is well documented. Regarding memory, you just need to compile with the -g option and run with valgrind as a first pass. The tool is good for detecting uninitialized variables, failures to deallocate variables, and other memory related issues. At some level memory reduction begins to interfere with readability of codes. I suspect this is why many legacy codes read more like letter salad as they attempt to reuse variables. |
OK I tried the Ltiming option and also got that mp00ac takes most of the time (by far: 10 times more than any other subroutine). However, it is weird because if I rerun the same case it prints out different numbers, e.g. for the mp00ac subroutine:
SECOND RUN
while both runs took the same "watch time"...also, in the ipp-cluster I get all zeros if I try Ltiming...perhaps some environment variable related to timing is messing up? In any case, it looks like Ltiming says that mp00ac dominated cpu-time, and gprof says that ma00aa dominates cpu-time: maybe the latter neglects NAG calls, that could be the reason. If this is the case, then we should trust Ltiming output and try to optimize mp00ac. |
|
If I understand correctly, there is an ongoing "speedup effort" which translates into modifications of the ma00aa branch (which should change to a more generic name since it seems that other routines like mp00ac are crucial). If that is the case, then we should cancel the "pull request" that is been open for a while with the initial idea of merging that branch into master - but that should take quite some time. |
I have pushed my changes to ma00aa. I will take a break on this topic now, so this would be a good time for you to pull and merge this branch with the master. We should also merge the NAG_REPLACE with the master. I would expect that we should not let different branches get too far removed from each other, and it is helpful for checking the modifications to NAG_REPLACE if the executable is fast. Speed improvements to spec will be an ongoing exercise, which suggests that we close ma00aa and open another branch called "rapido", or "azogue", or "relampago". I will let our Spanish expert have the final say on this. |
The ma00aa branch has been merged into master and deleted. I would wait until the NAG_REPLACE branch reaches a point in which the code runs well (from the discussions, it seems that some debugging is on the way). Then I can perform some tests and compare the master and NAG_REPLACE branches, before considering merging. If some new "speedup" improvements are to be made, e.g. by exploiting symmetries in mp00ac, then the person starting such changes should create a new branch,
make some initial changes, and then,
|
Dear Speculators,
I am still working on reducing the computation time in ma00aa and other subroutines. I see that the ma00aa thread has vanished, but not to worry: I am becoming quite adept at opening new issues!
If you examine
http://w3.pppl.gov/~shudson/Spec/ma00aa.pdf
you will notice that DToocc, DToocs, DToosc and DTooss do not depend on geometry. They need to be calculated only once during the entire calculation. (The Chebyshev resolution may differ in each volume, so some re-indexing of these arrays will be required.)
Also, there are various symmetries of the other arrays constructed in ma00aa that could be exploited. Now that I am getting back into the fine print of SPEC, this should not be too difficult, and I intend to pursue this task. It may be that I get further reduce the time of ma00aa by a factor of 4.
However, I would like to measure timing before and after the changes. You are already aware, I believe, of the timing improvements that I made to ma00aa. This was no big achievement really, as this routine had not yet exploited stellarator symmetry. Also, I believe that the l2stell input file is too tiny to really see the impact of the speed improvements that I intend to pursue, and whether the bottleneck in SPEC when we use large Fourier resolution and a large number of interfaces is really with ma00aa or somewhere else. We need to get SPEC fast on experimentally relevant input files. (Eventually, we want every part of the code to be maximally efficient. This is just a question of priority.)
So, I suggest that Joaquim consider the following:
I will proceed on exploiting all the symmetries and so associated with ma00aa and the related routines.
The text was updated successfully, but these errors were encountered: