params.json
{
"name": "Kannon92.GitHub.io",
"tagline": "",
"body": "### Welcome to Kevin Hannon and Jeff Schriber's Project website.\r\nOur goal of this project is to parallelize some hotpots of our strongly correlated quantum chemistry program. \r\n\r\n### Week 0\r\nWe presented a brief summary of our problem. Jeff is going to work on a parallel program for computing the reference. We will present some of the main challenges of our report. \r\n\r\nKevin is trying to parallelize the treatment of dynamic correlation. Here is the main problems in treatment of dynamic correlation: \r\n+ Tensor contractions are required for computing dynamic correlation. \r\n+ Implementations of tensor contractions rely on efficient use of linear algebra software. \r\n+ The tensors are large, so I/O will be a requirement.\r\n\r\nKevin has already implemented an openmp version of this work that works well, but it would be nice to have the problem scale to more nodes. The first task of this project is to implement a MPI version of this algorithm. Kevin believes that this task can be done just by rewriting the code to use MPI rather than openmp. The algorithm only requires very little communication. This is ideal for MPI. However, a major complication is the use of I/O in MPI. This seems to be the main technical challenge for this project. \r\n\r\nThe second aspect of Kevin's part of the project is to attempt to use a hybrid MPI/OpenMP approach. MPI can be used to split the data up into large chunks and then openmp can be used to speed up the loops inside those chunks.\r\n\r\nKevin's main difficulty in parallelizing the code is how to deal with the I/O that must occur. As a first step, Kevin worked on using MPI to parallelize a part of the code that does not require I/O. This seems to work well.\r\n\r\nJeff's phase of the project centers on the construction of these objects. Since the only non-zero contributions of the RDM come from configurations connected by a few electron exchanges, intelligent algorithms precompute lists that map determinants to each other, such that no time is wasted computing a lot of zeros.\r\n\r\nThe difficulty in making these coupling lists is that (a) they can be very memory intensive and (b) they in general cannot be split up such that the RDM can be constructed on a distributed arcitechture. The first part of Jeff's work, which has already promising results, is to reformulate the algorithm such that these coupling lists can be computed in indepentend chunks to be passed across various nodes.\r\n\r\nWith a proper algorithm, the implementation of MPI needs to be optimized such that the communication is kept minimal. This will likely be a challenging part of the project. Parallelization with OpenMP on this problem has been done and is much easier, so a hybrid approach would not add too much additional difficulty.\r\n\r\nWith full parallelization of both aspects, we hope to study chemical systems larger than what conventional multireference methods currently can handle.\r\n\r\n### Week 1\r\n\r\nThe parallelization of Kevin's part of the code has been completed in a naive way. Kevin's program assumed that every MPI process will run a separate computation and write that information to disk. All processesses would run separately until they reached the computational bottleneck. At this point, MPI parallelization was used to reduce the computational cost. This step is relatively simple to parallelize, but there is a serious memory limitation when I went to >8 MPI processes. 
\r\n\r\nIn order to remove this limitation, I decided to force my code to run on only one process for almost the entire program. This way, I can reduce the memory requirement and send information to the other processes only when MPI is needed. This requires some rewriting of my program, so I have been working on implementing these changes. \r\n\r\nJeff is still working on reformulating the algorithm such that the creation of the lists can be parallelized along with the actual computation of the RDMs. While this transition is very easy for the 1-RDM, it is proving very difficult for the 3-RDM, though progress is being made. The 1-RDM only has two types of excitations: a single alpha electron (a) or a single beta electron (b) is excited. In the 3-RDM, there are four types of excitations: three alpha (aaa), three beta (bbb), and mixtures of both (aab, abb). These mixed-spin excitations are the source of the difficulty, as they prevent simple organization of determinants (i.e., the aaa excited determinants are organized in groups having the same beta string, as these couple to give non-zero contributions to the RDM). An optimal way to organize mixed-spin terms is currently being developed.\r\n\r\nMeanwhile, the computation of the RDMs can still be parallelized even using the old serial construction of the coupling lists. I plan to parallelize this in the next week to test whether appreciable speedup can be achieved. \r\n\r\n### Week 2 - 3\r\n\r\nKevin found a difficulty with the previous approach: too much memory was being used. Kevin recently fixed this problem by forcing most of the code to run on one node, except for the computational hotspots. The code now correctly runs with multiple MPI processes and can run across separate nodes. A major limitation of this implementation is that it requires the 3-index integrals to be available to all of the processes. Kevin is working on figuring out how to send separate blocks of the 3-index integrals so that broadcasting the entire array is not necessary. \r\n\r\nEven with this broadcast, the speed-up is quite good if the 3-index integrals are small enough. For a relatively small system (384 basis functions), the speed-up (8 threads / 1 thread) is 5.9. \r\n\r\nThe next steps of this project are to implement algorithms that do not require broadcasting the entire array. My strategy for parallelization is very similar to an OpenMP style right now: I use the MPI processes to separate the data, and I just have to figure out how to send the correct chunks of data without sending all of it. The last step is to enable the algorithm for use with a disk-based algorithm. By forcing the data to live on a single process, this becomes a little easier because I will now avoid contention with other I/O processes. \r\n\r\nTo compute the 1-, 2-, and 3-particle RDMs, lists that couple determinants by one-, two-, and three-particle excitations need to be computed (think of determinants as strings of 1s and 0s, where swapping the position of a 1 and a 0 represents a single excitation). To build these lists, two intermediate maps need to be built for each RDM. One maps the original determinant set to a list of singly (or doubly/triply) annihilated determinants, where 1, 2, or 3 particles are removed in all possible ways. The second list maps the annihilated determinants back to the original set. However, using determinants, these intermediate maps quickly become prohibitively large.
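\r\n\r\nTo make the structure of these coupling lists concrete, here is a rough sketch of one such intermediate map for single excitations. It is illustrative only: determinants are stored as plain 64-bit occupation strings rather than our actual determinant class, and the names are made up. Every determinant is annihilated in all possible single ways, and each annihilated string records the parents that produce it.\r\n\r\n```cpp
// Sketch only: remove one electron from each determinant in every possible
// way, and record which parent determinants lead to each annihilated string.
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Parent {
    std::size_t det_index;  // which original determinant produced this entry
    int orbital;            // which occupied orbital was removed
};

// key: determinant with one electron removed -> all parents that reach it
using SingleCouplingMap = std::unordered_map<std::uint64_t, std::vector<Parent>>;

SingleCouplingMap build_single_annihilation_map(const std::vector<std::uint64_t>& dets) {
    SingleCouplingMap map;
    for (std::size_t I = 0; I < dets.size(); ++I) {
        for (int p = 0; p < 64; ++p) {
            const std::uint64_t bit = std::uint64_t(1) << p;
            if (dets[I] & bit) {                       // orbital p is occupied
                map[dets[I] ^ bit].push_back({I, p});  // remove that electron
            }
        }
    }
    return map;
}
```
\r\n\r\nTwo determinants that appear under the same key differ by exactly one excitation, so only those pairs need to be visited when accumulating the 1-RDM. A map like this over full determinants is what becomes prohibitively large, which motivates the string-based reformulation described next.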
\r\n\r\nTo scale up this process, the intermediate maps need to be broken down, and this can be done by splitting the determinants into their alpha and beta spin components (these are called strings). For the 1-RDM this extension is easy to implement, since an excitation is either alpha or beta, and the same code as before can be used, but in the drastically smaller basis of strings. For the 2- and 3-RDMs, the implementation is much harder, since there are excitations that contain both alpha and beta particles. This past week Jeff has been working on an implementation that builds these components of the higher-order RDMs using the lists from the lower-order RDMs. The current implementation now works, but still needs some optimization. This represents the completion of the algorithmic changes; all that remains is a scatter of these lists among nodes, computation of the RDMs, and a concatenation to form the full RDM.\r\n\r\n### Week 4\r\n\r\nKevin has implemented two different algorithms for MPI. The first algorithm assumes that the three-index integrals are available to every MPI process: it broadcasts the entire three-index tensor to all the processes and uses it to form the four-index integrals, so it has a major limitation in its memory requirements. In the quantum chemistry literature, this is called a semi-direct algorithm. Kevin chose to keep this algorithm because it is actually quite fast and scales well. \r\n\r\nThe second algorithm reads only a matrix slice of the three-index tensor and sends that slice to a different process. This algorithm proceeds as follows:\r\n\r\n    Get a matrix slice of the tensor from rank 0 (N_Q x N_v)\r\n    Send this matrix slice to another process (N_Q x N_v)\r\n    Recv this matrix slice and compute the three-index tensor (N_c x N_v)\r\n    Compute the energy\r\n\r\nThis algorithm is relatively new, so it will require some testing. Its speed is limited by the amount of communication that is required (N_c * N_c communications). Rank 0 also does all of the reading and sending, which will become a bottleneck for large numbers of processes and large N_c. \r\n\r\nThis past week, Jeff finalized the serial algorithm for RDM construction. This consisted of fixing some fairly odd bugs in limiting test cases, e.g., for very large systems and for systems with only two determinants. Additionally, Jeff has combined this code with the MPI version of our larger program and gotten his end of the code to function normally on a single MPI process. Since the main difficulty (separating the data in the RDM) has been addressed in the serial version of the code, parallelization should be fairly straightforward, and progress is continually being made, though it is not yet complete.\r\n",
"note": "Don't delete this file! It's used internally to help with page regeneration."
}