
Making 2Q GST Go Vroom Vroom #496

Merged 35 commits on Dec 16, 2024

Conversation

coreyostrove
Contributor

@coreyostrove commented Oct 7, 2024

Summary

Two-qubit GST has been a perennial computational headache, serving as a bottleneck for its practical deployment in many settings. The changes in this PR go a long way toward making this analysis faster, with a focus on improving the computational efficiency of the MapForwardSimulator.

BLUF: With the changes in this PR, end-to-end two-qubit GST runs a factor of 6 faster than the current default configuration. Details on the test parameters are given later in this post. Notably, error generator parameterized models are now nearly as fast as process matrix parameterized models (or at least significantly closer in performance than they were before).

Guiding Principle:

Profiling end-to-end GST on develop points to one clear culprit for why two-qubit GST is so costly: the computation of Jacobians, specifically the derivatives of circuit outcome probability distributions with respect to model parameters. Previous profiling found that ~90-95% of the total end-to-end runtime is accounted for by the Jacobian calculation. As such, while a few other inefficiencies are addressed, the main focus of the changes in this PR is to accelerate that computation.

As a note to the poor souls reviewing this: please feel free to ask for more details wherever necessary. If you're reading something and wondering why a given change was made, the answer is almost certainly that I profiled the change and found it to be meaningfully faster (where I'll nebulously choose to leave "meaningfully" undefined, but you can safely assume it excludes "a wee bit"). Essentially every change was A/B tested using various profiling tools.

Code Updates:

Lazier Model Parameter Updates

This first change is one of the simplest, and was guided by an observation I made while poking around in the model parameter maintenance code inside OpModel while working on #482. Namely, the way we were doing model parameter updates was rather ham-fisted, specifically in the context of Jacobian calculations. For context, with the MapForwardSimulator Jacobians are computed using finite differencing. The way we had been doing model parameter updates for each perturbation in the FD loop used a call to the model's from_vector method. This method, however, updates the representations of every member of a model, regardless of whether any of the parameters owned by that model member actually changed. This is a big deal, especially for error generator parameterized models, where profiling found that ~75% of the time spent on Jacobian calculations was going to the model updates alone.

This PR addresses this by introducing new methods on the OpModel class, set_parameter_value and set_parameter_values, which allow updating individual parameters or specific subsets of them. Importantly, these methods are backed by additional infrastructure in the OpModel class which allows updating just those model members that a given parameter touches.
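To make the distinction concrete, here is a minimal, self-contained sketch contrasting an eager from_vector-style update, which rebuilds every member, with a lazy per-parameter update inside a finite-difference Jacobian loop. The ToyModel class, its member bookkeeping, and fd_jacobian below are all hypothetical stand-ins, not pyGSTi's actual OpModel or simulator code:

```python
import numpy as np

class ToyModel:
    """Hypothetical stand-in for OpModel: two 'members', each owning a
    slice of the parameter vector."""
    def __init__(self, vec):
        self.paramvec = np.array(vec, dtype=float)
        self.member_params = {"Gx": (0, 1), "Gy": (2, 3)}  # member -> owned indices
        self.rebuild_count = 0  # counts (expensive) member representation rebuilds

    def from_vector(self, vec):
        # eager update: rebuilds EVERY member's representation, changed or not
        self.paramvec = np.array(vec, dtype=float)
        self.rebuild_count += len(self.member_params)

    def set_parameter_value(self, index, value):
        # lazy update: rebuild only the members that own this parameter
        self.paramvec[index] = value
        self.rebuild_count += sum(1 for idxs in self.member_params.values()
                                  if index in idxs)

def fd_jacobian(model, f, eps=1e-7, lazy=True):
    """Forward finite-difference Jacobian of f(model) w.r.t. model parameters."""
    base = f(model)
    jac = np.empty((base.size, model.paramvec.size))
    for i in range(model.paramvec.size):
        orig = model.paramvec[i]
        if lazy:
            model.set_parameter_value(i, orig + eps)
        else:
            pert = model.paramvec.copy()
            pert[i] += eps
            model.from_vector(pert)
        jac[:, i] = (f(model) - base) / eps
        # restore the unperturbed parameter before moving to the next one
        if lazy:
            model.set_parameter_value(i, orig)
        else:
            pert[i] = orig
            model.from_vector(pert)
    return jac
```

With four parameters split across two members, the lazy path performs half as many member rebuilds as the eager path while producing the same Jacobian; the gap grows with the number of members.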

While primarily targeted at speeding up error generator parameterized models, this change nontrivially speeds up the performance of process matrix models too.

Quick note: I couldn't be arsed to figure out how to propagate this through interposers, so this only works as described above for models without an interposer. If we detect one, these methods instead fall back to using the from_vector method to perform the updates.

Faster Model Parameter Updates

I also sped up the code for performing the model parameter updates themselves, specifically for error generator parameterized models. The main changes were to LindbladCoefficientBlocks and LindbladErrorgen, where in both cases the key change was further vectorization of their update implementations: for the coefficient blocks, vectorizing the block data updates, and for LindbladErrorgen, further vectorizing the error generator matrix reconstruction step.
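As a rough illustration of the kind of vectorization involved (a sketch of the idea only, not pyGSTi's actual LindbladErrorgen code), reconstructing an error generator as a weighted sum of basis matrices can be done in a single tensor contraction instead of a Python loop over blocks:

```python
import numpy as np

def errgen_from_coeffs(coeffs, basis_mxs):
    """Reconstruct an error generator L = sum_i c_i * B_i in one contraction.

    coeffs: shape (n,) array of coefficients c_i
    basis_mxs: shape (n, d, d) array stacking the basis matrices B_i
    """
    # one einsum call replaces a per-coefficient Python loop
    return np.einsum('i,ijk->jk', coeffs, basis_mxs)
```

For large coefficient counts the single contraction avoids per-block Python overhead, which is the same effect the vectorized update code is after.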

Circuit Parameter Dependence

Another inefficiency occurs when we calculate the derivative of a circuit's outcome probabilities with respect to a model parameter that it simply doesn't touch. Most commonly this is because the gate associated with that parameter doesn't appear in the circuit. As such, by performing forward simulation only for those circuits a parameter is known to touch, we can potentially save a lot of redundant computation. To get a sense of how big a savings one might expect from this change, we can calculate the average fraction of parameters a circuit depends on in the context of a GST experiment design. I seem to have misplaced the notebook where I calculated this (so I can't confirm the exact maximum depths of the experiment designs), but for one qubit I believe I tested a depth-128 XYI gate set experiment design (from the modelpack), and for two qubits a depth-16 XYCPHASE experiment design (again from the modelpack). There I found the average circuit depended on 79% and 67% of the parameters, respectively. As such, we can in principle get up to a 21% speedup in the one-qubit case and a 33% speedup in the two-qubit case.

To support this we now have a new method on the OpModel class called circuit_parameter_dependence. This takes as input a list of circuits and returns a dictionary whose keys are the input circuits and whose values are the lists of indices of model parameters touched by each circuit. Here "touched" should not be construed as implying sensitivity to that parameter, per se, but simply that the model parameter is interacted with, period. Optionally a reverse mapping can also be returned (which is likewise useful), with keys given by model parameter indices and values given by lists of circuits which touch that parameter.
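A toy version of the forward and optional reverse mapping might look like the following. Everything here is a hypothetical stand-in: circuits are represented as tuples of gate labels and the parameter ownership comes from a plain dict, rather than real Circuit objects and an OpModel's parameter bookkeeping:

```python
def circuit_parameter_dependence(circuits, gate_param_indices, return_reverse=False):
    """Map each circuit to the sorted model-parameter indices it touches.

    circuits: list of tuples of gate labels
    gate_param_indices: dict mapping gate label -> parameter indices it owns
    """
    dep = {c: sorted({i for gate in c for i in gate_param_indices[gate]})
           for c in circuits}
    if not return_reverse:
        return dep
    # reverse mapping: parameter index -> circuits that touch it
    reverse = {}
    for c, idxs in dep.items():
        for i in idxs:
            reverse.setdefault(i, []).append(c)
    return dep, reverse
```

The reverse mapping is what the Jacobian-oriented prefix tables described below need: for each parameter (equivalence class), the subset of circuits worth simulating.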

Not running redundant circuits is all well and good, but a problem arises doing this in practice with the MapForwardSimulator when coupling this change with the simulator's caching functionality. The TLDR here is that the MapForwardSimulator makes use of something called a PrefixTable (more to be said on this subject soon) to identify and describe an efficient evaluation strategy for a list of circuits, which can include caching intermediate results. As such, while a circuit might be redundant, if its final output state was something the simulator expected to be cached as an intermediate result and it wasn't run, this would break things. To remedy this, the solution I came up with was to introduce a new PrefixTable container/construction imaginatively called PrefixTableJacobian. This constructs prefix tables for each of the model parameters (or, technically, model parameter equivalence classes), each of which includes only the subset of circuits that parameter touches in its evaluation strategy. There is some additional structure here; the equivalence class business I alluded to is really just a fancy way of saying that most of the time we only need to do this construction once per gate instead of once per parameter, so the actual construction isn't as expensive as it might otherwise sound.

After implementing all this and modifying the map simulation code for the python implementation (densitymx_slow), the circuit parameter dependence update coupled with the new prefix tables gave almost exactly the predicted speedup when profiled, for both one and two qubits. (Note to self: I don't remember how carefully I did correctness testing for densitymx_slow when making these changes, so I should triple check this still gives the correct answer.)

Implementing this for the cython implementation I unfortunately didn't get a net speedup, but I have very good reason to believe this is mostly a skill issue on my end. The TLDR is that despite my best efforts I wasn't able to figure out how to re-use the memory allocated for the prefix cache across the different parameter equivalence classes without crashing the compiled C code at runtime. As such, the only way I was able to get this to work was by making a brand new prefix cache for every parameter, but in practice these extra memory allocations are expensive and turned this into a wash. I am pretty confident that with around an hour of pair programming alongside someone with basic competency in C/C++ (which, despite two semesters of it as an undergrad, I seem to have almost completely forgotten) we could fix the memory allocation issues and realize this speedup. See commit 4b44ed5 for more on the stab I took on the cython side.

For now the use of this functionality is gated behind the densitymx_slow evotype, and I'm sure folks will think of other uses for the circuit parameter dependence code in OpModel. One last note: I'm pretty sure this won't work for models with parameter interposers, but looking back at my code I don't seem to have added handling for detecting this edge case, so that will need to be added before we merge this in.

PrefixTable Costing

Some new methods have been added to PrefixTable for calculating the total and per-circuit cost of the evaluation strategy it contains, measured in units of 'state propagations', i.e. the number of matrix-vector multiplications needed to evaluate every circuit's outcome probabilities*.

*Depending on the effect reptype being used this might undercount by one (per circuit). I'm not sure how we would detect this in the PrefixTable code, though, since it is a model-dependent detail which the table doesn't presently have access to (and I'm not positive it would be a good idea to plumb that in).

PrefixTable Inefficiency Fix and New Splitting Algorithm

While testing the new costing functions I noticed some strange behavior which was easy to spot in that format, but would be harder to spot in the format used for specifying the evaluation strategy unless you were specifically looking for it. This is best demonstrated with an example. Below are two circuits along with their cost in state propagations under the evaluation strategy constructed by the PrefixTable:

Circuit(rho0[][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][]@(0)) : 32
Circuit(rho0[][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][]@(0)) : 16

What's the deal? If we're using caching, shouldn't the first circuit only require 16 additional propagations? It turns out the issue was related to using the circuit comparison logic in the Circuit class as the sorting function, which I think is borked and have reported on in more detail in #487. The fix for this issue was fairly straightforward: switch to sorting the list of circuits by their length instead of with this borked method before doing the prefix detection, after which the resulting evaluation strategies correctly identified the second circuit as a prefix of the first. The problem was that the splitting algorithm (used to divide circuits into shorter lists to distribute among atoms) was now working like crap, incurring a huge additional cost in terms of the number of state propagations that needed to be performed in aggregate. I now understand why: it is related to the greedy partitioning used by the original splitting algorithm, which previously avoided this behavior essentially by accident, because of the manner in which it interacted with the weird sorting.

In the course of addressing the now-crapified prefix table splitting performance, the splitting algorithm has been entirely reimplemented. Here is an overview of the new algorithm:

To begin, we take the list of circuits to be evaluated and transform them into a tree structure. For each circuit, beginning at the state preparation, we loop through the circuit layer by layer. For each layer we look through the children of the node where we're presently located: if the next layer in the circuit is already a child we move to that node; if not, we add a new child node and move to it. We repeat this process until we reach the end of the circuit, at which point we mark the final node in our traversal with the index of the circuit in our original list which is evaluated by the sequence of gates beginning at the root of the tree and terminating at that node. The final tree constructed in this manner summarizes the optimal evaluation strategy for this set of circuits and by construction minimizes the total number of state propagations needed to evaluate all of them. This is easier seen than described, so here is an example of such a tree for the evaluation of the subset of circuits in the standard XYI gate set's modelpack experiment design containing an idle gate (the full experiment design gets a bit unwieldy to plot).

[Figure: evaluation_tree_example — evaluation tree for the idle-gate subset of the XYI modelpack experiment design]
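The tree construction described above can be sketched as follows (a minimal illustration only, not pyGSTi's PrefixTable implementation, which carries considerably more bookkeeping):

```python
def build_eval_tree(circuits):
    """Build a prefix evaluation tree: each node maps a layer label to a
    child node, and records which circuit (if any) ends at that node.

    circuits: list of tuples of layer labels
    """
    root = {"children": {}, "circuit": None}
    for idx, circ in enumerate(circuits):
        node = root
        for layer in circ:
            # reuse an existing child for this layer, or create a new one
            node = node["children"].setdefault(
                layer, {"children": {}, "circuit": None})
        node["circuit"] = idx  # this node's root-to-here path evaluates circuits[idx]
    return root

def count_edges(node):
    """Total edge count == minimal number of state propagations with caching."""
    return sum(1 + count_edges(child) for child in node["children"].values())
```

Summing the edges of the tree recovers the minimal propagation count, matching the costing described in the previous section.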

The picture above summarizes the optimal evaluation strategy for a set of circuits in terms of the number of state propagations when using caching. By summing the number of edges one can recover the minimal number of propagations required. When we're using multiple workers, how do we partition this tree among them efficiently? Here there are two competing notions of efficiency. One aims to balance the number of circuits whose outcome probabilities are being calculated on each worker; this is the notion most relevant if the goal is to balance the memory requirements of the simulation among workers. The other aims to balance the amount of computation (total number of state propagations) among workers to minimize idle time. Any partitioning scheme will necessarily introduce redundancy, so in both cases we also want to minimize the amount of redundant computation introduced. You'll notice that in the figure above the edges of the tree are labeled with weights that increase monotonically with distance from the root node. These edge weights represent the additional cost, in redundant state propagations, incurred by cutting that edge to partition the tree.

The problem of partitioning circuits among workers can now be stated as a combinatorial graph-theoretic optimization. Given an evaluation tree where nodes are weighted either by their evaluation cost or by the number of circuits in the original circuit list they correspond to (depending on which of the two aforementioned cost functions one is using), and where the edges are weighted by the computational cost of cutting them, the problem is to find the optimal set of cuts which balances the cost of the new subtrees while minimizing the cost of the edge cuts. It turns out (perhaps unsurprisingly) that there is a lot of CS research on algorithms for this problem. The algorithm I ended up implementing is based on the algorithm of Kundu and Misra. This is a linear-time algorithm for solving the minimal cardinality variant of the optimal k-partition problem. The TLDR is that an optimal k-partition in this context is a set of partitions such that the total node weight of each partition is less than or equal to a user-specified value k, and such that the weight of the cut edges is minimized. The KM variant of the algorithm is also designed to return a minimum cardinality partitioning. (Note that in general the KM algorithm doesn't minimize the weights of the edge cuts for generic edge weightings, but the monotonic edge weight structure present in our problem is a special case where it does.)

The KM algorithm is efficient and simple, but the problem it solves is slightly different from the one we really need solved, which limits its use straight out of the box. When looking for a partitioning for our purposes we typically have a desired number of partitions in mind (determined, e.g., by how many cores we have available). But because the nodes of our tree have nonuniform weights, an optimal k-partition isn't guaranteed to have the particular number of components we're looking for. To address this I have added some additional heuristics. Here is a sketch of how they work:

  1. Specify a target number of partitions. Run the KM algorithm with k=ceil(num_circuits/num_partitions). If the returned partitioning has the target number of partitions, sweet, we're done!
  2. If the returned partitioning has too many partitions, we iteratively change the value of k and re-run the KM algorithm until we find a value of k for which the number of returned partitions is less than or equal to the target value. To search over the values of k we perform a bisection. If the new candidate partitioning has the target number of partitions, noice, we're done!
  3. Now that the number of partitions is less than the target, we move on to the next heuristic. In this stage we rank the partitions by weight. Then, while the number of partitions is still less than the target value, beginning with the heaviest partition we search through its nodes and identify the least expensive edge cut that produces two subtrees with weights as close to 50-50 as possible. We repeat this process as many times as necessary to hit the target number of partitions (performing multiple passes if needed).
  4. We now have the target number of partitions, and the additional redundancy is still kept to a minimum, but because of the previous heuristic we likely have a rather imbalanced set of partitions. So the final heuristic is to perform what I call a 'rebalancing pass'. Here a user specifies a balance parameter which gives a desired maximum ratio between the heaviest and lightest partitions. We sort the partitions by weight and then pair up heavy and light partitions (heaviest with lightest, second heaviest with second lightest, etc.). For each pair we search the heavy partition for the cheapest edge cut (adding the resulting subtree to the light partition) which brings the ratio of the pair's weights below the specified threshold. We repeat the process until we either exhaust the list of pairs or find a pair with a sub-threshold weight ratio, at which point we're done.
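The bisection search in step 2 can be sketched like this, where the partition-count function is a hypothetical stand-in for running the KM algorithm with capacity k and counting the returned partitions (that count is nonincreasing in k):

```python
def bisect_capacity(num_partitions_for_k, k_lo, k_hi, target):
    """Find the smallest capacity k in [k_lo, k_hi] whose partition count
    is <= target, assuming the count is nonincreasing in k.

    num_partitions_for_k: callable k -> number of partitions returned by
    the (stand-in) KM algorithm at capacity k.
    """
    while k_lo < k_hi:
        mid = (k_lo + k_hi) // 2
        if num_partitions_for_k(mid) <= target:
            k_hi = mid       # feasible: try smaller capacities
        else:
            k_lo = mid + 1   # infeasible: need a larger capacity
    return k_lo
```

Compared to a linear scan over k, this needs only logarithmically many (potentially expensive) KM runs.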

In practice this all seems to work very well. The updated prefix table logic (fixing that bug) alone reduces the number of state propagations needed from ~3000 to ~2000 and from ~12000 to ~10000 for the test 1Q and 2Q experiment designs, respectively. In tests targeting partitions for 8 workers I found we were getting well-balanced partitions with very few additional state propagations introduced (~20-30 for the 1Q case, and ~100 for the 2Q case). Anecdotally, comparing CPU utilization between the old implementation (i.e. the tip of develop) and the new one, I also saw meaningfully better load balancing across workers (good for about 10-15% extra CPU utilization on my machine, which points toward less idle time on lighter ranks).

Cython Microoptimizations

Nothing too groundbreaking here, but there are a bunch of cython-related microoptimizations. These include toggling off wraparound and bounds checking where appropriate/safe, making use of cython memoryviews for direct numpy buffer access with lower indexing and assignment overhead, adding some previously unspecified typing, and generally decreasing the number of interactions and calls back into python code (cython has tools which identify these interactions for you).

In case you're curious about the impact of these changes, if the commit message I wrote at the time is to be trusted these changes in aggregate account for around a 10% speedup. Nothing amazing, but also basically free.

Miscellaneous

Also included:

  • Bugfix for circuits related to pickling (which is used by mpi4py internally for communication).
  • Bugfix for comparing LabelTup to LabelStr.
  • Spring cleaning for some old, no longer used code (plus some code that was previously marked for removal).
  • Fixed out-of-date documentation and added missing documentation in lindbladerrorgen.py

Benchmarking Results:

All benchmarks were run on a workstation laptop with a 6-core Intel(R) Xeon(R) W-10855M CPU and 32GB of RAM. Multithreaded tests used pyGSTi's MPI-based parallelization with additional implicit parallelization from OpenMP and other packages turned off. For the single-threaded tests this implicit parallelization was left on (i.e. run with default settings).

All times are in seconds.

Jacobians:

The table below summarizes some before-and-after profiling of the Jacobian calculation specifically. These timings exclude the setup time for the Jacobian calculations associated with layout creation (such setup is a do-once, use-many-times computation and is amortized away in practice).

Single Threaded:

| Caching? | Model Type | Num Qubits | Map (New Dense) | Map (New Sparse) | Matrix (New Sparse) | Matrix (New Dense) | Map (Old Dense) | Map (Old Sparse) | Matrix (Old Default Evotype) |
|---|---|---|---|---|---|---|---|---|---|
| Yes | TP | 1 | 0.018 | nd | nd | 0.057 | 0.048 | nd | 0.063 |
| Yes | GLND | 1 | nd | 0.116 | 0.119 | nd | nd | 0.527 | 0.139 |
| Yes | CPTPLND | 1 | nd | 0.113 | 0.118 | nd | nd | 0.522 | 0.119 |
| Yes | TP | 2 | 5.97 | nd | nd | 110.691 | 14.436 | nd | 276.203 |
| Yes | GLND | 2 | 13.402 | 21.932 | 100.424 | 151.452 | 333.816 | 151.207 | nd |
| Yes | CPTPLND | 2 | 13.451 | 21.178 | 150.025 | 150.24 | 334.734 | 131.554 | nd |
| No | TP | 1 | 0.067 | nd | nd | nd | 0.085 | nd | nd |
| No | GLND | 1 | 0.91 | nd | nd | nd | nd | 1.297 | nd |
| No | CPTPLND | 1 | 0.864 | nd | nd | nd | nd | 1.299 | nd |
| No | TP | 2 | 29.727 | nd | nd | nd | 53.504 | nd | nd |
| No | GLND | 2 | 40.685 | 112.767 | nd | nd | 418.277 | 370.407 | nd |
| No | CPTPLND | 2 | 40.71 | 104.055 | nd | nd | 375.004 | 297.218 | nd |

End-to-end 2Q GST:

All results are for the CPTPLND parameterization with a two-qubit experiment design for the XYCPHASE gate set with L=32 and a total of 10997 circuits.

Multithreaded:

| Map (New Dense Evotype) | Matrix (Old Default Evotype) | Map (Old Default Settings) | Map (Old w/ Caching) |
|---|---|---|---|
| 1366.3 | 8294.5 | 17470.6 | 16051.8 |

Single threaded:

| Map (New Dense Evotype) |
|---|
| 11879.3 |

Corey Ostrove added 25 commits September 10, 2024 22:15
Add in new methods to the OpModel class which add in support for updating one or a few parameters at a time. This reduces overhead of parameter updates by only updating modelmembers touched by those parameters.
This reworks the implementation of the map forward simulator's finite difference implementation to use the new lazier model parameter update methods, resulting in a significant speedup in jacobian calculations.
Take the previous partial vectorization of the from_vector method for coefficient blocks and fully vectorize it in order to speed it up even further. Also includes a specialized cython implementation for triangular index generation.
Updates the implementation of a tensor contraction operation that is used in jacobian calculations for the matrix forward simulator to use numpy's einsum with an optimal contraction strategy. In profiling this sped up the _dprobs_from_rho_e call by around a factor of 6 for the two-qubit test case.
Initial implementation of a method for identifying which model parameters a given circuit touches.
Add in a new method for OpModels which gets the parameter dependence for a set of circuits. This is then used in the construction of a new parameter dependence aware prefix table implementation tailored for jacobian calculations that reduces the amount of redundant computation performed. Update the generic map forward simulator calculation module to use this new (more efficient) codepath.
This is the initial pass at getting the updated cython implementation for parameter dependence consideration in jacobian calculations working.
A bunch of cython microoptimization: Disabling wraparound where appropriate, more typing, reduced python object interaction, use of typed memoryviews, etc.

Additionally cleans up some comments and fixes an annoying compiler warning coming from the setup config.
Detailed profiling of the new parameter dependence logic in the cython code suggests that at best we break even with the older implementation using caching. As such, it is better to revert this implementation. We'll also want to detect when cython is being used to avoid building the jacobian prefix tables here, as they won't be getting used and they are an expensive upfront cost.

Keep all of the cython microoptimizations and compiler directives though, as these seem to amount to about a 10% net win once the rest of the extra logic has been stripped.
Introduces a number of re-implementations and refactors for the PrefixTable class. This includes:
- A number of new methods for quantifying the cost and performance of constructed evaluation strategies
- Fixes to broken logic that resulted in suboptimal prefix identification.
- A new implementation of the splitting logic for constructing atoms which leverages tools from graph theory to build
more efficient distributions of work among workers.
Includes a number of performance improvements and refinements to the implementation of the KM tree partitioning algorithm. Changes include:

- More efficient re-use of computed subtree weights and level partitions
- A custom copying function that avoids the use of the incredibly slow deepcopy function.
- Less copying in general by changing when graph modifications are applied.
- Bisection instead of linear search for getting the initial KM partition.
Change the default atom count heuristic so that only a single atom is created when there is a single processor and no memory limit.
Cleans up the lindbladerrorgen.py module.

- Removes large blocks of old commented-out code and debug statements.
- Adds new docstrings for methods that were previously missing them.
- First pass at bringing the existing docstrings up to date with current implementation.
Update the implementation of the error generator representation update code for the dense rep case. The results are functionally identical, but are measurably faster. (Einsum is ~2-3X faster than tensordot for this particular case, e.g.). We also now do the entire error generator construction in a single shot instead of block-by-block to get additional benefits from vectorization.
The start of composed effect was 300 lines of old commented out implementation. This commit is simply to remove that fluff.
Refactor the single parameter `set_parameter_value` method to call the multi-parameter implementation under the hood.

Add additional performance tweaks to the logic for determining when to update an element of the cache. What I came up with: when the layer rules are the ExplicitLayerRules and we are updating an effect known to belong to a POVM that has already been updated, we can skip the cache update for those effects.
Add a few tweaks to the PrefixTable splitting algorithm to support multiple native state preps. Also fixes a bug for __contains__ comparisons between LabelTupTup and LabelStr. Add support for setting model parameter values by their parameter labels.
Can now specify a parameter label in addition to an integer index for model parameter updates.
Fixes an inefficiency in dm_mapfill_probs that was causing effect reps to be recalculated unnecessarily, a big performance penalty, especially for composed error generator type reps.
Slightly more efficient parity implementation (fewer operations), and add the compiler hint to inline the parity function. In profiling this makes a surprisingly big difference.
Adds in pickle handling for circuits to address an issue with hash randomization.
This commit makes changes to the implementation of the new prefix table splitting algorithm in an attempt to make it deterministic. Previously we made direct use of a number of networkx iterators which turned out to be set-like, and as such had nondeterministic order for their returned values. This is fine in the single-threaded setting, but with MPI it meant we would end up with different splittings on different ranks (I expect there are many degenerate solutions to the splitting problem, so this isn't a crazy thing to see). This resulted in bugs when using MPI. Hopefully this fixes things...
Corey Ostrove added 2 commits October 21, 2024 20:24
Add checks for the existence of a model parameter interposer when using the circuit parameter dependence code. Currently that option is not supported.
Attempt at updating the default atom count heuristic to favor having the same number of atoms as processors. Should be revisited at some point to confirm it performs as anticipated.
Corey Ostrove added 4 commits October 21, 2024 21:37
Remove some profiling and debug bindings for cython extensions.
This is my first pass at updating the default evotype behavior for casting so that we prefer dense representations when using a small number of qubits. Right now the threshold is arbitrarily set to 3 qubits, but this should be reevaluated as needed.
Fix the evotype casting I accidentally broke with a typo.
Change the default maximum cache size from 0 to None for the map forward simulator.
@eendebakpt
Contributor

@coreyostrove Nice to see such speedups! I tried the PR in its current state with full TP mode. Profiling results show quite some time spent in threading.wait. Is there a way to disable threads? It is not clear to me where these threads originate from (searching for threading in the pyGSTi codebase does not yield any calls).

For CPTP I cannot run GST any longer. The example from the 2-qubit notebooks results in: ValueError: Could not interpret 'CPTPLND' mode as a parameterization!. A minimal example:

import pygsti
from pygsti.modelpacks import smq2Q_XY
import time

target_model = smq2Q_XY.target_model('full TP')
exp_design = smq2Q_XY.create_gst_experiment_design(max_max_length=2, fpr=True)
pygsti.io.write_empty_protocol_data("example_files/My2QExample", exp_design, clobber_ok=True)
mdl_datagen = target_model.depolarize(op_noise=0.1, spam_noise=0.01)
pygsti.io.fill_in_empty_dataset_with_fake_data("example_files/My2QExample/data/dataset.txt",
                                               mdl_datagen, num_samples=1000, seed=2020)

data = pygsti.io.read_data_from_dir("example_files/My2QExample")

m = "CPTPLND"
protocol = pygsti.protocols.StandardGST(m, optimizer={'tol': 1e-2}, verbosity=2)
results = protocol.run(data, memlimit=5*(1024)**3)

@coreyostrove
Contributor Author


Thanks for the detailed reply. Could you confirm whether you've already rebuilt your cython extensions? The exception you're hitting is raised inside a try-except block around the cython code, and I suspect the original exception is being transformed into the one you're seeing. We should probably have that try-except block re-raise with the original exception chained, so it's good to have caught this. If that isn't the issue let me know, but I ask in part because I wasn't able to recreate the error on my end.
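The chaining fix described above can be sketched as follows (illustrative names only; the mode table and function are hypothetical stand-ins, not pyGSTi's actual internals):

```python
# Illustrative sketch of exception chaining: 'raise ... from orig'
# attaches the root cause to __cause__ instead of masking it, so the
# traceback shows both the fallback failure and the original error.
# The names below are hypothetical, not pyGSTi internals.
_PARAMETERIZATIONS = {"full TP": "tp-builder", "CPTP": "cptp-builder"}

def interpret_parameterization(mode):
    try:
        return _PARAMETERIZATIONS[mode]
    except KeyError as orig:
        # chain the original exception rather than swallowing it
        raise ValueError(
            f"Could not interpret {mode!r} mode as a parameterization!"
        ) from orig
```

With chaining, a user hitting the ValueError still sees the underlying KeyError in the traceback via `__cause__`.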

If you do want to continue playing around with this branch, I'd recommend checking out commit 383e4bd rather than working off the tip of the branch for the time being. One of the preceding commits, which changed the default evolution-type behavior to prefer dense representations, revealed a bug that crops up in combination with the MatrixForwardSimulator class and is still being tracked down (the main reason this hasn't been promoted from a draft PR yet). This shouldn't affect the MapForwardSimulator, and to see the biggest improvements I'd recommend explicitly setting the model's forward simulator to that class when doing your testing (setting the maximum cache size to None, and setting the prefer_dense_reps attribute of the evotype to True at model construction time). This will be made less clunky once we've fixed the aforementioned bug.

@eendebakpt
Copy link
Contributor

@coreyostrove Rebuilding the cython extensions indeed resolves the issue. I updated the example with

from pygsti.forwardsims import MapForwardSimulator

target_model.evotype.prefer_dense_reps = True
simulator = MapForwardSimulator(max_cache_size=None, model=target_model)
results = protocol.run(data, memlimit=8*(1024)**3, simulator=simulator)

to make sure the MapForwardSimulator is used.

Thanks!

@rileyjmurray
Copy link
Contributor

@coreyostrove can you share whatever scripts you have to run your benchmarks?

@coreyostrove
Copy link
Contributor Author

coreyostrove commented Nov 6, 2024

Sure thing. Here is the notebook used for generating the entries of the Jacobian table:
final_to_orig_profiling.zip

And here are the scripts used for the end-to-end two-qubit GST testing (with MPI):
end_to_end_two_qubit_GST.zip

Corey Ostrove added 3 commits December 11, 2024 11:02
Fixes a couple of bugs in LindbladCoefficientBlock. The first was related to an old parameter name that was missed in a previous refactor. The second was a bug in the superoperator Hessian calculation for the 'other' block. Erik fixed this for the Cholesky parameterization, but the same fix also needed to be applied for the 'elements' parameterization.

Unrelatedly, some documentation has been updated.
Fix a bug I introduced in basepovm. The effect vectors don't always have a state_space attribute initially, so the new logic accounts for cases where they start out as plain lists or arrays.
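A minimal sketch of that kind of defensive handling (illustrative only, not the actual basepovm code; the helper and attribute access pattern are assumptions):

```python
import numpy as np

# Illustrative sketch (not the actual basepovm logic): effect vectors
# may arrive as objects carrying a state_space attribute, or as plain
# lists/arrays, so fall back to the vector's length when the
# attribute is absent.
def effect_dimension(effect):
    space = getattr(effect, "state_space", None)
    if space is not None:
        return space.dim  # object already knows its state space
    return np.asarray(effect).shape[0]  # plain list/array fallback
```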
@coreyostrove coreyostrove marked this pull request as ready for review December 12, 2024 01:32
@coreyostrove coreyostrove requested review from a team and rileyjmurray as code owners December 12, 2024 01:32
Copy link
Contributor

@rileyjmurray rileyjmurray left a comment


I left a few minor comments. Once addressed, please ping me and I'll approve!

Remove some commented-out code and fix an incorrect docstring.
Copy link
Contributor

@rileyjmurray rileyjmurray left a comment


LETS GOOOOO

@coreyostrove
Copy link
Contributor Author

Thanks for the review, @rileyjmurray! I've pushed a commit addressing your comments and have marked the addressed ones as resolved.

Copy link
Contributor

@sserita sserita left a comment


Thanks for this massive effort @coreyostrove!

@sserita sserita merged commit 5e2465f into develop Dec 16, 2024
4 checks passed
@sserita sserita deleted the feature-lazier-model-param-updates branch December 16, 2024 21:28
@coreyostrove coreyostrove mentioned this pull request Dec 17, 2024
@sldesnoo-Delft
Copy link

Hi @coreyostrove,
A dependency on the networkx package was added in commit 67ae76b. This dependency is missing from the required packages in the setup configuration.

@coreyostrove
Copy link
Contributor Author


Thanks for the heads up, @sldesnoo-Delft!
