diff --git a/content/20.data_objects.md b/content/20.data_objects.md
index ae95292..8532b71 100644
--- a/content/20.data_objects.md
+++ b/content/20.data_objects.md
@@ -25,7 +25,7 @@ The final mechanism by which data is selected is for discrete data points, typic
 At present, this is done by first identifying which data files intersect with a given selector, then selecting individual points.
 There is no hierarchical data selection conducted in this system, as we do not yet allow for re-ordering of data on disk or in-memory which would facilitate hierarchical selection through the use of operations such as Morton indices.
 
-### Selection Routines
+### Selection Routines {#sec:selection_routines}
 
 Given these set of hierarchical selection methods, all of which are designed to provide opportunities for early-termination, each *geometric* selector object is required to implement a small set of methods to expose its functionality to the hierarchical selection process.
 Duplicative functions often result from attempts to avoid expensive calculations that take into account boundary conditions such as periodicity and reflectivity unless necessary.
diff --git a/content/30.abstracting_simulation_types.md b/content/30.abstracting_simulation_types.md
index b762ab5..a2d148d 100644
--- a/content/30.abstracting_simulation_types.md
+++ b/content/30.abstracting_simulation_types.md
@@ -19,7 +19,7 @@ This chunking type is the most common strategy for parallel-decomposition.
 Necessarily, both indexing and selection methods must be implemented to expose these different chunking interfaces; `yt` utilizes specific methods for each of the primary data types that it can access.
 We detail these below, specifically describing how they are implemented and how they can be improved in future iterations.
 
-### Grid Analysis
+### Grid Analysis {#sec:grid_analysis}
 
 ![The grid structure of the simulation `IsolatedGalaxy`](){#fig:grid_organization}
 
diff --git a/content/68.future_directions.md b/content/68.future_directions.md
index 335a8a0..fd3871a 100644
--- a/content/68.future_directions.md
+++ b/content/68.future_directions.md
@@ -1,10 +1,47 @@
 ## Future Directions
 
+### Improvements to Internal Systems {#sec:improvements_to_internal_systems}
+
+**Optimization**
+
+The internal systems that conduct selection, caching and IO optimization, data processing and parallel load distribution have been designed for general purpose application.
+While this enables code reuse as well as consistent API patterns, the methods used to implement these systems internally in `yt` have not always kept up with the optimizations available.
+For example, the data selection routines (described in @sec:selection_routines) are not uniformly optimized to take advantage of built-in organizational information from grid, octree and particle data.
+A particular example is the [quad-tree projection](#sec:dobj-quad_proj).
+This projection method can be optimized for octree datasets for some speedups and memory improvements (which would likely then be shared with grid patch datasets) but the maintenance and implementation costs are not currently balanced in favor of that change.
+Future iterations on `yt` will necessarily need to take these possible optimizations into account in order to meet the needs of increasing dataset size and complexity.
+
+Other, more mundane optimizations can also be applied.
+While utilizing units for all array-based operations (see @sec:units) provides safety and clarity of the quantities being manipulated, it also introduces overhead from the symbolic manipulation of those units.
+For example, typically the units used for position inside `yt` are all in the internal "index" space of the dataset.
+For some cosmological simulations, these are normalized to 1.0; for other simulations, they may be in centimeters or kilometers.
+(And often in cosmology there is some factor of the Hubble constant somewhere in the units, which is typically *also* an input parameter to the simulation.)
+However, we cannot guarantee that all input coordinates to selection routines match this index space; in fact, doing so would render utilizing a unit library unnecessary.
+As a result, `yt` conducts a regularization of units inside selection routines to ensure that they are the same in the quantities being compared or evaluated.
+This requires symbolic math operations in SymPy, which can at times carry with them substantial overhead.
+Often, even verifying that units are identical requires expensive operations to be conducted.
+To alleviate this, providing some measure of immutable units or index-guaranteed units (and thus enabling the unit comparison process to be elided) would eliminate many expensive operations inside selection routines.
+This would likely have the biggest impact on operations like ghost-zone generation, which can be quite expensive.
+
+Another high-impact possible optimization is a conversion of the underlying infrastructure used for grids (@sec:grid_analysis) into a more compact and spatially-aware data structure.
+Specifically, utilizing an approach that uses the inherent spatial organization of a patch-based grid dataset (often in the form of an R-tree) can reduce the memory overhead of grid storage.
+While work has begun on this, modeling the grid-infrastructure after the "visitor" pattern used in the octree infrastructure, differences between the two (such as irregular sizes, different refinement patterns per level, etc.) have presented some difficulties.
+However, given a successful implementation, much of the code that provides access to IO and selection (i.e., @sec:chunking) should be able to be moved into optimized, tight-loop routines written in Cython, Rust or other lower-level languages.
+Utilizing these data structures will also enable access to bitmap-arrays for caching of data selection results, reducing overall memory usage and improving performance.
+
+**Testing infrastructure**
+#sec:unit_testing
+#sec:answer_testing
+
+### New Features {#sec:new_features}
+
 - More integration with _in situ_ analysis systems like `libyt`
-- Much improved optimization
-- Integration with other domains besides astronomy
-- Refactoring for the long term
+- Forward-looking utilization of accelerators and other array-oriented programming
+- Integration with external libraries such as pytorch-spatial, etc
+- Further integration with other domains besides astronomy
+
+### Quality of Life {#sec:quality_of_life}
+
 - Static typing
+- Refactoring for the long term
 - Improving visual representation of `yt` objects
-- Testing infrastructure
-- Integration with external libraries such as pytorch-spatial, etc