# RFC0021

## Title
PMIx Support for Storage Systems

## Abstract
Exascale systems expect to access information that resides in an array of storage media, ranging from offline archives to streaming data flows. This "tiered storage" architecture presents a challenge to application developers and system managers striving to achieve high system efficiency and performance. Multiple vendor-specific packages have been developed, each with its own unique API and associated data structures. However, this results in a corresponding loss in application portability and an increased cost of customer migration across procurements.

These API and attribute definitions are based on the recognition that gaining multi-vendor agreement on common interfaces and data structures is a difficult and long-term objective. Thus, this RFC proposes a more flexible approach that allows vendor independence by defining an abstraction layer for passing storage-related requests and directives based on PMIx APIs and data structures.

## Labels
[EXTENSION][CLIENT-API][SERVER-API][RM-INTERFACE]


## Action
[IN PROGRESS]


## Copyright Notice
Copyright 2017-2018 Intel, Inc. All rights reserved.

This document is subject to all provisions relating to code contributions to the PMIx community as defined in the community's [LICENSE](https://github.com/pmix/RFCs/tree/master/LICENSE) file. Code Components extracted from this document must include the License text as described in that file.

## Description
We can debate the levels, but for now we define a set of *temperatures* for files as follows:

* **hot** means the data is on the node. This could be in main memory, an attached NVRAM, or even a local disk should the node have one.
* **warm** means network-local - cached on storage attached to the switch the node connects to.
* **cool** means the data is in the online storage or file system. It is not sitting in an offline archive.
* **cold** means it is in an offline archive that requires someone to load it into online storage prior to access.

Our vision for this subsystem is to instantiate it as a set of plugins - e.g., one for Lustre and another for a "cold" storage tape system - all of which might be active at the same time. All function calls will be executed by scanning across all active plugins to give each of them a chance to respond. Responses will be aggregated before being returned to the caller. Each plugin, therefore, is only required to provide information for things it knows - e.g., a request for the time required to bring a *cold* file online might be answered by the storage tape plugin but generate no response from Lustre.

Initial thoughts on required functionality include the following areas:

---------

##### Accessibility
Given a list of files and a uid/gid (or credential), return their accessibility status. The objectives of the call are to:
 * discover any issues with the requested access - e.g., there is no point in scheduling a job if the user isn't allowed to access the required files
 * provide the scheduler with some idea of when the data might become available. For example, if everything is *hot*, then the scheduler might look at using the nodes those files are on for the requested job.

Because the status can involve multiple descriptive elements and can take a while to determine, this will be done using a non-blocking API that executes a callback when complete, returning an array of pmix_info_t structs for each of the given files (a rough sketch of how such a request might look follows the list below). Possible returned information includes:
* storage temperature
* for files in *cold* storage, an estimated time for retrieving them. This should be the time required to bring them into the *online* file system
* for *cool* files, we might want the time to bring those files to the *surface* of the file system - i.e., the IO nodes in a typical cluster configuration. If this is usually measured in milliseconds, then there is no need - if it is measured in seconds or more, then we probably want to know about it.
* for *warm* and *hot* files, where those files are located. Again, a given plugin might not know the answer, and that's okay - it only returns info that it knows. The scheduler may already know about them. If nobody knows, then that's okay too - we did our best.
* access denied - we will define a set of error codes to return that will convey some idea of why
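To make the intent concrete, here is a minimal sketch of how such a request might be phrased using the existing `PMIx_Query_info_nb` call and its info callback. The `"pmix.storage.access"` key and the `"pmix.storage.files"` qualifier are hypothetical placeholders - this RFC has not yet defined the actual attribute names, nor decided whether a dedicated storage API would be used instead of the generic query interface.

```c
/* Hedged sketch only: "pmix.storage.access" and "pmix.storage.files" are
 * hypothetical placeholders, not defined PMIx attributes. */
#include <stdio.h>
#include <pmix.h>

/* invoked once all active storage plugins have contributed their answers */
static void access_cbfunc(pmix_status_t status,
                          pmix_info_t *info, size_t ninfo,
                          void *cbdata,
                          pmix_release_cbfunc_t release_fn,
                          void *release_cbdata)
{
    size_t n;

    if (PMIX_SUCCESS == status) {
        /* each returned pmix_info_t would describe one file: temperature,
         * estimated retrieval time, location, or an access-denied code */
        for (n = 0; n < ninfo; n++) {
            fprintf(stderr, "result %zu: key %s\n", n, info[n].key);
        }
    }
    if (NULL != release_fn) {
        release_fn(release_cbdata);  /* let the library free the results */
    }
}

int main(void)
{
    pmix_proc_t myproc;
    pmix_query_t *query;
    pmix_status_t rc;
    char *files = "/project/a/input.dat,/archive/a/history.dat";

    if (PMIX_SUCCESS != (rc = PMIx_Init(&myproc, NULL, 0))) {
        return (int)rc;
    }

    PMIX_QUERY_CREATE(query, 1);
    PMIX_ARGV_APPEND(rc, query[0].keys, "pmix.storage.access");  /* placeholder key */
    /* the file list (and, eventually, a uid/gid or credential) is passed as a qualifier */
    PMIX_INFO_CREATE(query[0].qualifiers, 1);
    query[0].nqual = 1;
    PMIX_INFO_LOAD(&query[0].qualifiers[0], "pmix.storage.files", files, PMIX_STRING);

    rc = PMIx_Query_info_nb(query, 1, access_cbfunc, NULL);

    /* a real caller would wait for the callback before finalizing */
    PMIx_Finalize(NULL, 0);
    return (int)rc;
}
```

Whether the generic query path or a dedicated accessibility API is ultimately chosen, the returned pmix_info_t array is where the per-file temperature, retrieval time, location, and access-denied information described above would be carried.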
---------

##### Queries
Respond to storage-related queries. We will pass the plugins a list of defined attributes indicating the desired info. Some thoughts on possible requests include:
 * Amount of available storage - just a snapshot in time, of course. I expect we'll see refinement of this attribute as people will want total available storage, storage available to a specific uid or gid, available storage of a specified type, etc.
 * Supported storage strategies - e.g., striping patterns, hot/warm/cold management
 * Default storage strategy - what the storage system will do with the data without further direction
 * Unit of reservation - e.g., block size
 * Supported QoS levels - may be none, but that's an okay response
 * Available bandwidth - if we can't specify a QoS level, then can we limit someone's bandwidth within the file system "black box"?
 * Storage system topology - if the storage has burst buffers, in-cluster caching, or other layering under its control, the scheduler would like to know about it. We'll have to think about how to construct this response
 * Support for co-located process placement - e.g., if a Hadoop application wants to execute something local to the file inside the storage, can the storage system support it? I suspect the answer will generally be "no", but this allows some wiggle room - e.g., maybe the storage system can bring the file to a burst buffer location under its control and execute the app there.

---------

##### Execute
Given a list of files, execute the given directives. We may break this into multiple APIs, but the general idea is to provide an application with the ability to direct file system actions. The caller would have the given callback function executed upon completion or error (a sketch of how such directives might be assembled follows the list below). Example action directives:
 * Delete the specified files
 * Move the files to a specified location. We would also allow modifiers to be added to this request - e.g., when to move them (immediately, upon job completion, etc.), movement priority (e.g., QoS or bandwidth to use, or "in background" to minimize jitter), and access limitations once moved
 * Bring the files to the specified storage *temperature*, executing a given callback function when complete. This is mainly intended for bringing *cold* files online (i.e., to *cool* status), but could be extended to direct that files stored on disks be brought into memory caches. Temperatures such as *warm* and *hot* would be accompanied by a location attribute, and additional attributes would define access limitations for the files once they are positioned.
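To illustrate the general shape of such a request, the sketch below assembles a small set of movement directives as pmix_info_t attributes. The directive keys used here ("pmix.storage.files", "pmix.storage.temp", "pmix.storage.bkgd") are hypothetical placeholders, and the call that would consume the array is intentionally omitted, since this RFC has not yet defined the execute API or its signature.

```c
/* Hedged sketch: the directive keys below are hypothetical placeholders.
 * The array would be handed to whatever "execute" entry point this RFC
 * ultimately defines, together with an operation-complete callback. */
#include <stdbool.h>
#include <pmix.h>

static pmix_info_t *build_move_directives(size_t *ndirs)
{
    pmix_info_t *directives;
    bool background = true;

    PMIX_INFO_CREATE(directives, 3);

    /* which files the directives apply to (placeholder key) */
    PMIX_INFO_LOAD(&directives[0], "pmix.storage.files",
                   "/archive/userA/restart.ckpt", PMIX_STRING);

    /* bring the files to *warm* temperature; a location attribute
     * would normally accompany this (placeholder key) */
    PMIX_INFO_LOAD(&directives[1], "pmix.storage.temp", "warm", PMIX_STRING);

    /* perform the movement in the background to minimize jitter (placeholder key) */
    PMIX_INFO_LOAD(&directives[2], "pmix.storage.bkgd", &background, PMIX_BOOL);

    *ndirs = 3;
    return directives;
}
```

The same pattern - an array of pmix_info_t directives plus optional modifiers - would presumably carry the delete and relocation requests as well.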
---------

##### Estimate Cost
Given a list of files and target locations, provide an estimated *cost* (usually the time) for moving the files to those locations. This is something the scheduler folks would like to have for their algorithms - it can impact the selection of nodes, for example. The possible directives match those in the *Execute* section above, as we might want the *cost* of any of those operations in advance of ordering them.

---------

##### Allocate Resources
Given a specified job, allocate a given set of resources for its use. This would include bandwidth, storage size, location, *temperature*, and other possible values. This can be called by the scheduler when prepping a job, or by an application that seeks resources. Modifiers might include directives that the storage system should return an error if it cannot immediately meet the request, provide an estimated time at which the request might be met, or notify the caller when the allocation has been made. Once the allocation is made, the notification will need to include information on how the caller accesses those resources.

---------

##### Specify Policies
Given a list of files (defaulting to all files in the caller's job if NULL), specify storage policies for them. This is different from the above allocation request - we aren't asking for an allocation here, but instead specifying how we want these files handled. Envisioned directives include the ability for applications to influence the location, relocation, and storage strategies (e.g., striping across multiple locations, hot/warm/cold storage) of checkpoints and other data.

---------

##### Notification of Events
Note that we are not asking for RAS data - we just want to know of events that impact the operational ability of the storage system. There are a couple of options here: the storage system could directly generate PMIx events, or it could provide a "hook" whereby the local PMIx server can trap a storage notification (e.g., via a callback or by polling a pipe) and turn it into a PMIx event for the scheduler and others to consume.

---------

## Prototype Implementation
Provide a reference link to the accompanying Pull Request (PR) against the PMIx master repository. If the prototype implementation has been tested against an appropriately modified resource manager and/or client program, then references to those prototypes should be provided. Note that approval of any RFC will be far more likely to happen if such validation has been performed!

## Author(s)
Ralph H. Castain
Intel, Inc.
Github: rhc54