Virtual directories

Unlike a hierarchical POSIX file system, object storage is flat: a forward slash ('/') in an object name is simply another character.

But that's not the entire truth. The other part of it is that a user may want to operate on (i.e., list, load, shuffle, copy, transform, etc.) a subset of objects in a dataset that, for lack of a better word, looks exactly like a directory.

In fact, users often want to do exactly that: train, for instance, on all audio files under en_es_synthetic/v1/train/, or similar.

In object storage, the term for "what looks like a directory" is virtual directory or synthetic directory.

The motivation may become clearer given that the entire real-life dataset contains many millions of objects and numerous virtual directories, including the aforementioned en_es_synthetic/v1/train/.
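To make the notion concrete, here is a minimal Python sketch that derives the virtual directories implied by a flat list of object names. It is a conceptual illustration only, not aistore code; the function name `virtual_dirs` and the sample object names are hypothetical.

```python
def virtual_dirs(names):
    """Return every virtual directory implied by a flat list of object names."""
    dirs = set()
    for name in names:
        # Everything before the last '/' is a chain of virtual directories.
        parts = name.split("/")[:-1]
        for i in range(1, len(parts) + 1):
            dirs.add("/".join(parts[:i]) + "/")
    return sorted(dirs)


# Hypothetical object names, for illustration only.
names = [
    "en_es_synthetic/v1/train/a.wav",
    "en_es_synthetic/v1/train/b.wav",
    "en_es_synthetic/v1/test/c.wav",
    "README",
]

for d in virtual_dirs(names):
    print(d)
```

Note that nothing is stored for a virtual directory itself: it exists only because some object name happens to contain that prefix.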

Needless to say, aistore provides for all of that and more. There is a certain subtlety, however, that is best illustrated with examples.

But first, the rules

  • normally, remote backends do not return virtual directories, with two exceptions:
    • the list-objects operation is non-recursive (API apc.LsNoRecursion in the control message, CLI --nr switch);
    • the bucket in question contains some sort of special directory that shows up anyway (e.g. bucket inventory).
  • list-objects will always return virtual directories, provided that:
    • the corresponding backend's response includes them (see above), and
    • the user does not specify apc.LsNoDirs (CLI --no-dirs).
  • the output is always sorted alphanumerically, directories first.
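The rules above can be approximated with a small toy model in Python. This is a sketch under stated assumptions, not aistore's implementation: the function `list_objects` and its parameters `no_recursion` and `no_dirs` are hypothetical stand-ins for apc.LsNoRecursion and apc.LsNoDirs, and the "special directory" exception (e.g. bucket inventory) is deliberately not modeled.

```python
def list_objects(names, prefix="", no_recursion=False, no_dirs=False):
    """Toy model of list-objects over a flat namespace (not aistore code)."""
    dirs, objs = set(), set()
    for name in names:
        if not name.startswith(prefix):
            continue
        if not no_recursion:
            # Recursive listing: every matching object, no virtual directories.
            objs.add(name)
            continue
        # Non-recursive listing: look only one level below the prefix.
        rest = name[len(prefix):]
        head, sep, _ = rest.partition("/")
        if sep:
            # Object lives deeper: report just its top-level virtual directory.
            dirs.add(prefix + head + "/")
        else:
            objs.add(name)
    if no_dirs:
        dirs = set()
    # Sorted alphanumerically, directories first.
    return sorted(dirs) + sorted(objs)


# Object names taken from the examples in this document.
names = [
    ".inventory/speech/2024-05-31T01-00Z/manifest.checksum",
    ".inventory/speech/2024-05-31T01-00Z/manifest.json",
    ".inventory/speech/data/985fc9cb-5957-4fc8-b26d-092685a747e8.csv.gz",
    ".inventory/speech/data/9dac8de5-cff9-432c-9663-b054ae5ce357.csv.gz",
    ".inventory/speech/hive/dt=2024-05-30-01-00/symlink.txt",
    ".inventory/speech/hive/dt=2024-05-31-01-00/symlink.txt",
]

for entry in list_objects(names, prefix=".inventory/speech/", no_recursion=True):
    print(entry)
```

In this model a non-recursive listing collapses everything below the next '/' into a single directory entry, which is why it is the natural way to explore a large dataset one level at a time.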

Examples

1. Show everything that has a certain prefix

$ ais ls s3://speech --prefix .inventory
NAME                                                                      SIZE            CACHED
.inventory/speech/data/
.inventory/speech/2024-05-31T01-00Z/manifest.checksum                     33B             no
.inventory/speech/2024-05-31T01-00Z/manifest.json                         406B            no
.inventory/speech/data/985fc9cb-5957-4fc8-b26d-092685a747e8.csv.gz        54.14MiB        no
.inventory/speech/data/9dac8de5-cff9-432c-9663-b054ae5ce357.csv.gz        54.14MiB        no
.inventory/speech/hive/dt=2024-05-30-01-00/symlink.txt                    85B             no
.inventory/speech/hive/dt=2024-05-31-01-00/symlink.txt                    85B             no

2. Same as above using familiar *nix notation

$ ais ls s3://speech/.inventory

3. Same as above, without directories

$ ais ls s3://speech --prefix .inventory --no-dirs
NAME                                                                      SIZE            CACHED
.inventory/speech/2024-05-31T01-00Z/manifest.checksum                     33B             no
.inventory/speech/2024-05-31T01-00Z/manifest.json                         406B            no
.inventory/speech/data/985fc9cb-5957-4fc8-b26d-092685a747e8.csv.gz        54.14MiB        no
.inventory/speech/data/9dac8de5-cff9-432c-9663-b054ae5ce357.csv.gz        54.14MiB        no
.inventory/speech/hive/dt=2024-05-30-01-00/symlink.txt                    85B             no
.inventory/speech/hive/dt=2024-05-31-01-00/symlink.txt                    85B             no

4. Show dataset structure at a certain nested depth

$ ais ls s3://speech --prefix .inventory/speech/ --nr
NAME                                       SIZE    CACHED
.inventory/speech/2024-05-31T01-00Z/
.inventory/speech/data/
.inventory/speech/hive/

5. Show virtual directories and data at a certain level

$ ais ls s3://speech --prefix .inventory/speech/data/ --nr
NAME                                                                      SIZE            CACHED
.inventory/speech/data/
.inventory/speech/data/985fc9cb-5957-4fc8-b26d-092685a747e8.csv.gz        54.14MiB        no
.inventory/speech/data/9dac8de5-cff9-432c-9663-b054ae5ce357.csv.gz        54.14MiB        no