Cherry pick and improve from Robbie Morrison's suggestions #52

Merged
merged 10 commits into from
Nov 1, 2023
6 changes: 3 additions & 3 deletions Chapters/1.Scope.md
Original file line number Diff line number Diff line change
@@ -2,7 +2,7 @@

This SoftWare Hash IDentifier (SWHID) specification
defines a standard data format for referencing digital artifacts that
fit in the data model of modern distributed version control systems.
match the data model of modern distributed version control systems.

This includes the typical tree-like structure of a filesystem hierarchy,
but also special nodes to track revisions and releases, as well as the
@@ -11,11 +11,11 @@ branches.

A key property of SWHIDs is that they can be computed using cryptographically
strong functions directly from the digital objects they refer to, by anyone that
has access to a copy of them. This enables decentralised and independent
has access to a copy of those objects. This enables decentralised and independent
verification of integrity, without relying on a registry or a central authority.

The computation of the SWHID identifiers is based on Merkle Acyclic Directed
Graphs, a natural generalization of Merkle trees.

The resolution of SWHIDs, i.e. the process of obtaining a copy of a digital
The resolution of SWHIDs, that is, the process of obtaining a copy of a digital
artifact corresponding to a given SWHID, is out of the scope of this specification.
21 changes: 17 additions & 4 deletions Chapters/3.Terms_and_definitions.md
@@ -1,3 +1,4 @@
t
# 3 Terms and definitions

For the purposes of this document,
@@ -21,15 +22,15 @@ Git is a distributed version control system created by Linus Torvalds in 2005. I

## 3.3 hierarchical file system

A hierarchical file system is a method of organizing and managing files in a computer where data is stored hierarchically (in a structure often visualized as a tree). It uses directories (or 'folders') to organize files into a tree structure. Each directory can contain more files and directories, thus forming a hierarchical structure.
A hierarchical file system is a method of organizing and managing files in a computer where data is stored hierarchically. It uses directories (or 'folders') to organize files into a tree structure. Each directory can contain more files and directories, thus forming a hierarchical structure.

## 3.4 intrinsic identifier

An identifier that can be computed directly from the object that it identifies, without needing a registry. Typical examples are cryptographically strong hashes.
An identifier that can be computed directly from the object that it identifies, without needing access to a registry. Typical examples are cryptographically strong hashes.

## 3.5 repository

In the context of version control systems, a repository is a storage location for software development artifacts including but not limited to source code, build scripts, documentation, etc. It often includes metadata about the stored items, such as version number, author, date of the last modification, etc. Repositories can be local or remote and are managed by version control systems like Git.
In the context of version control systems, a repository is a storage location for software development artifacts including but not limited to source code, build scripts, and documentation. It often includes metadata about the stored items, such as version number, author, and date of the last modification. Repositories can be local or remote and are managed by version control systems like Git.

## 3.6 SHA1

@@ -49,4 +50,16 @@ Note that in most cases SHA1 in this specification are computed on objects after

## 3.7 version control system

A version control system (VCS), also known as source control or revision control, is a software tool that helps manage different versions of software development artifacts. It keeps track of all changes made to the code, allows multiple developers to work on the same codebase, and provides mechanisms for merging changes, reverting changes, and branching and merging of code. Examples include Git, Mercurial, and Subversion.
A version control system (VCS), also known as source control or revision control, is a software tool that helps manage different versions of software development artifacts. It keeps track of all changes made to the code, allows multiple developers to work on the same codebase, and provides mechanisms for merging changes, reverting changes, and the branching and merging of code. Examples include Git, Mercurial, and Subversion.

## 3.8 software object, software artifact

A software object, also referred to as a software artifact, represents a distinct entity identifiable by a SWHID. This entity can be as granular as a single line of code within a source file or as expansive as an entire codebase comprising multiple source files. In addition to source files, a software object can also be a binary file resulting from code compilation or multiple binary files linked together to produce an executable file.

## 3.9 metadata

Within the context of this specification, metadata refers to supplementary information associated with a software object. It serves to provide a deeper understanding of the object by detailing attributes such as the programming language used, its functionality, or its dependencies. Metadata can also enumerate the individuals involved in the software's development, elucidate its licensing terms, offer a record of version history, and more. Essentially, metadata encapsulates the broader context, provenance, and attributes of the software object, ensuring a comprehensive understanding of its nature and purpose.

## 3.10 UNIX epoch

The UNIX epoch is a time reference point that denotes the precise moment at 00:00:00 Coordinated Universal Time (UTC) on 1 January 1970. In UNIX-based systems, time is often represented as the total number of seconds that have transpired since this specific moment. This convention is widely used in computing for time-stamping and date-time representations.
2 changes: 1 addition & 1 deletion Chapters/4.Syntax.md
@@ -53,5 +53,5 @@ The last two symbols are defined as:

In both of these, all occurrences of `;` (and `%`, as required by the RFC)
have been percent-encoded (as `%3B` and `%25` respectively). Other
characters *may* be percent-encoded, e.g., to improve readability and/or
characters *may* be percent-encoded, for example, to improve readability and/or
embeddability of SWHID in other contexts.
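
The escaping rule in this hunk can be sketched in Python as follows. This is only an illustration: the helper names `escape_qualifier_value` and `unescape_qualifier_value` are hypothetical and do not come from the specification, and the sketch encodes only the two characters the text names as mandatory.

```python
from urllib.parse import unquote


def escape_qualifier_value(value: str) -> str:
    """Percent-encode '%' and ';' (hypothetical helper, not part of the spec)."""
    # Encode '%' first, so the '%' introduced by '%3B' is not encoded again.
    return value.replace("%", "%25").replace(";", "%3B")


def unescape_qualifier_value(value: str) -> str:
    """Reverse the encoding; also decodes any optional percent-escapes."""
    return unquote(value)
```

Encoding `%` before `;` matters: in the opposite order, the `%` of each freshly produced `%3B` would itself be turned into `%25`.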
14 changes: 7 additions & 7 deletions Chapters/5.Core_identifiers.md
@@ -26,7 +26,7 @@ computed from the content and relevant metadata of the object.

A *content* is an uninterpreted byte sequence, typically, the content of a file.
For this type of object the intrinsic identifier is the `sha1_git` hash of it,
i.e. the SHA1 of the byte sequence obtained by juxtaposing
that is, the SHA1 of the byte sequence obtained by juxtaposing:

- the ASCII string `"blob"` (4 bytes),
- an ASCII space,
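
As a rough illustration of the `sha1_git` computation that this list begins to describe, here is a minimal Python sketch. The rest of the list is not visible in this hunk, so the sketch assumes the remaining items follow Git's blob object format (the content length in ASCII decimal digits, a NUL byte, then the content itself). The function names are hypothetical.

```python
import hashlib


def content_object_id(data: bytes) -> str:
    """Return the hex sha1_git of a byte sequence (assumed Git blob layout)."""
    # Assumption: header is "blob", a space, the decimal length, and a NUL byte.
    header = b"blob " + str(len(data)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


def content_swhid(data: bytes) -> str:
    """Build a core SWHID for a content object from its bytes."""
    return "swh:1:cnt:" + content_object_id(data)
```

Under that assumption, `content_swhid(b"")` gives `swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`, which matches the identifier Git assigns to an empty blob.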
@@ -79,7 +79,7 @@ a given point in time of its development on May 4th 2017.

Software development within a specific project is essentially a time-indexed series of copies of a single “root” directory that contains the entire project source code. Software evolves when a developer modifies the content of one or more files in that directory and records their changes.

Each recorded copy of the root directory is known as a “revision”. It points to a single fully-determined directory and is equipped with arbitrary metadata. Some of those are added manually by the developer (e.g., revision message), others are automatically synthesized (timestamps, parent revision(s), etc).
Each recorded copy of the root directory is known as a “revision”. It points to a single fully-determined directory and is equipped with arbitrary metadata. Some of those are added manually by the developer (for example, a revision message), others are automatically synthesized (timestamps, parent revision(s), and so forth).

The supported metadata is as follows:

@@ -148,9 +148,9 @@ As an example, `swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d` is the SWHID

## 5.4 Releases

Some revisions get selected by developers as denoting important project milestones known as “releases”. Each release points to the last commit in project history corresponding to the release and carries metadata: release name and version, release message, cryptographic signatures, etc. If they're not attached to development history (e.g. if they've been imported from bare tarballs), releases can also point directly to a root directory instead of a full revision with metadata.
Some revisions get selected by developers as denoting important project milestones known as “releases”. Each release points to the last commit in project history corresponding to that release and carries metadata: release name and version, release message, cryptographic signatures, and so forth. If they're not attached to development history (for instance, if they've been imported from bare archive files), releases can also point directly to a root directory instead of a full revision with metadata.

The supported metadata is as follows:
The metadata fields supported by SWHID are as follows:
- name (arbitrary byte sequence, mandatory): a name identifying the release
- author (arbitrary byte sequence): generally contains the name and email address of the author of the release.
- author timestamp (decimal timestamp from the Unix epoch): the date at which the release was authored.
@@ -202,9 +202,9 @@ the [Darktable release 2.3.0](https://archive.softwareheritage.org/swh:1:rel:22e

## 5.5 Snapshots

Any kind of software origin offers multiple pointers to the “current” state of a development project. In the case of VCS this is reflected by branches (e.g., master, development, but also so called feature branches dedicated to extending the software in a specific direction); in the case of package distributions by notions such as suites that correspond to different maturity levels of individual packages (e.g., stable, development, etc.).
Any kind of software origin offers multiple pointers to the “current” state of a development project. In the case of VCS this is reflected by branches (for instance, master, development, but also so called feature branches dedicated to extending the software in a specific direction); in the case of package distributions by notions such as suites that correspond to different maturity levels of individual packages (for example, stable, development, and so forth).

A “snapshot” of a given software origin records all entry points found there and where each of them was pointing at the time. For example, a snapshot object might track the commit where the master branch was pointing to at any given time, as well as the most recent release of a given package in the stable suite of a FOSS distribution.
A “snapshot” of a given software origin records all entry points found there and where each of them was pointing at the time. For example, a snapshot object might track the commit where the master branch was pointing to at any given time, as well as the most recent release of a given package in the stable suite of a free and open source software (FOSS) distribution.

Practically, a snapshot is a list of named branches pointing at objects of any of the known types (content, directory, revision, release or snapshot). A branch can also be an alias to another (named) branch, for instance the default `"HEAD"` branch can point at another, more specific, `"refs/heads/main"` branch.
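
The alias mechanism described above can be sketched as follows. The in-memory representation (a dictionary mapping each branch name to a `(target_type, target)` pair, with `"alias"` as a target type) is an assumption made for illustration only, and the function name `resolve_branch` is hypothetical.

```python
def resolve_branch(branches: dict[str, tuple[str, str]], name: str) -> tuple[str, str]:
    """Follow alias branches until a non-alias target is reached."""
    seen = set()
    while True:
        if name in seen:
            # Guard against aliases that point back at each other.
            raise ValueError(f"alias cycle detected at branch {name!r}")
        seen.add(name)
        target_type, target = branches[name]
        if target_type != "alias":
            return target_type, target
        name = target  # for an alias, the target is another branch name


# Hypothetical snapshot with the default "HEAD" aliasing "refs/heads/main".
example_branches = {
    "HEAD": ("alias", "refs/heads/main"),
    "refs/heads/main": ("revision", "309cf2674ee7a0749978cf8265ab91a60aea0f7d"),
}
```

With this data, `resolve_branch(example_branches, "HEAD")` follows the alias and returns the revision that `refs/heads/main` points at.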

@@ -249,7 +249,7 @@ proceeds for
[computing identifiers](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects) for
its objects. The `<object_id>` part of a SWHID for a content object is the Git
blob identifier of any file with the same content; for a revision it is the Git
commit identifier for the same revision, etc. This is not the case for snapshot
commit identifier for the same revision, and so forth. This is not the case for snapshot
identifiers, as Git does not have a corresponding object type.

Git compatibility is practical, but incidental and is not guaranteed to be
6 changes: 3 additions & 3 deletions Chapters/6.Qualified_identifiers.md
@@ -36,7 +36,7 @@ by ignoring the `lines` qualifier when the `bytes` qualifier is present.
A "line" in the context of a file content refers to a sequence of characters that ends with a line break. This line can contain text, code, or any other form of data. In this specification, the line break is the ASCII LF character.
The "lines" qualifier designates a line range inside a content.
The range can be a single line number, or a pair of line numbers separated by the ASCII `-` character.
Line numbers start from 1, and the range is inclusive, i.e. the fragment includes both the lines numbered as the start and end of the range.
Line numbers start from 1, and the range is inclusive, that is, the fragment includes both the lines numbered as the start and end of the range.

For example, [`swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;lines=9-15`](https://archive.softwareheritage.org/swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;lines=9-15)
designates the function `generate_input_stream` that is found at lines 9 to 15 of the *content* with core SWHID `swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b`.
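
A minimal Python sketch of how a `lines` range could be applied to a content, following the rules stated above (LF as the only line break, numbering from 1, inclusive range). The function names are hypothetical, and the handling of a final line without a trailing LF is an assumption of this sketch.

```python
def split_lf_lines(content: bytes) -> list[bytes]:
    """Split on ASCII LF only, keeping each line break with its line."""
    parts = content.split(b"\n")
    lines = [part + b"\n" for part in parts[:-1]]
    if parts[-1]:
        # Assumption: trailing bytes without a final LF count as a last line.
        lines.append(parts[-1])
    return lines


def select_lines(content: bytes, spec: str) -> bytes:
    """Return the fragment for a spec such as "9-15" or "9" (1-based, inclusive)."""
    start_text, _, end_text = spec.partition("-")
    start = int(start_text)
    end = int(end_text) if end_text else start
    if start < 1 or end < start:
        raise ValueError(f"invalid line range: {spec!r}")
    return b"".join(split_lf_lines(content)[start - 1:end])
```

`bytes.splitlines()` is deliberately avoided here, because it also treats CR as a line boundary, which would not match the LF-only definition above.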
@@ -48,7 +48,7 @@ may be a binary file, or a file that uses non standard line termination characte

To overcome the limitations of the lines qualifier, the bytes qualifier allows
designation of a byte range inside a content. The range can be a single byte number, or a pair of byte numbers separated by `-`.
Byte numbers start from 0, and the range is inclusive, i.e. the fragment includes both the bytes numbered as the start and end of the range.
Byte numbers start from 0, and the range is inclusive, that is, the fragment includes both the bytes numbered as the start and end of the range.
If the range is a single byte number, it designates the byte at that specific position.
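
The byte-range rule can be sketched in the same way. The function name `select_bytes` is hypothetical. Byte numbers start from 0 and the end of the range is included, so the slice has to extend one position past `end`.

```python
def select_bytes(content: bytes, spec: str) -> bytes:
    """Return the fragment for a spec such as "154-315" or "154" (0-based, inclusive)."""
    start_text, _, end_text = spec.partition("-")
    start = int(start_text)
    end = int(end_text) if end_text else start
    if start < 0 or end < start:
        raise ValueError(f"invalid byte range: {spec!r}")
    # Python slices exclude the upper bound, so add 1 to include byte `end`.
    return content[start:end + 1]
```

For instance, `select_bytes(b"abcdef", "1-3")` yields `b"bcd"`, and a single number such as `"0"` yields the one byte at that position.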

For example, `swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;bytes=154-315`
Expand All @@ -70,7 +70,7 @@ For example, [`swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https:/
indicates that the content seen previously with the function `generate_input_stream` has
been seen in the Git repository at `https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git`

This qualifier may be helpful to get hold of the full repository where a
This qualifier may help to get hold of the full repository where a
content has been found, but there is no guarantee of success, as an origin
can change or disappear over time (as is the case in the example above, since
gitorious.org was shut down in 2015).
4 changes: 2 additions & 2 deletions Chapters/index.md
@@ -1,8 +1,8 @@
# The SWHID Specification Version 1.1

Copyright © 2022-2023 SWHID Contributors.
Copyright © 2022–2023 SWHID Contributors.

This work is licensed under the Community Specification License 1.0.
This work is licensed under the [Community Specification License 1.0](https://spdx.org/licenses/Community-Spec-1.0.html) (Community‑Spec‑1.0).

With thanks to
Alexios Zavras,