\documentclass[12pt,oneside]{article}
\usepackage[utf8]{inputenc}
\usepackage{float}
\usepackage[bottom]{footmisc}
\usepackage{bookmark}
\usepackage{microtype}
\usepackage{amsmath}
\usepackage{multicol}
\usepackage{mdframed}
\usepackage{setspace}
\usepackage{pgfplots}
\usepackage{graphicx}
\usepackage{fancyvrb}
\usepackage[absolute]{textpos}\TPGrid{16}{16}
\usepackage{tikz}
\usetikzlibrary{shapes}
\usetikzlibrary{arrows.meta}
\usetikzlibrary{arrows}
\usetikzlibrary{shadows}
\usetikzlibrary{trees}
\usetikzlibrary{fit}
\usetikzlibrary{calc}
\usetikzlibrary{positioning}
\usetikzlibrary{decorations.pathmorphing}
\usepackage{./tikz-uml}
\usepackage{everypage}
\AddEverypageHook{
\begin{textblock}{0.5}[0,0](0,0)
\tikz \node[fill=myred,minimum width=0.5\TPHorizModule,minimum height=16\TPVertModule] {};
\end{textblock}
\begin{textblock}{0.125}[0,0](0.5,0)
\tikz \node[fill=myblack,inner sep=0, minimum width=0.125\TPHorizModule,minimum height=16\TPVertModule] {};
\end{textblock}
}
\usepackage{xcolor}
\definecolor{firebrick}{HTML}{B22222}
\definecolor{myred}{HTML}{CF0A2C}
\definecolor{myblack}{HTML}{232527}
\newcommand\dd[1]{\colorbox{gray!30}{\texttt{#1}}}
\usepackage{hyperref}
\hypersetup{colorlinks=true,allcolors=blue!40!black}
\setlength{\topskip}{6pt}
\setlength{\parindent}{0pt} % indent first line
\setlength{\parskip}{6pt} % before par
% \let\oldsection\section\renewcommand\section{\newpage\oldsection}
\date{\small\today}
\title{%
Binary Repository Manager\\
\colorbox{firebrick}{\small\sffamily\color{white}{White Paper}}}
\usepackage[style=authoryear,sorting=nyt,backend=biber,
hyperref=true,abbreviate=true,
maxcitenames=1,maxbibnames=1]{biblatex}
\renewbibmacro{in:}{}
\addbibresource{books.bib}
\tikzset{node distance=1.6cm, auto, every text node part/.style={align=center, font={\sffamily\small}}}
\tikzstyle{block} = [draw=myblack, fill=white, inner sep=0.3cm, outer sep=0.1cm, thick]
\tikzstyle{ln} = [draw, ->, very thick, arrows={-triangle 90}, every text node part/.append style={font={\sffamily\scriptsize}}]
\begin{document}
\raggedbottom
\maketitle
\begin{abstract}
A software project of almost any size needs to keep its binary artifacts
in a repository, so that programmers, tools, and other teams can access them.
The quality of the software that manages the repository matters. Several
categories of such software are currently on the market, each with its own
pros and cons. However, none of them fully satisfies
the requirements of a large group of software companies.
That is why a new product is being created.
\end{abstract}
% \onehalfspace
\section{Introduction}
\href{https://en.wikipedia.org/wiki/Binary_repository_manager}{Binary Repository Manager}
(BRM), according to Wikipedia, is ``a software tool designed to optimize the download and storage of
binary files used and produced in software development,'' for example
\dd{.jar} or \dd{.zip} archives. A BRM is a critical component of
most \href{https://en.wikipedia.org/wiki/DevOps_toolchain}{DevOps toolchains}~\parencite{erich2018},
residing right after the build pipeline, which is why it is sometimes
called ``build repository'', ``artifact repository'', or ``pipeline state repository''~\parencite{bass15}.
A traditional DevOps pipeline, as explained by~\textcite{humble2010}, expects
the source code to be validated, tested, packaged and versioned automatically
into an \emph{artifact} (a binary file).
Then, the artifact must be stored outside of the source
code repository and become available for later stages of the continuous
delivery pipeline. The BRM is supposed to host these artifacts,
being ``a central point for management of binaries and dependencies,
and an integrated depot for build promotions of internally developed software,''
as noted by~\textcite{davis2016}.
\begin{figure}[H]
\centering
\begin{tikzpicture}
\node[block, minimum width=4cm] (src) {Source code};
\node[block, below=of src, minimum width=4cm] (commit) {\textless commit\textgreater};
\node[fit=(commit), draw, dashed, inner sep=0.3cm, label={[label distance=0.1cm]0:Repository (git)}] (repo) {};
\node[block, below=of commit, minimum width=4cm] (build) {.jar};
\node[fit=(build), draw, dashed, inner sep=0.3cm, label={[label distance=0.1cm]0:CI/CD (build)}] (cicd) {};
\node[block, below=of build, minimum width=4cm] (artifact) {com.artipie:artipie:1.0};
\node[fit=(artifact), draw, dashed, inner sep=0.3cm, label={[label distance=0.1cm]0:BRM (Artipie)}] (brm) {};
\node[block, below=of artifact, minimum width=4cm] (users) {Users};
\path[ln] (src) edge node {push} (commit);
\path[ln] (commit) edge node {trigger} (build);
\path[ln] (build) edge node {deploy} (artifact);
\draw[ln] (artifact.west) -- ([xshift=-10mm]artifact.west) -- node[left] {depends} ([xshift=-10mm]src.west) -- (src.west);
\draw[ln] (artifact.south) -- node[right] {uses} (users);
\end{tikzpicture}
\caption{The delivery pipeline of an average software development team.}
\label{fig:map}
\end{figure}
Section~\ref{sec:requirements} lists a few categories of existing
BRM solutions, analyzes the requirements their customers may have, and emphasizes
the most important functional features and non-functional requirements.
\section{Requirements}
\label{sec:requirements}
All existing BRM solutions can be categorized as
public, commercial, hosted, open source, or surrogate.
Even though each of
them partially satisfies the needs of a professional software team, none
of them is perfect.
\begin{description}
\item[Public]
There are a few hosted BRMs for different programming languages, like
\href{https://search.maven.org/}{Maven Central} for Java or
\href{https://www.rubygems.org}{Rubygems} for Ruby, which are free to use,
but do not allow private accounts. This means that all artifacts deployed
by some user become available for all other users. This business model is
acceptable for open source projects, but is not suitable for software teams
that develop proprietary software products.
\item[Commercial]
There are a few BRMs, like
\href{https://jfrog.com/artifactory/}{Artifactory}/\href{https://jfrog.com/bintray/}{Bintray}
of \href{https://jfrog.com}{JFrog}
and \href{https://www.sonatype.com/nexus-repository-oss}{Nexus}
of \href{https://www.sonatype.com/}{Sonatype},
which provide most of the features required by software teams, including
fine-grained access control, versioning, seamless integration with build
automation software, and many more. However, these tools are pretty expensive\footnote{%
The annual cost of a license for a mid-size team of 50--100 developers is:
\href{https://jfrog.com/pricing/}{around} \$30,000 for Artifactory
and
\href{https://www.sonatype.com/product-pricing}{around} \$50,000 for Nexus.
There are less expensive products too, for example
\href{https://inedo.com/proget/pricing}{ProGet} for \$10,000.
}
and require certain skills to install and manage. Moreover, their
vendors (including JFrog and Sonatype) are US-based companies, which may be
restricted from selling their products to software teams from certain ``sanctioned'' countries\footnote{%
\href{https://techcrunch.com/2019/07/29/github-ban-sanctioned-countries/}{29th of July, 2019}:
GitHub, the world's largest host of source code, started blocking users in Iran, Syria, and Crimea.
\href{https://techcrunch.com/2018/12/22/slack-says-it-will-comply-with-sanctions/}{22nd of December, 2018}:
Slack confirmed it will block all activity in Iran and other sanctioned countries.
}.
\item[Hosted]
There are a few BRMs, which maintain artifacts on their servers,
like \href{https://www.cloudrepo.io/pricing.html}{CloudRepo} for Java
or \href{https://pydist.com/}{PyDist} for Python.
Some BRM creators, like JFrog, provide their products in hosted versions too.
However, some software teams may not find this option acceptable,
either because of sanctions or for security reasons: if the provider
goes out of business, the data may eventually be lost\footnote{%
\href{https://www.theverge.com/2015/3/13/8206903/google-code-is-closing-down-github-bitbucket}{13th of March, 2015}:
Google Code, one of the largest source code repository managers, closed its doors.
}.
\item[Open source]
There are also a few entirely free and open source products, like
\href{https://archiva.apache.org/index.cgi}{Archiva}, which software
teams must install, configure, and use at their own risk. Even though
this may sound like a good solution for a small team, it may not be
acceptable for a larger group of software developers, who expect their
artifact repository to be reliable and available.
\item[Surrogate]
It is possible to organize a BRM without any dedicated software,
\href{https://www.yegor256.com/2015/09/07/maven-repository-amazon-s3.html}{for example},
on top of \href{https://aws.amazon.com/s3/}{Amazon S3}
or a simple FTP server. With the right plugin,
Maven can deploy to Amazon S3 and then fetch artifacts from there
via its built-in HTTP interface. However, such a solution gives
a DevOps engineer very little or no control and may only
work for rather small software teams.
\end{description}
\subsection{Features}
\label{sec:features}
There are many qualities and features that software developers and DevOps
engineers expect a BRM to have in order for it to be useful in a continuous
delivery pipeline. The most critical
\href{https://en.wikipedia.org/wiki/Non-functional_requirement}{non-functional requirements}
are:
\begin{description}
\item[Integrability]
There are plenty of build automation tools for each programming language,
like \href{https://maven.apache.org/}{Maven} for Java,
\href{https://www.npmjs.com/}{Npm} for JavaScript, or
\href{https://github.com/ruby/rake}{Rake} for Ruby.
There are also many continuous integration tools, like
\href{https://jenkins.io/}{Jenkins} or \href{https://travis-ci.org/}{Travis}.
Since automation is the most important aspect of DevOps, as noted by~\textcite{kerzazi2016},
the BRM is expected to integrate seamlessly with most or all of these
tools through plugins.
\item[Availability]
Artifacts are important components of a software development process;
they must be available exactly when a programmer
or a build tool needs them, without delays, and delivered at the highest
possible speed.
\item[Scalability]
Most build artifacts are large binary files. Some of them may even
be larger than 1~GB, for example \href{https://www.docker.com/}{Docker}
images or production-ready \dd{.war} Java archives.
The BRM must be able to maintain large data sets with virtually no upper limit.
\item[Extensibility]
It is highly desirable to have full access to the source code of the
BRM and to be able to extend it with new plugins and modules. Moreover,
vendor independence is important.
\item[Reliability]
The possibility of data corruption due to software or hardware failures must be
eliminated as much as possible.
\end{description}
\subsection{Functional Requirements}
\label{sec:nfr}
The most important \href{https://en.wikipedia.org/wiki/Functional_requirement}{functional requirements} are:
\begin{description}
\item[Versions and Tags]
New artifacts must not replace previously deployed ones.
Instead, older versions must always be accessible. However,
the BRM is not expected to assign version tags automatically;
this is done on the pipeline side.
\item[Access Control]
Larger teams may need to control who is allowed to use certain artifacts.
Moreover, such teams may require integration of the BRM's authentication
mechanisms with the existing enterprise access-control system,
for example via \href{https://en.wikipedia.org/wiki/Lightweight_Directory_Access_Protocol}{LDAP}.
On top of regular access control, encryption mechanisms must
be in place to prevent data leakage in case of software or hardware
failures or human mistakes, as noted by~\textcite{paule2018}.
\item[Analytics]
Traceability between software artifacts is considered a very
important factor in today's development process, as noted by~\textcite{palihawadana2017}.
A BRM must make it possible to visualize dependencies between artifacts and
operate on them.
\end{description}
Many other essential features are required as well, including
authentication and authorization, deployment, publishing,
download, removal, usage tracking, email notifications, mirroring,
and so on.
\subsection{Comparison with existing solutions}
\label{ref:comparation}
\begin{tabular}{| l | p{3cm} | p{3cm} | p{3cm} |}
\hline
Feature & Artipie & Artifactory & Nexus \\
\hline
Cloud storage providers & Clouds with an S3-compatible API & AWS S3, Google Cloud, Azure & No \\
\hline
Supported repository types
& Maven, RPM, NPM, Docker, NuGet, PHP-composer, binary files, PyPi, Go, Gem
& Bower, Chef, CocoaPods, Conan, Conda, CRAN, Debian, Docker, Git LFS, Go, Helm, Maven, NPM, NuGet, Opkg, P2, PHP-composer, Puppet, PyPi, RPM, Gem, SBT, Vagrant, VCS
& Bower, Docker, NPM, PyPI, Raw, RubyGems, Yum* \\
\hline
Installation and maintenance
& Easy to install via a Docker image or to deploy as a cluster
& Can be installed locally, but requires a team of system administrators to support the cluster
& Same as Artifactory \\
\hline
Performance & TODO & TODO & TODO \\
\hline
\end{tabular}
* Nexus can use external plugins to support more repository types.
\subsection{Expected Metrics}
\label{ref:metrics}
In a large enterprise, the following numbers are expected
in terms of load, size, and speed:
\begin{tabular}{ll}
Users, total & 80K \\
Artifacts hosted & 100M \\
New artifacts uploaded, daily & 10K \\
Data hosted & 100~TB \\
Data uploaded, daily & 10~GB \\
Concurrent connections, peak & 10K \\
Upload bandwidth, peak & 10~MB/s \\
Download bandwidth, peak & 100~MB/s \\
\end{tabular}
Smaller companies may have lower expectations.
\section{Architecture}
The architecture consists of four essential parts:
\begin{enumerate}
\item Artipie HTTP engine
\item Authorization and authentication layer
\item Repositories
\item Storage
\end{enumerate}
\begin{tikzpicture}
\node[block] (repo-2) {Repo};
\node[block, left=of repo-2] (repo-1) {Repo};
\node[block, right=of repo-2] (repo-3) {Repo};
\node[fit=(repo-1)(repo-2)(repo-3), draw, dashed, inner sep=1cm, label={[right=10cm, below=0.1cm]1cm:Repositories}] (repos) {};
\node[block, above=of repos, minimum width=7cm] (auth) {authentication};
\node[block, above=of auth, minimum width=7cm] (http) {HTTP layer, routing and dispatching};
\node[block, below=of repo-1] (storage-1) {Storage};
\node[block, below=of repo-2] (storage-2) {Storage};
\node[block, below=of repo-3] (storage-3) {Storage};
\path[ln] (repo-1) -- (storage-1);
\path[ln] (repo-2) -- (storage-2);
\path[ln] (repo-3) -- (storage-3);
\path[ln] (http) -- (auth);
\path[ln] (auth) edge node {routes} (repos);
\draw (repos.north) -- (repo-1.north);
\draw (repos.north) -- (repo-2.north);
\draw (repos.north) -- (repo-3.north);
\end{tikzpicture}
\subsection{Design considerations}
All Artipie components are built on reactive, asynchronous, non-blocking,
back-pressured streams and the corresponding programming principles,
allowing Artipie to withstand heavy workloads with a small
number of kernel threads.
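As an illustration of this principle, the following minimal sketch shows a Java~9 \dd{Flow.Subscriber} that consumes an artifact's byte stream with back pressure, requesting one chunk at a time so that memory consumption stays constant regardless of the artifact size. The class is hypothetical and not part of the Artipie code base:
\begin{Verbatim}[tabsize=2]
import java.nio.ByteBuffer;
import java.util.concurrent.Flow;

// Hypothetical subscriber: persists each chunk (e.g. to a file or an S3 part)
// and only then requests the next one, so the publisher never overruns memory.
final class ChunkSubscriber implements Flow.Subscriber<ByteBuffer> {
    private Flow.Subscription subscription;

    @Override
    public void onSubscribe(final Flow.Subscription sub) {
        this.subscription = sub;
        sub.request(1L); // ask for the first chunk only
    }

    @Override
    public void onNext(final ByteBuffer chunk) {
        // ... persist the chunk (omitted) ...
        this.subscription.request(1L); // back pressure: one chunk at a time
    }

    @Override
    public void onError(final Throwable err) {
        err.printStackTrace();
    }

    @Override
    public void onComplete() {
        // the whole stream has been consumed
    }
}
\end{Verbatim}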
\subsection{Artipie engine}
\label{sec:arti-engine}
The Artipie engine is a Java application that exposes an HTTP endpoint for repository access and management operations.
It routes HTTP requests to repositories and provides authentication mechanisms for them.
Each repository encapsulates a storage API to access binary blobs and metadata files.
\begin{tikzpicture}[]
\node[block, minimum width=3cm] (repo) {Repository};
\node[block, minimum width=3cm, right=of repo] (storage) {Storage};
\matrix[row sep=10mm, column sep=10mm, yshift=-2cm]{
\node[block, minimum width=3cm] (rpm) {RPM}; &
\node[block, minimum width=3cm] (npm) {NPM}; &
\node[block, minimum width=3cm] (maven) {Maven}; \\
};
\matrix[row sep=10mm, column sep=10mm, yshift=-4cm]{
\node[block, minimum width=3cm] (nuget) {Nuget}; &
\node[block, minimum width=3cm] (composer) {PHP}; &
\node[block, minimum width=3cm] (pypi) {PyPi}; &
\node[block, minimum width=3cm] (docker) {Docker}; \\
};
\draw[-latex] (repo) -- (storage);
\draw[-latex] (npm.north) -- (repo.south);
\draw (rpm.north) |- (270:10mm) -| ([yshift=-1pt]repo.south);
\draw (maven.north) |- (270:10mm) -| ([yshift=-1pt]repo.south);
\draw (nuget.north) |- (270:10mm) -| ([yshift=-1pt]repo.south);
\draw (composer.north) |- (270:10mm) -| ([yshift=-1pt]repo.south);
\draw (pypi.north) |- (270:10mm) -| ([yshift=-1pt]repo.south);
\draw (docker.north) |- (270:10mm) -| ([yshift=-1pt]repo.south);
\end{tikzpicture}
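To make the routing step concrete, here is a minimal sketch of how the engine could dispatch an incoming request to a repository by the first segment of the URI path, once the authentication layer has accepted it. The class and method names are illustrative assumptions, not Artipie's actual API:
\begin{Verbatim}[tabsize=2]
import java.util.Map;

// Placeholder for the repository abstraction shown in the diagram above.
interface Repository {
}

// Hypothetical router: picks a repository by the first path segment, so that
// "PUT /maven/com/acme/app/1.0/app.jar" is handled by the "maven" repository.
final class RepoRouter {
    private final Map<String, Repository> repos;

    RepoRouter(final Map<String, Repository> repos) {
        this.repos = repos;
    }

    Repository route(final String path) {
        final String name = path.replaceAll("^/+", "").split("/")[0];
        final Repository repo = this.repos.get(name);
        if (repo == null) {
            throw new IllegalArgumentException("Unknown repository: " + name);
        }
        return repo;
    }
}
\end{Verbatim}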
There are various storage implementations, such as:
\begin{enumerate}
\item File system storage
\item S3 based storage
\item In-memory storage
\end{enumerate}
\begin{tikzpicture}[]
\node[block] (storage) {Storage};
\node[block, minimum width=2cm, below=of storage] (s3) {S3};
\node[block, minimum width=2cm, right=of s3] (fs) {FS};
\node[block, minimum width=2cm, left=of s3] (im) {IM};
\draw[-latex] (s3.north) -- (storage.south);
\draw (fs.north) |- (270:10mm) -| ([yshift=-1pt]storage.south);
\draw (im.north) |- (270:10mm) -| ([yshift=-1pt]storage.south);
\end{tikzpicture}
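Since all backends implement the same contract, code written against the storage abstraction works with any of them; the in-memory backend, for instance, is convenient for unit tests. The following deliberately simplified, synchronous sketch uses hypothetical names rather than the actual asto API:
\begin{Verbatim}[tabsize=2]
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified synchronous storage contract, for illustration only.
interface BlobStorage {
    void put(String key, byte[] value);
    byte[] get(String key);
}

// Hypothetical in-memory backend: keeps blobs in a map, handy for unit tests.
final class InMemoryStorage implements BlobStorage {
    private final Map<String, byte[]> blobs = new ConcurrentHashMap<>();

    @Override
    public void put(final String key, final byte[] value) {
        this.blobs.put(key, value.clone());
    }

    @Override
    public byte[] get(final String key) {
        final byte[] value = this.blobs.get(key);
        if (value == null) {
            throw new IllegalArgumentException("No blob for key: " + key);
        }
        return value.clone();
    }
}
\end{Verbatim}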
The common data flow for an Artipie upload is as follows: the client sends a binary artifact
to the server, the server finds the repository responsible for processing the request,
the repository saves the stream to the storage, and once the upload completes it updates
the repository metadata. This is the most common scenario; some repositories work differently,
e.g.\ Docker uses metadata as part of the path.
\begin{center}
\begin{tikzpicture}
\begin{umlseqdiag}
\umlactor[class=Client]{mvn}
\umlobject[class=Server]{Artipie}
\umlobject[class=Repo]{Maven}
\umlobject[class=Storage]{S3}
\begin{umlcall}[op={mvn:deploy}, type=synchron, return={200 OK}]{mvn}{Artipie}
\begin{umlcall}[op={request}, type=synchron, return={response}]{Artipie}{Maven}
\begin{umlcall}[op={upload}, type=asynchron, return={complete}]{Maven}{S3}
\end{umlcall}
\begin{umlcall}[op={update metadata}, type=asynchron, return={complete}]{Maven}{S3}
\end{umlcall}
\end{umlcall}
\end{umlcall}
\end{umlseqdiag}
\end{tikzpicture}
\end{center}
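Expressed in code, the same flow could look roughly like the following sketch: the repository first streams the artifact into the storage and, only after that operation completes, regenerates the repository metadata. The class and method names are assumptions for illustration, not the actual adapter API:
\begin{Verbatim}[tabsize=2]
import java.nio.ByteBuffer;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Flow;

// Minimal storage contract assumed by this sketch.
interface Storage {
    CompletableFuture<Void> save(String key, Flow.Publisher<ByteBuffer> content);
}

// Hypothetical upload flow: save the binary first, then update the metadata.
final class MavenUpload {
    private final Storage storage;

    MavenUpload(final Storage storage) {
        this.storage = storage;
    }

    CompletableFuture<Void> upload(final String key,
        final Flow.Publisher<ByteBuffer> body) {
        return this.storage.save(key, body)
            .thenCompose(none -> this.updateMetadata(key));
    }

    private CompletableFuture<Void> updateMetadata(final String key) {
        // regenerate maven-metadata.xml for the artifact's group (omitted)
        return CompletableFuture.completedFuture(null);
    }
}
\end{Verbatim}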
The engine stores repository configuration in a \dd{.yml} file. An example:
\begin{Verbatim}[tabsize=2]
repo:
type: maven
storage:
type: s3
url: s3://acme.com/snapshot
username: admin
password: 123qwe
\end{Verbatim}
This configuration file declares a repository of type \dd{maven},
which automatically enables Maven-specific metadata generation logic.
The \dd{storage} section tells Artipie to use S3 object storage
for uploaded artifacts and generated metadata.
The ability to choose where artifacts are stored provides flexibility:
any type of storage can be used, whether a server file system,
an object storage, or a key-value database.
The only requirement is that \hyperref[sec:asto]{abstract storage} must support it.
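A minimal sketch of how such a selection could be implemented is shown below; \dd{FileStorage} and \dd{S3Storage} are hypothetical placeholders, not the actual Artipie classes:
\begin{Verbatim}[tabsize=2]
import java.util.Map;

// Placeholders for real storage implementations.
interface Storage {
}

final class FileStorage implements Storage {
    FileStorage(final String path) {
        // keep blobs under this directory
    }
}

final class S3Storage implements Storage {
    S3Storage(final String url, final String user, final String pass) {
        // configure an S3 client for the given bucket and credentials
    }
}

// Hypothetical factory: builds a backend from the "storage" section
// of the repository configuration shown above.
final class StorageFactory {
    static Storage fromConfig(final Map<String, String> cfg) {
        final String type = cfg.get("type");
        switch (type) {
            case "fs":
                return new FileStorage(cfg.get("path"));
            case "s3":
                return new S3Storage(
                    cfg.get("url"), cfg.get("username"), cfg.get("password")
                );
            default:
                throw new IllegalArgumentException("Unsupported storage type: " + type);
        }
    }
}
\end{Verbatim}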
Aside from repository configuration, the way Artipie stores its own settings
can also be customized, via the \dd{artipie.yml} file:
\begin{Verbatim}[tabsize=2]
meta:
storage:
type: fs
path: /artipie/storage
\end{Verbatim}
Artipie resolves this file and uses a local filesystem folder to manage repository settings.
\subsection{Repository adapters}
Repository adapters are independent projects
that implement the metadata generation layer
for a specific package type (npm, Maven, etc.).
The \hyperref[sec:arti-engine]{Artipie engine} uses adapters to provide BRM functionality;
a possible shape of such an adapter is sketched after the list below.
Existing adapters:
\begin{itemize}
\item RPM - \href{https://github.com/artipie/rpm-adapter}{artipie/rpm-adapter}
\item NPM - \href{https://github.com/artipie/npm-adapter}{artipie/npm-adapter}
\item Go - \href{https://github.com/artipie/go-adapter}{artipie/go-adapter}
\item Docker - \href{https://github.com/artipie/docker-adapter}{artipie/docker-adapter}
\item Maven - \href{https://github.com/artipie/maven-adapter}{artipie/maven-adapter}
\item Gem - \href{https://github.com/artipie/gem-adapter}{artipie/gem-adapter}
\end{itemize}
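Although each adapter has its own API, a common shape, shown here purely as an illustration and not as the actual adapter interface, is a component that accepts an uploaded package and regenerates the format-specific metadata in the underlying storage:
\begin{Verbatim}[tabsize=2]
import java.nio.ByteBuffer;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Flow;

// Illustrative adapter contract: every adapter knows how to accept a package
// and how to (re)generate the format-specific metadata of its repository.
interface RepositoryAdapter {
    // Store the uploaded package under the given key.
    CompletableFuture<Void> upload(String key, Flow.Publisher<ByteBuffer> content);

    // Rebuild format-specific indexes, e.g. maven-metadata.xml or repodata.
    CompletableFuture<Void> updateMetadata();
}
\end{Verbatim}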
\subsection{Abstract storage}
\label{sec:asto}
Abstract storage (asto) is an abstraction over a physical data storage system.
It has a simple interface consisting of two operations: put and get.
This simplicity makes it easy to implement the interface on top of almost any data storage system.
The following design requirements were considered the most important for asto:
\begin{enumerate}
\item High performance
\item Back pressure of data streams on the level of bytes
\item Constant memory pool per data stream
\item Pure java interface, without any external dependencies
\item High operation latency awareness
\end{enumerate}
The following design options have been considered for the interface implementation:
\begin{itemize}
\item An option based on \dd{java.io.\{In,Out\}putStream}.
\item An option based on RxJava~3.
\item An option based on \dd{CompletableFuture} and the Java~9 \dd{Flow} API.
\end{itemize}
The \dd{java.io.\{In,Out\}putStream}-based approach has a conceptual drawback: its blocking nature,
which hurts performance by forcing a new thread allocation per user connection. That is
also the reason why no other blocking approach was considered.
The RxJava 3 option is close to ideal, but its downside is the exposure of an external
dependency: clients become bound to the RxJava primitives.
The standard Java non-blocking primitives were considered the most promising
ones, since a \dd{CompletableFuture} and \dd{Flow}-based interface can be implemented in a
high-performance way and in accordance with all the requirements mentioned above.
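Under these constraints, the storage contract can be sketched roughly as follows; this is a simplified illustration of the chosen direction, not the exact asto interface:
\begin{Verbatim}[tabsize=2]
import java.nio.ByteBuffer;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Flow;

// Simplified sketch of the abstract storage: two asynchronous operations,
// with values transferred as back-pressured streams of byte buffers.
interface AbstractStorage {
    // Save the content under the given key; completes when fully persisted.
    CompletableFuture<Void> put(String key, Flow.Publisher<ByteBuffer> content);

    // Load the content stored under the key as a reactive stream.
    CompletableFuture<Flow.Publisher<ByteBuffer>> get(String key);
}
\end{Verbatim}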
\subsection{Extensions}
to be written...
\section{Conclusion}
\label{sec:conclusion}
To be written...
\subsection{Acknowledgements}
\label{sec:ack}
The document was originally created by Yegor Bugayenko (y00538675).
\printbibliography%
\end{document}