๐๏ธ 22120
๐๏ธ - An archivist browser controller that caches everything you browse, a library server with full text search to serve your archive.
News - new binaries
- Overview
- License
- About
- Get 22120
- Using
- Format
- Why not WARC (or another format like MHTML) ?
- How it works
- FAQ
- Do I need to download something?
- Can I use this with a browser that's not Chrome-based?
- How does this interact with Ad blockers?
- How secure is running chrome with remote debugging port open?
- Is this free?
- What's the roadmap?
- What about streaming content?
- Can I black list domains to not archive them?
- Is there a DEBUG mode for troubleshooting?
- Can I version the archive?
- Can I change the archive path?
- Can I change this other thing?
Copyright (c) 2018, 2020, Dosyago and/or its affiliates. All rights reserved.
This is a release of 22120, a web archiver.
License information can be found in the LICENSE file.
This software is dual-licensed. For information about commercial licensing, see Dosyago Commercial License for OEMs, ISVs and VARs.
This project literally makes your web browsing available COMPLETELY OFFLINE. Your browser does not even know the difference. It's literally that amazing. Yes.
Save your browsing, then switch off the net and go to http://localhost:22120
and switch mode to serve then browse what you browsed before. It all still works.
warning: if you have Chrome open, it will close it automatically when you open 22120, and relaunch it. You may lose any unsaved work.
3 ways to get it:
- Get binary from the releases page., or
- Run with npx:
npx archivist1@latest
, ornpm i -g archivist1@latest && archivist1
- Clone this repo and run as a Node.JS app:
npm i && npm start
Also, coming soon is a Chrome Extension.
Go to http://localhost:22120 in your browser, and follow the instructions.
Archive will be located in 22120-arc/public/library
*
But it's not public, don't worry!
You can also check out the archive index, for a listing of every title in the archive. The index is accessible from the control page, which by default is at http://localhost:22120 (unless you changed the port).
*Note:22120-arc
is the archive root of a single archive, and by defualt it is placed in your home directory. But you can change the parent directory for 22120-arc
to have multiple archvies.
The archive format is:
22120-arc/public/library/<resource-origin>/<path-hash>.json
Inside the JSON file, is a JSON object with headers, response code, key and a base 64 encoded response body.
The case for the 22120 format.
Other formats save translations of the resources you archive. They create modifications, such as altering the internal structure of the HTML, changing hyperlinks and URLs into "flat" embedded data URIs, and require other "hacks* in order to save a "perceptually similar" copy of the archived resource.
22120 throws all that out, and calls rubbish on it. 22120 saves a verbatim high-fidelity copy of the resources your archive. It does not alter their internal structure in any way. Instead it records each resource in its own metadata file.
Why?
At 22120, we believe in the resources and in verbatim copies. We don't annoint ourselves as all knowing enough to modify the resource source of truth before we archive it, just so it can "fit the format* we choose. We don't believe we should be modifying or altering resources we archive. We belive we should save them exactly as they were presented. We believe the format should fit (or at least accommodate, and be suited to) the resource, not the other way around. We don't believe in conflating metadata with content; so we separate them. We believe separating metadata and content, and keeping the content pure and altered throughout the archiving process is not only the right thing to do, it simplifies every part of the audit trail, because we know that the modifications between archived copies of a resource of due to changes to the resources themselves, not artefacts of the format or archiving process.
Both WARC and MHTML require mutilatious modifications of the resources so that the resources can be "forced to fit" the format. At 22120, we believe this is not required (and in any case should never be performed). We see it as akin to lopping off the arms of a Roman statue in order to fit it into a presentation and security display box. How ridiculous! The web may be a more "pliable" medium but that does not mean we should treat it without respect for its inherent content.
Why is changing the internal structure of resources so bad?
In our view, the internal structure of the resource as presented, is the cannon. Internal structure is not just substitutable "presentation" - no, in fact it encodes vital semantic information such as hyperlink relationships, source choices, and the "strokes" of the resource author as they create their content, even if it's mediated through a web server or web framework.
Why else is 22120 the obvious and natural choice?
22120 also archives resources exactly as they are sent to the browser. It runs connected to a browser, and so is able to access the full-scope of resources (with, currently, the exception of video, audio and websockets, for now) in their highest fidelity, without modification, that the browser receives and is able to archive them in the exact format presented to the user. Many resources undergo presentational and processing changes before they are presented to the user. This is the ubiquitous, "web app", where client-side scripting enabled by JavaScript, creates resources and resource views on the fly. These sorts of "hyper resources" or "realtime" or "client side" resources, prevalent in SPAs, are not able to be archived, at least not utilizing the normal archive flow, within traditional wget
-based archiving tools.
In short, the web is an online medium, and it should be archived and presented in the same fashion. 22120 archives content exactly as it is received and presented by a browser, and it also replays that content exactly as if the resource were being taken from online. Yes, it requires a browser for this exercise, but that browser need not be connected to the internet. It is only natural that viewing a web resource requires the web browser. And because of 22120 the browser doesn't know the difference! Resources presented to the browser form a remote web site, and resources given to the browser by 22120, are seen by the browser as exactly the same. This ensures that the people viewing the archive are also not let down and are given the change to have the exact same experience as if they were viewing the resource online.
Uses DevTools protocol to intercept all requests, and caches responses against a key made of (METHOD and URL) onto disk. It also maintains an in memory set of keys so it knows what it has on disk.
Yes. But....If you like 22120, you might love the clientless hosted version coming in future. You'll be able to build your archives online from any device, without any download, then download the archive to run on any desktop. You'll need to sign up to use it, but you can jump the queue and sign up today.
No.
But...see #57. Just want to set some expectations, this is only an investigation and considering it, it might not ever get done. But, your voices made a difference, as I wasn't even considering it before.
Interacts just fine. The things ad blockers stop will not be archived.
Seems pretty secure. It's not exposed to the public internet, and pages you load that tried to use it cannot use the protocol for anything (except to open a new tab, which they can do anyway).
Yes this is totally free to download and use. It's also open source (under AGPL-3.0) so do what you want with it. For more information about licensing, see the license section.
- Full text search
- Library server to serve archive publicly.
- Distributed p2p web browser on IPFS
The following are probably hard (and I haven't thought much about):
- Streaming content (audio, video)
- "Impure" request response pairs (such as if you call GET /endpoint 1 time you get "A", if you call it a second time you get "AA", and other examples like this).
- WebSockets (how to capture and replay that faithfully?)
Probably some way to do this tho.
Yes! Put any domains into 22120-arc/no.json
*, eg:
[
"*.horribleplantations.com",
"*.cactusfernfurniture.com",
"*.gustymeadows.com",
"*.nytimes.com",
"*.cnn.co?"
]
Will not cache any resource with a host matching those. Wildcards:
*
(0 or more anything) and?
(0 or 1 anything)
*Note: the no
file is per-archive. 22120-arc
is the archive root of a single archive, and by defualt it is placed in your home directory. But you can change the parent directory for 22120-arc
to have multiple archvies, and each archive requires its own no
file, if you want a blacklist in that archive.
Yes, just make sure you set an environment variable called DEBUG_22120
to anything non empty.
So for example in posix systems:
export DEBUG_22120=True
Yes! But you need to use git
for versioning. Just initiate a git repo in your archive repository. And when you want to save a snapshot, make a new git commit.
Yes, there's a control for changing the archive path in the control page: http://localhost:22120
There's a few command line arguments. You'll see the format printed as the first printed line when you start the program.
For other things you can examine the source code.