ipfs-search consists of the following components:
- Sniffer
- Queue
- Crawler
- Metadata extractor
- Search backend
- API
- Frontend
The sniffer listens to the gossip between our IPFS node and other nodes, and adds any hash for which a provider is offered to the `hashes` queue.
RabbitMQ holds a `files` and a `hashes` queue with items to be crawled, in a soon-to-be well-defined JSON format.
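Since the message format is still being settled, the following is only a rough sketch of what publishing a hash to the `hashes` queue could look like, assuming a plain JSON body and the `pika` RabbitMQ client; the field names and queue settings are illustrative, not the final format:

```python
import json

import pika  # RabbitMQ client

# Hypothetical message shape; the actual JSON format is still to be defined.
message = {
    "hash": "QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG",
    "source": "sniffer",
}

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="hashes", durable=True)
channel.basic_publish(
    exchange="",
    routing_key="hashes",
    body=json.dumps(message),
)
connection.close()
```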
The crawler takes items off the `hashes` queue and attempts to list them using the IPFS RPC API. This tells it whether the item is a file, a directory or some other type.
If the item is a directory, its listing is indexed and the referenced items are added to the `hashes` queue when they are directories, or to the `files` queue when they are files.
If the crawled item is itself a file, it is added to the `files` queue and no further action is taken.
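The actual crawler is not shown here; the routing logic can be sketched roughly as follows, assuming a local node's RPC API on port 5001 and its `/api/v0/files/stat` and `/api/v0/ls` endpoints. The type handling and the `publish` helper are simplified assumptions, not the crawler's real code:

```python
import requests

IPFS_API = "http://localhost:5001/api/v0"


def crawl_hash(ipfs_hash, publish):
    """Determine the type of a hash and route it to the right queue.

    `publish(queue, item)` is a hypothetical helper that enqueues a JSON
    item on RabbitMQ.
    """
    # `files stat` accepts /ipfs/ paths and reports "file" or "directory".
    stat = requests.post(
        f"{IPFS_API}/files/stat", params={"arg": f"/ipfs/{ipfs_hash}"}
    ).json()

    if stat["Type"] == "file":
        # Files go onto the files queue; no further action is taken here.
        publish("files", {"hash": ipfs_hash})
        return

    if stat["Type"] == "directory":
        # Index the directory listing (omitted) and route every entry.
        listing = requests.post(
            f"{IPFS_API}/ls", params={"arg": ipfs_hash}
        ).json()
        for obj in listing.get("Objects", []):
            for link in obj.get("Links", []):
                item = {
                    "hash": link["Hash"],
                    "name": link["Name"],
                    "parent": ipfs_hash,
                }
                # In the ls output, type 1 denotes a directory; anything
                # else is treated as a file here for simplicity.
                queue = "hashes" if link["Type"] == 1 else "files"
                publish(queue, item)
```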
Jobs taken from the `files` queue are guaranteed to be files; metadata extraction and content type detection are attempted through ipfs-tika.
All indexed items will initially be given a `first-seen` field and, when seen again, will have their `last-seen` field set or updated.
When an item is referred to from a directory, i.e. when it is found as a directory entry in the `hashes` queue, the name it is referenced by and its parent directory are added to that item's list of references. This happens both for new and for existing items.
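A rough sketch of this bookkeeping as a scripted upsert against Elasticsearch; the index name, field names and reference shape are illustrative, not the actual schema:

```python
from datetime import datetime, timezone

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
now = datetime.now(timezone.utc).isoformat()
reference = {"name": "index.html", "parent_hash": "QmParentDirHash"}  # hypothetical

# Upsert: `first-seen` is only set when the document is created; `last-seen`
# is always refreshed and the (name, parent) reference is appended if new.
es.update(
    index="ipfs",
    id="QmSomeItemHash",  # hypothetical document id
    body={
        "script": {
            "source": (
                "ctx._source['last-seen'] = params.now;"
                "if (ctx._source.references == null) { ctx._source.references = []; }"
                "if (!ctx._source.references.contains(params.ref)) {"
                "  ctx._source.references.add(params.ref);"
                "}"
            ),
            "params": {"now": now, "ref": reference},
        },
        "upsert": {
            "first-seen": now,
            "last-seen": now,
            "references": [reference],
        },
    },
)
```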
ipfs-tika uses the local IPFS gateway to fetch a (named) IPFS resource and streams the resulting data into an Apache Tika metadata extractor.
It currently extracts the body text (up to a certain limit), links and any available metadata. In the future we hope to detect the document language as well.
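ipfs-tika itself is a Java service; its flow can nonetheless be sketched against the standard Apache Tika server REST API, assuming the default gateway and Tika ports (the exact endpoints ipfs-tika uses may differ):

```python
import requests

GATEWAY = "http://localhost:8080"  # local IPFS gateway
TIKA = "http://localhost:9998"     # Apache Tika server, default port


def extract(ipfs_path):
    """Fetch an IPFS resource via the gateway and run it through Tika.

    `ipfs_path` is e.g. "/ipfs/QmSomeHash/report.pdf" (hypothetical).
    """
    # Stream the resource from the local gateway instead of buffering it.
    source = requests.get(f"{GATEWAY}{ipfs_path}", stream=True)
    source.raise_for_status()

    # Pass the file name, if any, as a hint for content type detection,
    # and ask /rmeta for metadata plus extracted body content as JSON.
    name = ipfs_path.rsplit("/", 1)[-1]
    result = requests.put(
        f"{TIKA}/rmeta",
        data=source.iter_content(chunk_size=64 * 1024),
        headers={
            "Accept": "application/json",
            "Content-Disposition": f"attachment; filename={name}",
        },
    )
    result.raise_for_status()
    return result.json()
```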
Crawled items are stored in Elasticsearch, which has a custom mapping defined to prevent the many returned metadata fields from all being indexed (for obvious efficiency reasons).
In practice, we have found it necessary to regularly update the index in order to work around occasional problems with indexing, performance, queries or other factors.
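One way to keep the extracted metadata out of the inverted index, sketched below with illustrative index and field names rather than the actual mapping, is to disable the `metadata` object so it is stored with the document but its many fields are not indexed individually:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="ipfs",
    body={
        "mappings": {
            "properties": {
                "content": {"type": "text"},
                "first-seen": {"type": "date"},
                "last-seen": {"type": "date"},
                "references": {
                    "properties": {
                        "name": {"type": "text"},
                        "parent_hash": {"type": "keyword"},
                    }
                },
                # Store the extracted metadata verbatim in _source, but do
                # not map or index its unpredictable fields individually.
                "metadata": {"type": "object", "enabled": False},
            }
        }
    },
)
```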
The API provides a layer on top of the search backend, offering filtered output and limited query functionality, as well as reformatting the resulting items.
In the near future we hope to add an endpoint for submitting new items to the crawl queue as well.
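As an illustration of what such a layer does (this is not the actual API), a minimal handler could pass a simple query through to Elasticsearch and reformat the hits before returning them:

```python
from elasticsearch import Elasticsearch
from flask import Flask, jsonify, request

app = Flask(__name__)
es = Elasticsearch("http://localhost:9200")


@app.route("/search")
def search():
    """Pass a simple query to the search backend and reformat the hits."""
    q = request.args.get("q", "")
    result = es.search(
        index="ipfs",
        body={
            "query": {
                "simple_query_string": {
                    "query": q,
                    "fields": ["content", "references.name"],
                }
            },
            "size": 25,
        },
    )
    # Reformat: expose only a curated subset of each document.
    hits = [
        {
            "hash": hit["_id"],
            "score": hit["_score"],
            "first-seen": hit["_source"].get("first-seen"),
            "last-seen": hit["_source"].get("last-seen"),
            "references": hit["_source"].get("references", []),
        }
        for hit in result["hits"]["hits"]
    ]
    return jsonify({"total": result["hits"]["total"], "hits": hits})
```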
The frontend is nothing more than a static front to the search API.