tontinton/toshokan

Introduction

toshokan is a search engine (think Elasticsearch or Splunk) that stores its data on object storage. It is most similar to Quickwit.

It uses:

  • tantivy - for building and searching the inverted index data structure.
  • Apache OpenDAL - for an abstraction over object storages.
  • PostgreSQL - for storing metadata atomically, removing data races.

I've also written a blog post explaining the benefits and drawbacks of using object storage for data-intensive applications.

Architecture

How to use

toshokan create example_config.yaml

# Index a newline-delimited JSON file.
toshokan index test ~/hdfs-logs-multitenants-10000.json

# Index JSON records from Kafka.
# Every --commit-interval, whatever was read from the source is written to a new index file.
toshokan index test kafka://localhost:9092/topic --stream

toshokan search test "tenant_id:[60 TO 65} AND severity_text:INFO" --limit 1 | jq .
# {
#   "attributes": {
#     "class": "org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace"
#   },
#   "body": "src: /10.10.34.30:33078, dest: /10.10.34.11:50010, bytes: 234, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-202827006_103, offset: 0, srvID: d9ef1b17-4314-4cd8-91eb-095413c3427f, blockid: BP-108841162-10.10.34.11-1440074360971:blk_1074072709_331885, duration: 2571934",
#   "resource": {
#     "service": "datanode/01"
#   },
#   "severity_text": "INFO",
#   "tenant_id": 61,
#   "timestamp": "2016-04-13T06:46:54Z"
# }

# Merge index files for faster searching.
toshokan merge test

toshokan drop test
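
If you don't have a log file at hand, a tiny newline-delimited JSON file is enough to try the commands above. The sketch below is only an illustration: the path and the record values are made up, the field names are copied from the search output shown earlier, and it assumes the index created from example_config.yaml accepts those fields. If the query string follows tantivy's range syntax, `[60 TO 65}` should include 60 and exclude 65, so only the first record would be expected to match.

```shell
# Hypothetical sample path; any writable location works.
sample="${TMPDIR:-/tmp}/toshokan-sample.json"

# One JSON object per line. Field names mirror the search result shown above;
# the values are invented for illustration.
cat > "$sample" <<'EOF'
{"timestamp":"2016-04-13T06:46:54Z","tenant_id":60,"severity_text":"INFO","body":"first example record","resource":{"service":"datanode/01"},"attributes":{"class":"example.ClassA"}}
{"timestamp":"2016-04-13T06:46:55Z","tenant_id":65,"severity_text":"INFO","body":"second example record","resource":{"service":"datanode/01"},"attributes":{"class":"example.ClassB"}}
EOF

# Count the records (one per line).
wc -l < "$sample"

# Assumed follow-up, reusing the commands from above:
# toshokan index test "$sample"
# toshokan search test "tenant_id:[60 TO 65} AND severity_text:INFO" | jq .
```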