# ETL-Core Documentation
This document explains what ETL-Core is and how it works.

## ETL Infrastructure Architecture
### Architecture Framework
The `etl-core` repository serves as the primary engine for ETL actions. It operates at both the network level and the service level, and accepts custom configurations that developers can define within `etl-core`. Once a network and an export service are selected, `etl-core` can be used to export the desired blockchain data.

Currently, the Solana blockchain is supported via [etl-solana-config](https://github.com/BCWResearch/etl-solana-config).

### Macro Infrastructure
An RPC node is expected to serve requests. Blocks are continually requested from the node, and other data such as accounts may be requested as well when necessary. Upon response, the data is converted into a Protocol Buffers format and sent to a streaming queue, such as Google Cloud Pub/Sub or RabbitMQ. A transformer and loader are then needed to listen for these messages, transform them to match the table schema, and insert them into BigQuery.

## Response Deserialization
To deserialize JSON responses from the blockchain node, we expect the blockchain configuration to specify the structure of the response in a Rust `struct` and annotate it with the `Deserialize` macro from the `serde` library. This macro generates the deserialization code for the developer, which eases development and, more importantly, allows us to deserialize the response with the `simd-json` library.

The `simd-json` library uses CPU vector extensions for accelerated JSON deserialization. Currently, the library supports x86 and ARM vector extensions, but it falls back to standard deserialization on systems without SIMD support.
* Since x86's AVX2 is 256-bit while ARM's NEON is 128-bit, *you can expect the best performance on x86*.
* This library is only used when compiling in the `release` profile, because its error messages are less descriptive. For development, it is recommended that you compile in debug mode (the default profile), which uses the `serde` deserializer and thus provides more descriptive errors.
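
As a minimal sketch of this pattern, the snippet below assumes a hypothetical `BlockResponse` type (the real field layout is defined by the blockchain configuration, e.g. `etl-solana-config`) and switches between `simd-json` in release builds and `serde_json` in debug builds, mirroring the profile split described above.

```
use serde::Deserialize;

// Hypothetical response shape -- the actual structs are defined by the
// blockchain configuration, not by etl-core.
#[derive(Deserialize, Debug)]
struct BlockResponse {
    blockhash: String,
    #[serde(rename = "parentSlot")]
    parent_slot: u64,
}

// Release builds: SIMD-accelerated parsing. simd-json mutates the input
// buffer in place, hence the &mut [u8] parameter.
#[cfg(not(debug_assertions))]
fn parse_block(raw: &mut [u8]) -> Result<BlockResponse, Box<dyn std::error::Error>> {
    Ok(simd_json::from_slice(raw)?)
}

// Debug builds: fall back to serde_json for more descriptive errors.
#[cfg(debug_assertions)]
fn parse_block(raw: &mut [u8]) -> Result<BlockResponse, Box<dyn std::error::Error>> {
    Ok(serde_json::from_slice(raw)?)
}

fn main() {
    let mut raw = br#"{"blockhash":"abc","parentSlot":42}"#.to_vec();
    let block = parse_block(&mut raw).expect("valid JSON");
    println!("{block:?}");
}
```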

## Environment Variables
### Synopsis

You can define environment variables in a `.env` file. Examples are illustrated in `.env.example`, and an illustrative sketch is shown after the variable list below.

### Variables

- `ENDPOINT`
  **Required**. Specifies the address to use for JSON RPC requests.

- `FALLBACK_ENDPOINT`
  **Required**. Specifies the address to use for JSON RPC requests when the primary endpoint is failing. This value can be the same as `ENDPOINT`.

- `NUM_EXTRACTOR_THREADS`
  **Required**. Specifies the number of concurrent threads to run for the extract job.

- `ENABLE_METRICS`
  **Required**. Determines whether to launch a metrics server to collect metrics for Prometheus.

- `METRICS_ADDRESS`
  Optional. Required only if `ENABLE_METRICS` is true. Specifies the address of the metrics server.

- `METRICS_PORT`
  Optional. Required only if `ENABLE_METRICS` is true. Specifies the port of the metrics server.

- `RABBITMQ_ADDRESS`
  Optional. Required only if _STREAM_EXPORTER_ is set to `RABBITMQ_STREAM`. Specifies the address of RabbitMQ.

- `RABBITMQ_PORT`
  Optional. Required only if _STREAM_EXPORTER_ is set to `RABBITMQ_STREAM`. Specifies the port of RabbitMQ.

- `BIGTABLE_CRED`
  Optional. Specifies the file path of the credential file required to access GCP Bigtable.

- `GCP_CREDENTIALS_JSON_PATH`
  Optional. Required only if _STREAM_EXPORTER_ is set to `GOOGLE_PUBSUB`. Specifies the file path of the credential file required to access Google Pub/Sub.

- `GOOGLE_PUBSUB_TOPIC`
  Optional. Required only if _STREAM_EXPORTER_ is set to `GOOGLE_PUBSUB`. Specifies the Google Pub/Sub topic to be used during exporting. It is assumed that the Pub/Sub topic has already been created.
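
For illustration, a `.env` file following the list above might look like the sketch below. All values are placeholders (addresses, ports, and paths are assumptions, not defaults); `.env.example` in the repository remains the authoritative reference.

```
# Illustrative values only -- see .env.example for the real template.
ENDPOINT=http://127.0.0.1:8899
FALLBACK_ENDPOINT=http://127.0.0.1:8899
NUM_EXTRACTOR_THREADS=8
ENABLE_METRICS=true
METRICS_ADDRESS=0.0.0.0
METRICS_PORT=9090
RABBITMQ_ADDRESS=127.0.0.1
RABBITMQ_PORT=5552
GCP_CREDENTIALS_JSON_PATH=/path/to/service-account.json
GOOGLE_PUBSUB_TOPIC=projects/example-project/topics/example-topic
```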

## Data Extraction

All RPC requests are retried with backoff upon failure, with failures logged at the `warning` level.
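
The retry policy itself lives in the extractor code; the function below is only a rough sketch of the pattern (exponential backoff with a warning logged per failure, using the `log` crate purely for illustration).

```
use std::thread::sleep;
use std::time::Duration;

/// Retry `request` up to `max_retries` times with exponential backoff,
/// logging each failure at the warning level. Illustrative sketch only.
fn retry_with_backoff<T, E: std::fmt::Display>(
    max_retries: u32,
    mut request: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = Duration::from_millis(500);
    let mut attempt = 0;
    loop {
        match request() {
            Ok(value) => return Ok(value),
            Err(err) if attempt < max_retries => {
                log::warn!("request failed (attempt {attempt}): {err}; retrying in {delay:?}");
                sleep(delay);
                delay *= 2; // exponential backoff
                attempt += 1;
            }
            Err(err) => return Err(err),
        }
    }
}
```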

Blocks are requested from the node by the `call_getBlock()` function.

The `call_getBlockHeight()` function requests the current block height.

The `call_getMultipleAccounts()` function requests account data for a list of pubkeys. These pubkeys come from the accounts created and the token mints found in the block data.

The blockchain configuration is expected to define the HTTP requests that these functions make in a `<BLOCKCHAIN_CONFIG>/types/request_types.rs` file. These requests should be specified using `struct`s named `BlockHeightRequest` and `BlockRequest`, and they should implement `serde::Serialize`. It is recommended that you annotate these structs with `#[derive(serde::Serialize)]` to simplify this process and generate the code.
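
A hypothetical `request_types.rs` is sketched below. The field layout follows the JSON-RPC 2.0 envelope and is an assumption for illustration; the real structs are owned by the blockchain configuration.

```
use serde::Serialize;

// Sketch of <BLOCKCHAIN_CONFIG>/types/request_types.rs (illustrative only).
#[derive(Serialize)]
pub struct BlockHeightRequest {
    pub jsonrpc: String, // e.g. "2.0"
    pub id: u64,
    pub method: String,  // e.g. "getBlockHeight"
}

#[derive(Serialize)]
pub struct BlockRequest {
    pub jsonrpc: String,
    pub id: u64,
    pub method: String, // e.g. "getBlock"
    pub params: (u64,), // the slot to fetch
}
```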

### Concurrency

The master thread continually sends slot values to a concurrent queue for worker threads to index.

Long-lived worker threads are created at the start of runtime by the master thread and continually pull tasks (slot values) from the concurrent queue. Each thread requests the block data at its slot from the node, deserializes the response, and transmits the data to a stream queue. A condensed sketch of this pattern follows the list below.
* For communication with the stream queue (which supports concurrent producers), each thread serializes its data using the Protocol Buffers interface and transmits the message.
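
The sketch below condenses this producer/worker layout, using `crossbeam_channel` as a stand-in for the concurrent queue and a `println!` as a placeholder for the request/deserialize/serialize/stream steps; the real implementation wires the workers to the RPC client and the protobuf exporter.

```
use crossbeam_channel::unbounded;
use std::thread;

fn main() {
    // Concurrent queue of slot values, fed by the master thread.
    let (slot_tx, slot_rx) = unbounded::<u64>();

    // Long-lived worker threads created at the start of runtime.
    let workers: Vec<_> = (0..4)
        .map(|id| {
            let rx = slot_rx.clone();
            thread::spawn(move || {
                for slot in rx.iter() {
                    // Placeholder for: request block -> deserialize ->
                    // serialize to protobuf -> transmit to the stream queue.
                    println!("worker {id} processed slot {slot}");
                }
            })
        })
        .collect();

    // Master thread: continually enqueue slots for the workers to index.
    for slot in 0..20u64 {
        slot_tx.send(slot).expect("queue closed");
    }
    drop(slot_tx); // closing the channel lets the workers drain and exit

    for worker in workers {
        worker.join().unwrap();
    }
}
```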

## Features

### Synopsis

You can either enable features in the `Cargo.toml` file inside the `etl-core` repository or specify them as part of a command:

`cargo build --features ARGS...`
`cargo run --features ARGS...`

The `--features` option is required to build or run the ETL project. A sketch of the corresponding `Cargo.toml` entries is shown below.
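
The snippet below is a hypothetical sketch of the corresponding `[features]` table; the actual definitions in the repository's `Cargo.toml` may differ.

```
[features]
# Leave `default` empty to force an explicit --features choice, or pin
# defaults here, e.g. default = ["SOLANA", "RABBITMQ_STREAM"].
default = []
SOLANA = []
RABBITMQ = []
RABBITMQ_STREAM = []
GOOGLE_PUBSUB = []
```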

### Arguments

Currently, the following blockchains are supported:
- `SOLANA`

A message queue must also be specified:
- `RABBITMQ` - a classic RabbitMQ queue
- `RABBITMQ_STREAM` - a RabbitMQ queue with the Stream plugin
- `GOOGLE_PUBSUB` - Google Cloud Pub/Sub

### Examples

1. Build the local project and its dependencies for the _SOLANA_ blockchain and the _RABBITMQ_STREAM_ exporter
```
cargo build --release --features SOLANA,RABBITMQ_STREAM
```

2. Run the local project and its dependencies for the _SOLANA_ blockchain and the _RABBITMQ_STREAM_ exporter
```
cargo run --features SOLANA,RABBITMQ_STREAM
```

## Limitations
- Only a limited subset of `Token-2022 Program` information is extracted.
- The `SOLANA_BIGTABLE` feature can only request 1,000 confirmed slots per request.

## Project Progress

### Deployment Method
| Method                 | Development Status |
| ---------------------- | ------------------ |
| Dockerfile             | In Development     |
| Helm Chart             | In Development     |

### Export Method
| Method                 | Development Status |
| ---------------------- | ------------------ |
| CSV                    | Completed          |
| Google Pub/Sub         | Completed          |
| RabbitMQ               | Completed          |

### Extraction Source
| Source                 | Development Status |
| ---------------------- | ------------------ |
| Bigtable               | Completed          |
| JSON RPC               | Completed          |

### Metrics Collection
| Metric                 | Development Status |
| ---------------------- | ------------------ |
| Block Request Count    | In Development     |
| Failed Block Count     | Not Started        |

### Tables
| Table            | Development Status |
| ---------------- | ------------------ |
| Accounts         | Completed          |
| Blocks           | Completed          |
| Instructions     | Completed          |
| Tokens           | Completed          |
| Token Transfers  | Completed          |
| Transactions     | Completed          |

## Protocol Buffers

We use Protocol Buffers to serialize our data for transmission to a message queue or pub/sub system such as RabbitMQ or Google Cloud Pub/Sub.

Some blockchains provide their own protobuf interfaces, so when possible, we attempt to use those.

### Codegen
To generate Rust code from our protobuf interface, we use the `PROST` library. This is a popular library for Rust and is used by the Solana blockchain for its official "storage" protobuf. We perform this codegen at compile time using a custom Rust build script, `build_proto.rs`. This script uses the `include!` macro to import the protobuf build script from the blockchain-specific configuration; each blockchain config is expected to define its own protobuf build script.
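
As a rough sketch of such a build script (the `.proto` path here is an assumption for the example; the real script pulls in the blockchain config's protobuf build code via `include!`):

```
// build_proto.rs -- illustrative sketch of PROST codegen at compile time.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Rebuild when the protobuf interface changes.
    println!("cargo:rerun-if-changed=proto/etl.proto");
    // Generate Rust types from the protobuf definitions into OUT_DIR.
    prost_build::compile_protos(&["proto/etl.proto"], &["proto/"])?;
    Ok(())
}
```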