Bacalhau project report 20220715
CI was a right pain in the ass this week: we had issues both with test flakiness and with CircleCI tag builds failing, which stopped us from releasing.
The good news is that Guy made a heroic effort to fix test flakiness, and we've got CI much, much more reliable than it was before.
And the releasing issue is fixed (gives CircleCI a funny look about persisting via workspaces randomly giving you old versions of files), so we were able to release to production again!
We did the final piece of DevOps work to enable production metrics and traces in Honeycomb.
This will be of great value to us in investigating performance and behavior on the production network. We also have Prometheus instrumentation ready in the code, and a plan for how to deploy Prometheus to each production node and remote_write to Grafana Cloud.
You can now use `-v` and `-o` to mount input and output volumes to Python code running in the WASM runtime (with `bacalhau run python --deterministic`). These volumes are now plumbed all the way through to the virtual filesystem of the CPython instance running inside the Node.js WebAssembly runtime. This opens the door to deterministic and therefore verifiable data processing on content-addressable data in IPFS, which is very exciting as it starts to build out a global graph structure of verified `y=f(x)` where `x`, `f` and `y` are content-addressed, `f` is deterministic and `y` is verified. For all human knowledge!
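To make the verified y=f(x) idea concrete, here is a minimal sketch of verification by re-execution. It is not Bacalhau code: `content_id` uses a plain SHA-256 digest as a stand-in for a real IPFS CID, and `Claim`, `run_job`, `verify` and `word_count` are hypothetical names invented for illustration.

```python
import hashlib
from typing import Callable, NamedTuple


def content_id(data: bytes) -> str:
    """Stand-in for an IPFS CID (assumption): a SHA-256 hex digest of the bytes."""
    return hashlib.sha256(data).hexdigest()


class Claim(NamedTuple):
    """A claimed result: the content ids of f, x and y in y = f(x)."""
    f_id: str
    x_id: str
    y_id: str


def run_job(f_source: bytes, f: Callable[[bytes], bytes], x: bytes) -> Claim:
    """Run f on x and record the content ids of the function, input and output."""
    return Claim(content_id(f_source), content_id(x), content_id(f(x)))


def verify(claim: Claim, f_source: bytes, f: Callable[[bytes], bytes], x: bytes) -> bool:
    """Re-run f on x and check that every id matches the claim.

    This only works because f is deterministic: same input bytes, same output bytes.
    """
    return claim == run_job(f_source, f, x)


def word_count(x: bytes) -> bytes:
    """A deterministic toy job: count whitespace-separated words."""
    return str(len(x.split())).encode()


# The bytes that identify f; in practice this would be the job's code or image.
WORD_COUNT_SOURCE = b"def word_count(x): return str(len(x.split())).encode()"
```

Any node holding the same `f` and `x` can recompute `y` and compare ids, which is what makes a deterministic job checkable by a third party.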
Kai is heads down on a big change under the banner of the "datastore interface" - this unifies state on the system under a local interface, removing race conditions around updates to local knowledge that were starting to be problematic - and which would slow us down in the future. This will also unlock persistence across Bacalhau node restarts, so for example we'll eventually stop forgetting about old jobs every time we upgrade the production network. This is a good foundational improvement to the codebase.
The datastore interface is a prerequisite to implementing the parallelism/sharding design.
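As a rough illustration of why putting state behind one local interface removes update races, here is a small sketch. It is an assumption-laden toy, not the actual datastore design: `LocalJobStore` and its `get`/`update` methods are hypothetical names, and the state lives in memory only.

```python
import threading
from typing import Callable, Dict, Optional


class LocalJobStore:
    """Hypothetical in-memory store; every read and write goes through one lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._jobs: Dict[str, dict] = {}

    def get(self, job_id: str) -> Optional[dict]:
        """Return a copy of the job's state, or None if the job is unknown."""
        with self._lock:
            job = self._jobs.get(job_id)
            return dict(job) if job is not None else None

    def update(self, job_id: str, mutate: Callable[[dict], None]) -> dict:
        """Apply mutate to the job's state atomically and return a copy."""
        with self._lock:
            job = self._jobs.setdefault(job_id, {})
            mutate(job)
            return dict(job)
```

Because callers never touch the underlying dict directly, a read-modify-write such as "append an event to this job" cannot interleave with another one. A persistent backend could offer the same two methods, which is one way state could survive a restart.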
One fun bug was that running the same job twice on the same node didn't work because the CID was already downloaded - so we fixed that, taking care to ensure atomicity for concurrent use - and fixed `bacalhau get` repeatedly downloading the same CID at the same time!
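One common way to get this kind of behavior is a per-CID lock plus a write-to-temp-then-rename step. The sketch below shows that pattern under stated assumptions; it is not the fix that landed in Bacalhau, and `ensure_cid`, `_lock_for` and the `fetch` callback are hypothetical names. The locks only coordinate threads inside one process.

```python
import os
import tempfile
import threading
from pathlib import Path
from typing import Callable, Dict

_locks: Dict[str, threading.Lock] = {}
_locks_guard = threading.Lock()


def _lock_for(cid: str) -> threading.Lock:
    """Return the single lock associated with this CID, creating it if needed."""
    with _locks_guard:
        return _locks.setdefault(cid, threading.Lock())


def ensure_cid(cid: str, cache_dir: Path, fetch: Callable[[str], bytes]) -> Path:
    """Download cid into cache_dir at most once, even under concurrent calls."""
    target = cache_dir / cid
    with _lock_for(cid):
        if target.exists():
            return target  # already downloaded: reuse it instead of failing
        cache_dir.mkdir(parents=True, exist_ok=True)
        # Write to a temp file in the same directory, then rename into place,
        # so readers never observe a half-written file.
        fd, tmp_name = tempfile.mkstemp(dir=cache_dir, prefix=cid + ".tmp-")
        try:
            with os.fdopen(fd, "wb") as tmp:
                tmp.write(fetch(cid))
            os.replace(tmp_name, target)
        except BaseException:
            os.unlink(tmp_name)
            raise
    return target
```

The second caller for the same CID blocks on the lock, then finds the file already present and returns it without fetching again. Coordinating separate processes would need something extra, such as a file lock.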
This week we were sad to say goodbye to the most excellent Guy Paterson-Jones, who is off to try his hand at being a rock-climbing instructor. Guy's contributions have been epic and he's awesome, and will be missed.
We are also thrilled to welcome Phil Winder and Enrico Rotundo to the team. Kai and I have worked with Phil and Enrico on several other projects and we know they'll be able to start making productive contributions to the project in no time! The first issues they'll pick up are GPU support and loading external files via HTTP (including e.g. large lists of URLs) as inputs into Bacalhau jobs, both of which will directly drive user success.
So: quite a bit of my time was spent doing handover with Guy and onboarding Phil and Enrico. Onwards and upwards!
- Land datastore interface
- More scaling work
- Implement parallelism/sharding
- GPUs
- HTTP input files