Open Discussion with Lurk
- Performance:
- What performance task will we target (e.g. almost certainly not performant enough for ML requirements around weight transfers - but maybe?)
- What performance/$ will be MVP for us?
- What's the minimum size task where this is valuable?
- What's the minimum time for the smallest possible task? e.g. 10 minutes? 1 hour?
- What's the comparison for perf/$ vs. other platforms for equivalent job?
- Core functionality:
- Need to encode data correctly so that it can be broken up
- Functionality extension:
- Two-level orchestration - a job that then fires off additional jobs.
- Corollary - how does an orchestration job identify miners who have subsets of the data
- Corollary - sub jobs will have to have their own job success criteria (e.g. potentially different timeouts, costs, etc.), and push results back to the original executor.
- Corollary - sub jobs will be paid subsets of the original cost to run, but only if their sub-work has been accepted. This means we will have to think about how acceptance of the top-level job trickles down to acceptance of sub jobs, and a way to divvy up the original payment according to the amount of work each sub-miner did (a sketch of this split appears after this list).
- To what extent are we dependent on a meta-deal-making apparatus to allow apparent pieces to go into multiple sectors
- Corollary - Where do we store results (both intermediate results and end results) - push back onto the chain to start, but is that a good solution?
- When we get to cross-miner (which will be required to deal with datasets larger than a single sector), we will have to understand how to share intermediate proofs.
- Need to define a format for intermediate data plus the requisite state for continuing the computation.
- Do we need to develop a structure for intermediate entities to be doing the sharding of the work (person A receives the job, uses an indexer to find the shards and does the work to split up and hand-off) - they should extract some value for doing this (even though they didn't do any of the 'actual' work)
- If they don't see anyone bidding on the work, theoretically they could do the work themselves and charge for that.
- Eventual flow for multi-miner orchestration:
- Job layout: request → analysis → distributed computation → aggregation → client payment → contributor claim and reimbursement
- Built-in audit trail - a signature for everyone who contributed to it (allowing everyone who did work to get paid after the top-level job is accepted - because then if you release the data AND it's used, you have proof that you used the data)
- Also incentivizes folks to act quickly - beat other miners to the result - because the 'winning fork' will be the one that gets paid.
- If I calculate some results but don't release them fast enough, the final answer will be computed without relying on my contribution.
- The trick is figuring out how to embed the attribution in the intermediate proofs such that it can't be stripped out.
- Will need to design the computations in such a way that a valid response must include this metadata.
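A minimal sketch, in Go, of what the job / sub-job split and the pro-rata payout could look like. Every type and field name here (`Job`, `SubJob`, `Payouts`, the signature field standing in for the audit trail) is an assumption for discussion, not an existing Bacalhau, Filecoin, or Lurk API.

```go
// Hypothetical sketch: a top-level job fans out into sub-jobs, and the parent
// budget is split pro-rata across accepted sub-jobs once the top-level result
// is accepted. Rejected shards earn nothing.
package orchestration

import "time"

// SubJob is one shard of the parent job, run by a miner holding a subset of the data.
type SubJob struct {
	ID        string
	ShardCID  string        // content address of the data subset this sub-job runs over
	Timeout   time.Duration // sub-jobs may have their own success criteria
	MaxCost   uint64        // agreed cost in attoFIL, a fraction of the parent budget
	MinerID   string        // miner that claimed the shard
	ResultCID string        // pushed back to the original executor when done
	Signature []byte        // contributor signature, kept as an audit trail for later payout
	Accepted  bool          // set only after the top-level job is accepted
}

// Job is the top-level request; an orchestrating node splits it into sub-jobs.
type Job struct {
	ID         string
	ProgramCID string // content-addressable expression (e.g. a Lurk program)
	DataRoot   string // root hash of the full dataset, spread across sectors/miners
	Budget     uint64
	SubJobs    []SubJob
}

// Payouts divides the parent budget across accepted sub-jobs, proportional to
// each sub-job's agreed cost.
func (j *Job) Payouts() map[string]uint64 {
	var total uint64
	for _, s := range j.SubJobs {
		if s.Accepted {
			total += s.MaxCost
		}
	}
	out := make(map[string]uint64)
	if total == 0 {
		return out
	}
	for _, s := range j.SubJobs {
		if s.Accepted {
			out[s.MinerID] += j.Budget * s.MaxCost / total
		}
	}
	return out
}
```

`Payouts` is only meant to show the shape of the problem: acceptance of the top-level job is what unlocks the split, and the per-shard signatures are what make the later claims attributable.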
- "Data sets"
- Need a first-class way to address cross-sector data
- How do we force data to be distributed among many miners (otherwise, there will be less benefit to a system like this)?
- How do we allow selection of machine profile (I want this to execute on an accelerator/GPU)? (see the sketch after this list)
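A rough sketch, in Go, of what a first-class cross-sector dataset descriptor and a machine-profile selector might look like. All names (`DataSet`, `Shard`, `MachineProfile`) are placeholders for discussion, not an existing Filecoin or Bacalhau schema.

```go
// Hypothetical descriptor for data that spans many sectors and many miners,
// plus a machine profile so a requester can ask for an accelerator/GPU.
package datasets

// Shard locates one piece of the dataset inside a specific sector on a specific miner.
type Shard struct {
	PieceCID string // content address of the piece
	SectorID uint64
	MinerID  string
}

// DataSet addresses cross-sector data as a single first-class unit.
type DataSet struct {
	RootCID string  // single root hash for the whole dataset
	Shards  []Shard // where the pieces actually live
}

// MachineProfile lets the requester constrain where the compute runs.
type MachineProfile struct {
	GPU       bool
	MinCPUs   int
	MinRAMGiB int
}
```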
- Compute
- Possible structure: a miner can verify Lurk proofs and, for smart contracts, is able to minimally parse its data (which should ultimately be a subset of IPLD)
- Put a contract on chain that says: I want to perform query Q on data D (see the sketch at the end of this section).
- Q is a content-addressable expression corresponding to a Lurk program.
- D is the root hash of some Lurk data (which happens to be stored in some number of sectors).
- It could also be the case that this data exists outside of Filecoin. The Filecoin part might be cold storage for some other data.
- But the client (and perhaps the chain itself) can know that some set of miners do have the data.
- So at least that set of storage providers (miners) will be able to efficiently perform the query.
- Sub miners claiming portions of payment
- Setting aside the problem of 'delivery', anyone who can create a proof of correct response can then claim the payment.
- The delivery problem is that the proof will just certify that some CID representing the response is correct.
- How do we verify that data was transmitted after computation?
- So the important point is that Lurk provides a mechanism for a kind of 'function-evaluation-addressable data' which can actually be trusted.
- So in the same way that I can verify a CID and know that the data someone gives me really corresponds to what was intended by the person who gave me the CID…
- I can verify that the result of a computation really is correct, even though the person who specified the new-kind-of-CID didn't know the result (so couldn't hash it to produce a digest).
- Could we build a global computational graph that would enable caching/storage of computation on Filecoin sectors - reducing redundant computations?
- How do we leverage Filecoin more - e.g. Filecoin uses proof of replication for its intended purpose:
- We could have two queries: certified and uncertified.
- Uncertified means: here's the data I want to query. If you have it (all of it), go nuts!
- Certified means:
- This is a cross-sector query, and I know it will only be answered if adversarial parties cooperate.
- In addition to proving the data has the root you requested, I must also prove that I possess that data in a Filecoin sector.
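A minimal sketch of the on-chain query contract described above: Q as a content-addressed Lurk program, D as the root hash of the data, a certified/uncertified flag, and the claim a miner would submit. Only the Q/D/certified structure comes from these notes; the field names and proof encodings are assumptions.

```go
// Hypothetical on-chain query and the claim that answers it.
package compute

// QueryKind distinguishes the two query flavours discussed above.
type QueryKind int

const (
	Uncertified QueryKind = iota // "if you have all the data, go nuts"
	Certified                    // must also prove possession of the data in a sector
)

// Query is what the client puts on chain: perform program Q on data D for a reward.
type Query struct {
	Q      string // CID of the Lurk program
	D      string // root hash of the Lurk data (possibly stored across sectors)
	Kind   QueryKind
	Reward uint64 // attoFIL paid to whoever proves a correct response
}

// Claim is submitted by a miner: the response CID plus a Lurk proof that
// evaluating Q over D really produced that response. For certified queries it
// also carries a proof that the miner holds D in a Filecoin sector.
type Claim struct {
	Query           Query
	ResponseCID     string
	EvalProof       []byte // Lurk proof: "function-evaluation-addressable data"
	PossessionProof []byte // only required when Query.Kind == Certified
}
```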
- Partnerships
- Should we partner with folks who want to develop a language for certifiable data (IPLD) interpretation and transformation for use in decentralized systems?
- How coupled should this effort be with other chains and/or FVM?
- Would it be useful to deploy a local test network, or would it be OK as a subset of the IPFS network?
- Both have separate use cases
- We should think about targeting a low-trust environment, but allow for a spectrum (see the sketch below)
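A small configuration sketch of the deployment spectrum above: a local test network versus running as a subset of the IPFS network, with trust as a dial rather than a binary. Names and fields are illustrative only, not an existing config format.

```go
// Hypothetical deployment configuration for the network/trust discussion above.
package deploy

// NetworkMode selects where the nodes live.
type NetworkMode int

const (
	LocalTestNet NetworkMode = iota // isolated network for development/testing
	IPFSSubset                      // participate as a subset of the public IPFS network
)

// TrustLevel is a dial rather than a binary: target low trust, but allow a spectrum.
type TrustLevel int

const (
	LowTrust   TrustLevel = iota // adversarial peers assumed; proofs required
	MixedTrust                   // some vetted peers, some unknown
	HighTrust                    // closed deployment of trusted operators
)

// Config combines the two choices for a deployment.
type Config struct {
	Mode  NetworkMode
	Trust TrustLevel
}
```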
- What is the way to describe the higher-level primitive that maps to the entire dataset?
- Imagine you had a compute job with many sequences along many data sets
- Imagine shipping the entire compute job based on evaluating a condition over what data is here (sketched below)
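A sketch of that "ship the job to where the data already is" primitive: the job carries a predicate over the node's local index, and the node accepts the whole job only if the condition holds. The `Predicate` and `LocalIndex` types are assumptions for illustration.

```go
// Hypothetical higher-level primitive: a compute job over an entire dataset,
// accepted by a node only if its locally held data satisfies a condition.
package primitive

// LocalIndex is whatever a node knows about the shards it stores locally.
type LocalIndex map[string]bool // CID -> present

// Predicate decides whether this node's local data is enough to run the job here.
type Predicate func(idx LocalIndex) bool

// ComputeJob maps a program over an entire (possibly huge) dataset.
type ComputeJob struct {
	ProgramCID string
	DatasetCID string
	ShouldRun  Predicate // evaluated against the local index before accepting
}

// Accept ships the whole job only where the condition over local data holds.
func (j ComputeJob) Accept(idx LocalIndex) bool {
	return j.ShouldRun(idx)
}
```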
- Framework for thinking of this - invokedynamic (from Java)
- Should be executed as a layer 2 protocol
- Make it work with Kubernetes (should have the ability to distribute to pods, not just nodes?)
- Customer dataset built on-prem - clinical diagnostics
- Works against existing data
- USED TO:
- Minio with blob store
- Do puts into local host
- Look up via hash
- Docker container spawned to execute the command (see the sketch after this list)
- Company moved away from S3/Minio to IPFS because S3/Minio does not allow for lazy setup/changing of the cluster sizing
- Lazy pulling of archived data
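A rough reconstruction of the "used to" flow in Go: put a blob into a local content-addressed store, look it up by its hash, then spawn a Docker container to run a command against it. The store path, image name, and command are hypothetical stand-ins for the customer's actual setup.

```go
// Sketch of the prior on-prem workflow: content-addressed puts, lookup via
// hash, then a Docker container spawned to execute the analysis command.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

const storeDir = "/var/blobstore" // assumed local blob store location

// put writes the blob and returns its hash, which is also its lookup key.
func put(data []byte) (string, error) {
	sum := sha256.Sum256(data)
	key := hex.EncodeToString(sum[:])
	return key, os.WriteFile(filepath.Join(storeDir, key), data, 0o644)
}

// runJob spawns a container with the blob mounted read-only and executes a command.
func runJob(key string) error {
	blob := filepath.Join(storeDir, key)
	cmd := exec.Command("docker", "run", "--rm",
		"-v", blob+":/input:ro",
		"example/diagnostics:latest", // hypothetical analysis image
		"analyze", "/input")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	key, err := put([]byte("example sequencing data"))
	if err != nil {
		panic(err)
	}
	fmt.Println("stored blob", key)
	if err := runJob(key); err != nil {
		panic(err)
	}
}
```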
- Alternates: Containers or WASM running in place (not just in the local experience)
- Value of provability:
- I'm not running a data center
- I want to make sure the storage provider (who I don't trust) ran the binary that I handed them
- HIPAA could encourage the use of zero-knowledge proofs
- Alternative to proof is to use trusted environment (SGX) to execute compute
- Plausibility in order:
- Running a docker container
- Need to run arbitrary compute - WASM & FVM are not acceptable for now
- Core seems to be a disk throughput and locality issue
- Don't use IPLD to manage cross-sector nodes
- Need to support WinCE - file lands on the WinCE node and then the compute gets executed
- Node agent runs on every node and uses inotify (our job is the one that spun up the node sequencer and fires roughly 30 hours later)
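A minimal sketch of that node agent, assuming the widely used fsnotify package (which wraps inotify on Linux): watch a landing directory and trigger the compute step when a file appears. The watched path and the commented-out handler are placeholders.

```go
// Hypothetical per-node agent: inotify watch on a landing directory, firing the
// compute step when a file lands.
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

func main() {
	watcher, err := fsnotify.NewWatcher() // uses inotify on Linux
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// Hypothetical landing directory, e.g. where the sequencer drops its output.
	if err := watcher.Add("/data/incoming"); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case event := <-watcher.Events:
			if event.Op&fsnotify.Create != 0 {
				log.Printf("file landed: %s, triggering compute", event.Name)
				// startJob(event.Name) // hand off to the local compute runner
			}
		case err := <-watcher.Errors:
			log.Println("watch error:", err)
		}
	}
}
```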
- WASM on Lurk
- Have the proof of running
- As a compute requester, I should be able to select the style of running - trusted, provable, fast, etc.
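A small sketch of how that requested running style could be expressed on a job, mirroring the trusted (e.g. SGX), provable (e.g. Lurk / zero-knowledge proof), and fast (plain container) options mentioned in these notes. Names are illustrative.

```go
// Hypothetical requester-selected execution style attached to a job spec.
package execstyle

// Style is the requester-selected trade-off between trust, proof, and speed.
type Style int

const (
	Trusted  Style = iota // run inside a trusted environment (e.g. SGX)
	Provable              // produce a verifiable proof of execution (e.g. Lurk / ZK)
	Fast                  // no attestation, just run it (e.g. a plain Docker container)
)

// JobSpec is a compute request that carries the desired execution style.
type JobSpec struct {
	ProgramCID string
	DataCID    string
	Style      Style
}
```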