
Open Discussion with Lurk

David Aronchick edited this page Jan 25, 2022 · 1 revision

Raw Notes / Questions

  • Performance:
    • What performance task will we target (e.g. almost certainly not performant enough for ML requirements around weight transfers - but maybe?)
    • What performance/$ will be MVP for us?
    • What's the minimum size task where this is valuable?
    • What's the minimum time for the smallest possible task? e.g. 10 minutes? 1 hour?
    • What's the comparison for perf/$ vs. other platforms for equivalent job?
  • Core functionality:
    • Need to encode data correctly so that it can be broken up
  • Functionality extension:
    • Two-level orchestration - a job that then fires off additional jobs.

      • Corollary - how does an orchestration job identify miners who have subsets of the data
      • Corollary - sub-jobs will have to have their own job success criteria (e.g. potentially different timeouts, costs, etc.) and push results back to the original executor.
      • Corollary - sub-jobs will be paid subsets of the original cost to run, but only if their sub-work has been accepted. This means we will have to think about how to allow "acceptance" of the top-level job to trickle down to acceptance of sub-jobs, and a way to divvy up the original payment according to the amount of work each sub-miner did
        • To what extent are we dependent on a meta-deal-making apparatus to allow apparent pieces to go into multiple sectors
      • Corollary - Where do we store results (both intermediate results and end results) - push back onto the chain to start, but is that a good solution?
        • When we get to cross-miner (which will be required to deal with datasets larger than a single sector), we will have to understand how to share intermediate proofs.
        • Need to define a format for intermediate data plus the requisite state for continuing the computation.
      • Do we need to develop a structure for intermediate entities to be doing the sharding of the work (person A receives the job, uses an indexer to find the shards and does the work to split up and hand-off) - they should extract some value for doing this (even though they didn't do any of the 'actual' work)
        • If they don't see anyone bidding on the work, theoretically they could do the work themselves and charge for that.
      • Eventual flow for multi-miner orchestration:
        • Job layout: request → analysis → distributed computation → aggregation → client payment → contributor claim and reimbursement
        • Built-in audit trail - signature for everyone who contributed to it (allowing everyone who did work to get paid after the top-level job accepted - because then if you release the data AND it's used THEN you have proof that you used the data)
        • Also incentivizes folks to act quickly - beat other miners to the result - Because the 'winning fork' will be the one that gets paid.
          • If I calculate some results but don't release them fast enough, the final answer will be computed without relying on my contribution.
        • The trick is figuring out how to embed the attribution in the intermediate proofs such that it can be stripped out.
          • Will need to design the computations in such a way that a valid response must include this metadata.
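The payment-split and attribution corollaries above can be sketched as a minimal job tree. All names here (`Job`, `SubResult`, `settle`) are hypothetical illustrations, not an existing Lurk or Filecoin API:

```python
# Sketch of the two-level orchestration flow: sub-results carry a
# contributor identity (a signature in practice), and payment is split
# proportionally to accepted work once the top-level job is accepted.
from dataclasses import dataclass, field

@dataclass
class SubResult:
    miner: str        # contributor identity (would be a signature in practice)
    work_units: int   # measure of work done, used to divvy up payment
    payload: bytes    # intermediate result pushed back to the original executor

@dataclass
class Job:
    payment: int                      # total payment offered by the client
    results: list = field(default_factory=list)
    accepted: bool = False

    def accept(self):
        # Acceptance of the top-level job trickles down to sub-jobs.
        self.accepted = True

    def settle(self):
        # Pay each sub-miner a share proportional to the work it contributed.
        if not self.accepted:
            return {}
        total = sum(r.work_units for r in self.results)
        return {r.miner: self.payment * r.work_units // total
                for r in self.results}

job = Job(payment=100)
job.results += [SubResult("minerA", 3, b"..."), SubResult("minerB", 1, b"...")]
job.accept()
print(job.settle())  # {'minerA': 75, 'minerB': 25}
```

Note that `settle()` pays nothing until `accept()` runs, mirroring the "acceptance trickles down" requirement.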
    • "Data sets"

      • Need a first class way to address cross-sector data
      • How do we force data to be distributed among many miners (otherwise, there will be less benefit to a system like this)?
      • How do we allow selection of machine profile (I want this to execute on an accelerator/GPU)?
    • Compute

      • Possible structure: miners can verify Lurk proofs and, for smart contracts, minimally parse their data (which should ultimately be a subset of IPLD)
        • Put a contract on chain that says, I want to perform query Q on data D.
        • Q is a content-addressable expression corresponding to a Lurk program.
        • D is the root hash of some Lurk data (which happens to be stored in some number of sectors).
        • It could also be the case that this data exists outside of Filecoin. The Filecoin part might be cold storage for some other data.
        • But the client (and perhaps the chain itself) can know that some set of miners do have the data.
        • So at least that set of storage providers (miners) will be able to efficiently perform the query.
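The on-chain request described above can be sketched as a contract pinning content addresses for both Q and D, so the chain never stores the program or data itself. SHA-256 is a stand-in here; real Lurk expressions are content-addressed by their own commitment scheme, and all names are illustrative:

```python
# Sketch: a contract saying "I want to perform query Q on data D",
# where Q and D are both content addresses (hashes).
import hashlib, json

def cid(obj) -> str:
    # Stand-in content address: hash of a canonical encoding.
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

query_program = "(lambda (d) (filter even? d))"   # the Lurk program Q
data_root = cid([2, 4, 6, 8])                     # root hash of data D

contract = {
    "Q": cid(query_program),  # content-addressable expression
    "D": data_root,           # root of data stored across some sectors
    "payment": 10,
}
print(contract["Q"][:16], contract["D"][:16])
```

Any storage provider holding the data behind `D` can resolve both hashes locally and bid on the job.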
      • Sub miners claiming portions of payment
        • Setting aside the problem of 'delivery', anyone who can create a proof of correct response can then claim the payment.
        • The delivery problem is that the proof will just certify that some CID representing the response is correct.
      • How do we verify data was transmitted after computation
        • So the important point is that Lurk provides a mechanism for a kind of 'function-evaluation-addressable data' which can actually be trusted.
        • So in the same way that I can verify a CID and know that the data someone gives me really corresponds to what was intended by the person who gave me the CID…
        • I can verify that the result of a computation really is correct, even though the person who specified the new-kind-of-CID didn't know the result (so couldn't hash it to produce a digest).
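The 'function-evaluation-addressable data' idea above can be illustrated with a toy claim-and-check: the client pins (Q, D) without knowing the result, and the prover delivers a result plus evidence binding it to (Q, D). In real Lurk the evidence is a zk-SNARK, so the verifier does not re-run the program; this sketch recomputes instead, purely for illustration, and all names are hypothetical:

```python
# Toy illustration: a claim binds (Q, D, result-CID) together, so the
# requester can verify a result they could not have hashed in advance.
# A real verifier checks a succinct proof instead of re-evaluating.
import hashlib

def h(*parts: bytes) -> str:
    return hashlib.sha256(b"|".join(parts)).hexdigest()

def evaluate(q: str, d: list) -> list:
    # Stand-in evaluator for the Lurk program Q applied to data D.
    return [x for x in d if x % 2 == 0]

Q, D = "keep-even", [1, 2, 3, 4]
result = evaluate(Q, D)
result_cid = h(repr(result).encode())

# The prover's claim: "evaluating Q on D yields result_cid".
claim = h(Q.encode(), repr(D).encode(), result_cid.encode())

# Requester side: accept the delivered result iff it matches the claim.
assert h(Q.encode(), repr(D).encode(),
         h(repr(result).encode()).encode()) == claim
print("claim verified:", result)
```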
      • Could we build a global computational graph that would enable caching/storage of computation on Filecoin sectors - reducing redundant computations?
      • How do we leverage Filecoin more - e.g. Filecoin uses proof of replication for its intended purpose:
        • We could have two queries: certified and uncertified.
        • Uncertified means: here's the data I want to query. If you have it (all of it), go nuts!
        • Certified means:
          • This is a cross-sector query, and I know it will only be answered if adversarial parties cooperate.
          • In addition to proving the data has the root you requested, I must also prove that I possess that data in a Filecoin sector.
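The certified/uncertified split above can be sketched as a simple dispatch; the sector-possession proof is stubbed out and all names are illustrative:

```python
# Sketch of the two query modes: uncertified serves whenever the node
# has all the data; certified additionally demands evidence of
# possession in a Filecoin sector (stubbed here as a non-None token).
def handle_query(query, have_all_data, mode, sector_proof=None):
    if mode == "uncertified":
        # "If you have it (all of it), go nuts!"
        return have_all_data
    if mode == "certified":
        # Must also prove the data sits in a Filecoin sector.
        return have_all_data and sector_proof is not None
    raise ValueError(mode)

print(handle_query("Q", True, "uncertified"))          # True
print(handle_query("Q", True, "certified"))            # False: no proof
print(handle_query("Q", True, "certified", "PoRep"))   # True
```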
    • Partnerships

      • Should we partner with folks who want to develop a language for certifiable data (IPLD) interpretation and transformation for use in decentralized systems?
      • How coupled should this effort be with other chains and/or FVM?
    • Would it be useful to deploy a local test network, or would it be OK as a subset of the IPFS network?

      • Both have separate use cases
      • We should think about targeting low-trust environment, but allow for a spectrum
    • What is the way to describe the higher-level primitive that maps to the entire dataset

      • Imagine you had a compute job with many sequences along many data sets
      • Imagine shipping the entire compute job based on evaluating a condition of what data is here
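Shipping a job based on "a condition of what data is here" can be sketched as a locality check the node evaluates before accepting; all names are hypothetical:

```python
# Sketch: a node accepts and runs the whole compute job only if every
# required data shard is already local; otherwise it declines so a
# better-placed node can take it.
def ship_job(job, local_cids, required_cids):
    if set(required_cids) <= set(local_cids):
        return job(required_cids)
    return None  # decline; another node holding the data should run it

job = lambda cids: f"ran over {len(cids)} shards"
print(ship_job(job, {"cid1", "cid2", "cid3"}, ["cid1", "cid2"]))  # runs
print(ship_job(job, {"cid1"}, ["cid1", "cid2"]))                  # None
```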
    • Framework for thinking of this - invoke dynamic (from Java)

    • Should be executed as a layer 2 protocol

    • Make it work with Kubernetes (should have the ability to distribute to pods, not just nodes?)

    • Customer dataset built on-prem - clinical diagnostics

      • Works against existing data
      • USED TO:
        • Minio with blob store
        • Do puts into local host
        • Look up via hash
        • Docker container spawn to execute command
      • Company moved away from S3/Minio to IPFS because S3/Minio does not allow for lazy setup/changing of the cluster sizing
        • Lazy pulling of archived data
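The put/lookup/spawn workflow described above can be sketched as a content-addressed local blob store plus a container command. Paths, image name, and the docker invocation are illustrative only (the command is constructed, not executed):

```python
# Sketch of the former Minio-style flow: put blobs locally, look them
# up via hash, then spawn a container to execute compute over the blob.
import hashlib, os, tempfile

STORE = tempfile.mkdtemp()

def put(blob: bytes) -> str:
    # Content-addressed put: the blob's hash is its lookup key.
    digest = hashlib.sha256(blob).hexdigest()
    with open(os.path.join(STORE, digest), "wb") as f:
        f.write(blob)
    return digest

def get(digest: str) -> bytes:
    # Look up via hash.
    with open(os.path.join(STORE, digest), "rb") as f:
        return f.read()

def docker_cmd(digest: str) -> list:
    # Command a node agent might spawn to execute compute over the blob.
    # 'analysis-image' and 'process' are hypothetical placeholders.
    return ["docker", "run", "--rm",
            "-v", f"{STORE}:/data:ro",
            "analysis-image", "process", f"/data/{digest}"]

h = put(b"clinical-batch-001")
assert get(h) == b"clinical-batch-001"
print(" ".join(docker_cmd(h)))
```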
    • Alternates: containers or WASM running in place (not just in the local experience)

    • Value of provability:

      • I'm not running a data center
      • I want to make sure the storage provider (whom I don't trust) ran the binary that I handed them
      • HIPAA could encourage the use of zero-knowledge proofs
      • Alternative to proof is to use trusted environment (SGX) to execute compute
      • Plausibility in order:
        • Running a docker container
        • Need to run arbitrary compute - WASM & FVM is not acceptable for now
    • Core seems to be disk throughput and locality issue

    • Don't use IPLD to manage cross-sector nodes

    • Need to support WinCE - file lands on the WinCE node and then the compute gets executed

    • Node agent runs on every node and uses inotify (our job is the one that spun up the node sequencer, firing roughly 30 hours later)
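The node-agent pattern above can be sketched as a directory watcher that fires the compute job when a file lands. A real agent would use inotify; this stdlib polling loop is a portable stand-in, and all paths and handlers are illustrative:

```python
# Sketch of the per-node agent: watch a directory for newly landed
# files and invoke a handler for each one (inotify would push these
# events instead of polling).
import os, tempfile

def watch_once(directory: str, seen: set, on_new):
    # One polling pass over the directory, firing on unseen files.
    for name in os.listdir(directory):
        if name not in seen:
            seen.add(name)
            on_new(os.path.join(directory, name))

watched = tempfile.mkdtemp()
fired = []
seen = set()
open(os.path.join(watched, "sample.fastq"), "w").close()  # file lands
watch_once(watched, seen, fired.append)   # handler fires once
watch_once(watched, seen, fired.append)   # already seen: no re-fire
print(fired)
```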

    • WASM on Lurk

    • Have the proof of running

    • As a compute requester, I should be able to select the style of running - trusted, provable, fast, etc.
