
Open Discussion with Lurk

David Aronchick edited this page Jan 25, 2022 · 1 revision

Raw Notes / Questions

  • Performance:
    • What performance task will we target (e.g. almost certainly not performant enough for ML requirements around weight transfers - but maybe?)
    • What performance/$ will be MVP for us?
    • What's the minimum size task where this is valuable?
    • What's the minimum time for the smallest possible task? e.g. 10 minutes? 1 hour?
    • What's the comparison for perf/$ vs. other platforms for equivalent job?
  • Core functionality:
    • Need to encode data correctly so that it can be broken up
  • Functionality extension:
    • Two-level orchestration - a job that then fires off additional jobs.

      • Corollary - how does an orchestration job identify miners who have subsets of the data
      • Corollary - sub-jobs will have to have their own job success criteria (e.g. potentially different timeouts, costs, etc.) and push results back to the original executor.
      • Corollary - sub-jobs will be paid subsets of the original cost to run, but only if their sub-work has been accepted. This means we will have to think about how to allow "acceptance" of the top-level job to trickle down to acceptance of sub-jobs, and a way to divvy up the original payment according to the amount of work each sub-miner did
        • To what extent are we dependent on a meta-deal-making apparatus to allow apparent pieces to go into multiple sectors
      • Corollary - Where do we store results (both intermediate results and end results) - push back onto the chain to start, but is that a good solution?
        • When we get to cross-miner (which will be required to deal with datasets larger than a single sector), we will have to understand how to share intermediate proofs.
        • Need to define a format for intermediate data plus the requisite state for continuing the computation.
      • Do we need to develop a structure for intermediate entities to be doing the sharding of the work (person A receives the job, uses an indexer to find the shards and does the work to split up and hand-off) - they should extract some value for doing this (even though they didn't do any of the 'actual' work)
        • If they don't see anyone bidding on the work, theoretically they could do the work themselves and charge for that.
      • Eventual flow for multi-miner orchestration:
        • Job layout: request → analysis → distributed computation → aggregation → client payment → contributor claim and reimbursement
        • Built-in audit trail - signature for everyone who contributed to it (allowing everyone who did work to get paid after the top-level job accepted - because then if you release the data AND it's used THEN you have proof that you used the data)
        • Also incentivizes folks to act quickly - beat other miners to the result - Because the 'winning fork' will be the one that gets paid.
          • If I calculate some results but don't release them fast enough, the final answer will be computed without relying on my contribution.
        • The trick is figuring out how to embed the attribution in the intermediate proofs such that it can be stripped out.
          • Will need to design the computations in such a way that a valid response must include this metadata.
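The payment-split and attribution corollaries above can be sketched as a minimal job tree. All names here (`Job`, `SubResult`, `settle`) are hypothetical illustrations, not an existing Lurk or Filecoin API:

```python
# Sketch of the two-level orchestration flow: sub-results carry a
# contributor identity (a signature in practice), and payment is split
# proportionally to accepted work once the top-level job is accepted.
from dataclasses import dataclass, field

@dataclass
class SubResult:
    miner: str        # contributor identity (would be a signature in practice)
    work_units: int   # measure of work done, used to divvy up payment
    payload: bytes    # intermediate result pushed back to the original executor

@dataclass
class Job:
    payment: int                      # total payment offered by the client
    results: list = field(default_factory=list)
    accepted: bool = False

    def accept(self):
        # Acceptance of the top-level job trickles down to sub-jobs.
        self.accepted = True

    def settle(self):
        # Pay each sub-miner a share proportional to the work it contributed.
        if not self.accepted:
            return {}
        total = sum(r.work_units for r in self.results)
        return {r.miner: self.payment * r.work_units // total
                for r in self.results}

job = Job(payment=100)
job.results += [SubResult("minerA", 3, b"..."), SubResult("minerB", 1, b"...")]
job.accept()
print(job.settle())  # {'minerA': 75, 'minerB': 25}
```

Note that `settle()` pays nothing until `accept()` runs, mirroring the "acceptance trickles down" requirement.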
    • "Data sets"

      • Need a first class way to address cross-sector data
      • How do we force data to be distributed among many miners (otherwise, there will be less benefit to a system like this)?
      • How do we allow selection of machine profile (I want this to execute on an accelerator/GPU)?
    • Compute

      • Possible structure: miners can verify Lurk proofs and, for smart contracts, minimally parse their data (which should ultimately be a subset of IPLD)
        • Put a contract on chain that says, I want to perform query Q on data D.
        • Q is a content-addressable expression corresponding to a Lurk program.
        • D is the root hash of some Lurk data (which happens to be stored in some number of sectors).
        • It could also be the case that this data exists outside of Filecoin. The Filecoin part might be cold storage for some other data.
        • But the client (and perhaps the chain itself) can know that some set of miners do have the data.
        • So at least that set of storage providers (miners) will be able to efficiently perform the query.
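The on-chain request described above can be sketched as a contract pinning content addresses for both Q and D, so the chain never stores the program or data itself. SHA-256 is a stand-in here; real Lurk expressions are content-addressed by their own commitment scheme, and all names are illustrative:

```python
# Sketch: a contract saying "I want to perform query Q on data D",
# where Q and D are both content addresses (hashes).
import hashlib, json

def cid(obj) -> str:
    # Stand-in content address: hash of a canonical encoding.
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

query_program = "(lambda (d) (filter even? d))"   # the Lurk program Q
data_root = cid([2, 4, 6, 8])                     # root hash of data D

contract = {
    "Q": cid(query_program),  # content-addressable expression
    "D": data_root,           # root of data stored across some sectors
    "payment": 10,
}
print(contract["Q"][:16], contract["D"][:16])
```

Any storage provider holding the data behind `D` can resolve both hashes locally and bid on the job.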
      • Sub miners claiming portions of payment
        • Setting aside the problem of 'delivery', anyone who can create a proof of correct response can then claim the payment.
        • The delivery problem is that the proof will just certify that some CID representing the response is correct.
      • How do we verify data was transmitted after computation
        • So the important point is that Lurk provides a mechanism for a kind of 'function-evaluation-addressable data' which can actually be trusted.
        • So in the same way that I can verify a CID and know that the data someone gives me really corresponds to what was intended by the person who gave me the CID…
        • I can verify that the result of a computation really is correct, even though the person who specified the new-kind-of-CID didn't know the result (so couldn't hash it to produce a digest).
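The 'function-evaluation-addressable data' idea above can be illustrated with a toy claim-and-check: the client pins (Q, D) without knowing the result, and the prover delivers a result plus evidence binding it to (Q, D). In real Lurk the evidence is a zk-SNARK, so the verifier does not re-run the program; this sketch recomputes instead, purely for illustration, and all names are hypothetical:

```python
# Toy illustration: a claim binds (Q, D, result-CID) together, so the
# requester can verify a result they could not have hashed in advance.
# A real verifier checks a succinct proof instead of re-evaluating.
import hashlib

def h(*parts: bytes) -> str:
    return hashlib.sha256(b"|".join(parts)).hexdigest()

def evaluate(q: str, d: list) -> list:
    # Stand-in evaluator for the Lurk program Q applied to data D.
    return [x for x in d if x % 2 == 0]

Q, D = "keep-even", [1, 2, 3, 4]
result = evaluate(Q, D)
result_cid = h(repr(result).encode())

# The prover's claim: "evaluating Q on D yields result_cid".
claim = h(Q.encode(), repr(D).encode(), result_cid.encode())

# Requester side: accept the delivered result iff it matches the claim.
assert h(Q.encode(), repr(D).encode(),
         h(repr(result).encode()).encode()) == claim
print("claim verified:", result)
```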
      • Could we build a global computational graph that would enable caching/storage of computation on Filecoin sectors - reducing redundant computations?
      • How do we leverage Filecoin more - e.g. Filecoin uses proof of replication for its intended purpose:
        • We could have two queries: certified and uncertified.
        • Uncertified means: here's the data I want to query. If you have it (all of it), go nuts!
        • Certified means:
          • This is a cross-sector query, and I know it will only be answered if adversarial parties cooperate.
          • In addition to proving the data has the root you requested, I must also prove that I possess that data in a Filecoin sector.
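The certified/uncertified split above can be sketched as a simple dispatch; the sector-possession proof is stubbed out and all names are illustrative:

```python
# Sketch of the two query modes: uncertified serves whenever the node
# has all the data; certified additionally demands evidence of
# possession in a Filecoin sector (stubbed here as a non-None token).
def handle_query(query, have_all_data, mode, sector_proof=None):
    if mode == "uncertified":
        # "If you have it (all of it), go nuts!"
        return have_all_data
    if mode == "certified":
        # Must also prove the data sits in a Filecoin sector.
        return have_all_data and sector_proof is not None
    raise ValueError(mode)

print(handle_query("Q", True, "uncertified"))          # True
print(handle_query("Q", True, "certified"))            # False: no proof
print(handle_query("Q", True, "certified", "PoRep"))   # True
```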
    • Partnerships

      • Should we partner with folks who want to develop a language for certifiable data (IPLD) interpretation and transformation for use in decentralized systems?
      • How coupled should this effort be with other chains and/or FVM?
    • Would it be useful to deploy a local test network, or would it be OK as a subset of the IPFS network?

      • Both have separate use cases
      • We should think about targeting low-trust environment, but allow for a spectrum
    • What is the way to describe the higher-level primitive that maps to the entire dataset

      • Imagine you had a compute job with many sequences along many data sets
      • Imagine shipping the entire compute job based on evaluating a condition of what data is here
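Shipping a job based on "a condition of what data is here" can be sketched as a locality check the node evaluates before accepting; all names are hypothetical:

```python
# Sketch: a node accepts and runs the whole compute job only if every
# required data shard is already local; otherwise it declines so a
# better-placed node can take it.
def ship_job(job, local_cids, required_cids):
    if set(required_cids) <= set(local_cids):
        return job(required_cids)
    return None  # decline; another node holding the data should run it

job = lambda cids: f"ran over {len(cids)} shards"
print(ship_job(job, {"cid1", "cid2", "cid3"}, ["cid1", "cid2"]))  # runs
print(ship_job(job, {"cid1"}, ["cid1", "cid2"]))                  # None
```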
    • Framework for thinking of this - invoke dynamic (from Java)

    • Should be executed as a layer 2 protocol

    • Make it work with Kubernetes (should have the ability to distribute to pods, not just nodes?)

    • Customer dataset built on-prem - clinical diagnostics

      • Works against existing data
      • USED TO:
        • Minio with blob store
        • Do puts into local host
        • Look up via hash
        • Docker container spawn to execute command
      • Company moved away from S3/Minio to IPFS because S3/Minio does not allow for lazy setup/changing of the cluster sizing
        • Lazy pulling of archived data
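The put/lookup/spawn workflow described above can be sketched as a content-addressed local blob store plus a container command. Paths, image name, and the docker invocation are illustrative only (the command is constructed, not executed):

```python
# Sketch of the former Minio-style flow: put blobs locally, look them
# up via hash, then spawn a container to execute compute over the blob.
import hashlib, os, tempfile

STORE = tempfile.mkdtemp()

def put(blob: bytes) -> str:
    # Content-addressed put: the blob's hash is its lookup key.
    digest = hashlib.sha256(blob).hexdigest()
    with open(os.path.join(STORE, digest), "wb") as f:
        f.write(blob)
    return digest

def get(digest: str) -> bytes:
    # Look up via hash.
    with open(os.path.join(STORE, digest), "rb") as f:
        return f.read()

def docker_cmd(digest: str) -> list:
    # Command a node agent might spawn to execute compute over the blob.
    # 'analysis-image' and 'process' are hypothetical placeholders.
    return ["docker", "run", "--rm",
            "-v", f"{STORE}:/data:ro",
            "analysis-image", "process", f"/data/{digest}"]

h = put(b"clinical-batch-001")
assert get(h) == b"clinical-batch-001"
print(" ".join(docker_cmd(h)))
```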
    • Alternates: containers or WASM running in place (not just in the local experience)

    • Value of provability:

      • I'm not running a data center
      • I want to make sure the storage provider (whom I don't trust) ran the binary that I handed them
      • HIPAA could encourage the use of zero-knowledge proofs
      • Alternative to proof is to use trusted environment (SGX) to execute compute
      • Plausibility in order:
        • Running a docker container
        • Need to run arbitrary compute - WASM & FVM is not acceptable for now
    • Core seems to be disk throughput and locality issue

    • Don't use IPLD to manage cross-sector nodes

    • Need to support WinCE - file lands on the WinCE node and then the compute gets executed

    • Node agent runs on every node and uses inotify (our job is the one that spun up the node sequencer, firing roughly 30 hours later)
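The node-agent pattern above can be sketched as a directory watcher that fires the compute job when a file lands. A real agent would use inotify; this stdlib polling loop is a portable stand-in, and all paths and handlers are illustrative:

```python
# Sketch of the per-node agent: watch a directory for newly landed
# files and invoke a handler for each one (inotify would push these
# events instead of polling).
import os, tempfile

def watch_once(directory: str, seen: set, on_new):
    # One polling pass over the directory, firing on unseen files.
    for name in os.listdir(directory):
        if name not in seen:
            seen.add(name)
            on_new(os.path.join(directory, name))

watched = tempfile.mkdtemp()
fired = []
seen = set()
open(os.path.join(watched, "sample.fastq"), "w").close()  # file lands
watch_once(watched, seen, fired.append)   # handler fires once
watch_once(watched, seen, fired.append)   # already seen: no re-fire
print(fired)
```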

    • WASM on Lurk

    • Have the proof of running

    • As a compute requester, I should be able to select the style of running - trusted, provable, fast, etc.
