Application: Data Processing Workflows on IPFS #58

Open
flyingzumwalt opened this issue Mar 17, 2017 · 18 comments

Comments

@flyingzumwalt

Work in progress - please contribute. See #40.

Essential Use Case: when running data processing/analysis workflows, use IPFS as the storage layer. This allows your workflows to be agnostic about where the data are stored -- pulling all the source data onto a local node before running a workflow is an optimisation choice that can be done on the fly with zero impact on the code. Likewise, the results of the workflows can be written to IPFS and moved around as needed without impacting the referential integrity of your data.
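As a rough illustration of the location-agnostic idea above, here is a minimal sketch in Python. It does not talk to IPFS at all: `ToyStore`, `count_words`, and the SHA-256 hex keys are hypothetical stand-ins for a node, a workflow step, and real CIDs. It only shows why addressing data by its content lets the same workflow code run wherever the bytes happen to be.

```python
import hashlib


class ToyStore:
    """In-memory stand-in for an IPFS node: content is keyed by its own hash."""

    def __init__(self):
        self._blocks = {}

    def add(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()  # toy key, not a real CID
        self._blocks[key] = data
        return key

    def cat(self, key: str) -> bytes:
        return self._blocks[key]


def count_words(store: ToyStore, input_key: str) -> str:
    """Workflow step: read input by key, write the result back, return its key."""
    text = store.cat(input_key).decode("utf-8")
    result = str(len(text.split())).encode("utf-8")
    return store.add(result)


laptop, cluster = ToyStore(), ToyStore()
data = b"the quick brown fox"
key_on_laptop = laptop.add(data)
key_on_cluster = cluster.add(data)

# Same bytes give the same key on either store, so the workflow code
# does not need to know where the data physically lives.
same_key = key_on_laptop == key_on_cluster
result_key = count_words(cluster, key_on_laptop)
print(same_key, cluster.cat(result_key).decode("utf-8"))  # True 4
```

The step only ever sees keys, so moving the input to another store (or pre-fetching it locally as an optimisation) changes nothing in `count_words`.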

@jbenet
Member

jbenet commented Apr 5, 2017

Data Processing Workflows on IPFS

Essential Use Case

  • use ipfs as a source or sink in data processing pipelines

Use Cases

Concrete

  • Alice wants to render all tiles in a world map, using hundreds of worker computers.
  • Bob wants to run calculations over a stream of geo-tagged events, generated by millions of users.
  • Charlie wants to make available a down-sampled, filtered, and versioned archive of astrophysics photography.
  • Dana wants to train a deep-learning object classifier for use in self-driving cars.
  • Eve wants to run speech-to-text transcription and NLP tagging over large volumes of audio data.
  • Faye wants to run tests (and collect results) for her distributed file system on thousands of machines around the world.

Groupings:

  • MapReduce / Hadoop use cases (IPFS takes the role of GFS or HDFS)
  • Spark use cases (IPFS takes the role of data source & sink)
  • Real-time data stream processing (ingest data into IPFS, process it as it appears)
  • Versioning of datasets processed
  • Storing and distributing datasets processed
  • P2P backbone for a distributed computing cluster
  • Scientists and Data Scientists running data experiments
    • in particular Machine Learning, Bioinformatics, Astronomy, etc.

Other:

  • IPFS can be source and sink for data
  • IPFS can store intermediate results
  • IPFS can deduplicate data, and possibly work (repeated computation)
  • IPFS can version all results
  • libp2p pubsub can be used to announce events
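One way to read the points about intermediate results and deduplicating work is sketched below. This is a toy under stated assumptions, not an IPFS feature: `blocks`, `results`, `add`, and `run_step` are hypothetical names, keys are plain SHA-256 hex digests rather than CIDs, and the cache is only valid if each step is deterministic.

```python
import hashlib

blocks = {}    # toy content-addressed storage: key -> bytes
results = {}   # cache: (step name, input key) -> output key
calls = []     # records which steps actually executed


def add(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    blocks[key] = data   # identical bytes land on the same key: data dedup
    return key


def run_step(name, func, input_key):
    """Run func on the stored input unless this exact work was done before."""
    cache_key = (name, input_key)
    if cache_key not in results:
        calls.append(name)
        results[cache_key] = add(func(blocks[input_key]))
    return results[cache_key]


raw_a = add(b"sensor readings")
raw_b = add(b"sensor readings")                # duplicate upload, same key
out_1 = run_step("upper", bytes.upper, raw_a)
out_2 = run_step("upper", bytes.upper, raw_b)  # reused, not recomputed
```

Because the input key identifies the exact bytes, the pair (step name, input key) can identify a unit of work, which is what makes skipping the second run safe in this sketch.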

Foundational Features + Functionality

  • ipfs basics (add, cat)
  • ipfs data importers (for better perf and dedup)
  • ipfs versioning
  • very high throughput & perf
  • libp2p pubsub (to announce events)
  • libp2p pluggable routing to have fast content-routing
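To make the link between importers, dedup, and versioning concrete, here is a small sketch of fixed-size chunking. It is an assumption-laden toy: `import_data`, `export_data`, and the 4-byte `CHUNK_SIZE` are invented for illustration, and real IPFS importers build Merkle DAGs with much larger chunks and other chunking strategies.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose so the example is easy to follow

chunks = {}  # toy block store: chunk hash -> chunk bytes


def import_data(data: bytes) -> list:
    """Split data into fixed-size chunks, store each once, return the chunk keys."""
    keys = []
    for start in range(0, len(data), CHUNK_SIZE):
        piece = data[start:start + CHUNK_SIZE]
        key = hashlib.sha256(piece).hexdigest()
        chunks[key] = piece
        keys.append(key)
    return keys


def export_data(keys: list) -> bytes:
    """Reassemble a dataset version from its ordered list of chunk keys."""
    return b"".join(chunks[k] for k in keys)


v1 = import_data(b"AAAABBBBCCCC")
v2 = import_data(b"AAAABBBBDDDD")  # new version: only the last chunk differs

shared = len(set(v1) & set(v2))
print(shared, len(chunks))  # 2 4
```

Two versions of three chunks each cost four stored chunks instead of six here, and each version remains retrievable from its own key list.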

Existing Projects + Organizations Working in this Area

A number of people and projects have expressed interest in this area:

  1. Our own test lab: https://github.com/ipfs/test-lab
  2. things like BrowserCloud, Pando, etc.
  3. things like Golem Project

@jbenet jbenet removed their assignment Apr 5, 2017
@0zAND1z

0zAND1z commented May 26, 2018

Read this very interesting architecture: https://www.cse.unsw.edu.au/~hpaik/thesis/showcases/16s2/scott_brisbane.pdf

Planning on building the full-scale services.

@saurabhdhupar

saurabhdhupar commented May 27, 2018

@0zAND1z

0zAND1z commented May 28, 2018

Thanks for adding the link to the full report

@akevy

akevy commented Jun 4, 2018

Great, we're trying

@echarles

@scottybrisbane do you have a public repo with your ipfs/hdfs integration?

@scottybrisbane

@echarles not just yet, although I am planning to post my work. It's very much a POC, but could be a good starting point for anyone wanting to get something going.

I'll update this thread when I post the code.

@echarles

Thx @scottybrisbane - a POC is perfectly fine. Once it's published, I expect contributors (like me) to try it out and help the code evolve. Without wanting to push you, any idea of the timeline? (Are we talking about days, weeks, or months before something is public?) Btw, if you're worried about incomplete features, imperfect code, or missing documentation, just push what you have and others will help; that's how open source works.

@scottybrisbane

@echarles I'm hoping to have it up within a few weeks.

@bo-liu

bo-liu commented Sep 30, 2018

Hey Folks, Any update on this topic? @scottybrisbane @echarles

@ajbouh

ajbouh commented Sep 30, 2018 via email

@bo-liu

bo-liu commented Oct 9, 2018

Hi, can I ask what sort of data processing people want to use IPFS for?

It really depends on what kind of data is stored inside IPFS (or which kind of data you are interested in).

@ajbouh

ajbouh commented Oct 9, 2018

Right, I'm looking to help make sure IPFS is a good fit for the kind of data processing people want to do.

Getting specific examples helps me ensure we're putting effort in the right places.

@bertrandfalguiere

@scottybrisbane Your work is really interesting! The use case is fascinating.
Do you know when you will be able to post your work?

@0zAND1z

0zAND1z commented Nov 17, 2018

The KIP team (@KIPFoundation) is working on a reference implementation that may align with Scotty's work.
More about our implementation of big data persistence will be added under section 7 here: https://kipfoundation.github.io/techprimer/7-Realm-Storage.html

Look forward to adding HDFS support together!

@Duske

Duske commented Apr 5, 2019

@scottybrisbane Great work! Are you going to publish your code? I would really like to make use of it in my own thesis, so if you need some help, just hit me up :)

@jessicaschilling

Note: Discussions on applications of IPFS are happening over in the IPFS Forums now; please continue the discussion there!

This issue is being moved over to the archived repo https://github.com/ipfs/apps/ for reference.

@hsanjuan hsanjuan transferred this issue from ipfs/ipfs Mar 27, 2020
