
Commit: Updating CUDA programs, Rust programs, Sparkov Data Generation, Custom Containers, Python Script, Scripting Bacalhau with Python
nataliyagaranovich committed Jan 26, 2024
1 parent 36bb1fc commit 324af3e
Showing 7 changed files with 282 additions and 166 deletions.
12 changes: 5 additions & 7 deletions docs/docs/setting-up/workload-onboarding/CUDA/index.md
@@ -5,8 +5,6 @@ sidebar_position: 10
# Run CUDA programs on Bacalhau



### What is CUDA

In this tutorial, we will look at how to run CUDA programs on Bacalhau. CUDA (Compute Unified Device Architecture) is an extension of C/C++ programming. It is a parallel computing platform and programming model created by NVIDIA. It helps developers speed up their applications by harnessing the power of GPU accelerators.
@@ -51,10 +49,8 @@ wget -P inputs https://raw.githubusercontent.com/tristanpenman/cuda-examples/mas
1. **`00-hello-world.cu`**:

```bash
# View the contents of the standard C++ program
!cat inputs/00-hello-world.cu

# Measure the time it takes to compile and run the program
%%timeit
@@ -66,8 +62,6 @@ This example represents a standard C++ program that inefficiently utilizes GPU resources.
2. **`02-cuda-hello-world-faster.cu`**:

```bash
# View the contents of the CUDA program with vector addition
!cat inputs/02-cuda-hello-world-faster.cu

@@ -116,6 +110,10 @@ Note that there is `;` between the commands:
`./outputs/hello`: execute the `hello` binary.
You can combine compilation and execution commands.

:::info
Note that the CUDA version will need to be compatible with the graphics card on the host machine.
:::
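The combine-compile-and-run pattern with `;` can also be scripted; below is a hedged Python sketch that only assembles the shell line (it assumes `nvcc` and the source file exist on the machine that eventually runs it):

```python
import shlex
import subprocess

def compile_and_run(source: str, binary: str) -> str:
    """Build a single shell line that compiles a CUDA source file with nvcc
    and then executes the resulting binary, mirroring the `compile ; run`
    pattern shown above."""
    compile_cmd = f"nvcc {shlex.quote(source)} -o {shlex.quote(binary)}"
    run_cmd = shlex.quote(binary)
    # `;` runs the second command regardless of whether the first succeeded;
    # use `&&` instead if the binary should only run after a successful compile.
    return f"{compile_cmd} ; {run_cmd}"

line = compile_and_run("inputs/02-cuda-hello-world-faster.cu", "./outputs/hello")
print(line)
# To actually execute it (requires the CUDA toolkit):
# subprocess.run(line, shell=True, check=False)
```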

When a job is submitted, Bacalhau prints out the related `job_id`. We store that in an environment variable so that we can reuse it later on:

```python
%env JOB_ID={job_id}
```
@@ -5,26 +5,21 @@ sidebar_position: 11
# Generate Synthetic Data using Sparkov Data Generation technique



## Introduction

A synthetic dataset is generated by algorithms or simulations which has similar characteristics to real-world data. Collecting real-world data, especially data that contains sensitive user data like credit card information, is not always possible due to security and privacy concerns. If a data scientist needs to train a model to detect credit fraud, they can use synthetically generated data instead of using real data without compromising the privacy of users.

The advantage of using Bacalhau is that you can generate terabytes of synthetic data without having to install any dependencies or store the data locally.

In this example, we will learn how to run Bacalhau on a synthetic dataset. We will generate synthetic credit card transaction data using the Sparkov program and store the results in IPFS.

### Prerequisite

To get started, you need to install the Bacalhau client; see more information [here](../../../getting-started/installation.md).

## 1. Running Sparkov Locally

To run Sparkov locally, you'll need to clone the repo and install dependencies:



@@ -34,40 +29,40 @@ git clone https://github.com/js-ts/Sparkov_Data_Generation/
pip3 install -r Sparkov_Data_Generation/requirements.txt
```

Go to the `Sparkov_Data_Generation` directory:


```python
%cd Sparkov_Data_Generation
```

Create a temporary directory (`outputs`) to store the outputs:


```bash
%%bash
mkdir ../outputs
```

## 2. Running the script

```bash
%%bash
python3 datagen.py -n 1000 -o ../outputs "01-01-2022" "10-01-2022"
```

The command above executes the Python script `datagen.py`, passing the following arguments to it:

`-n 1000`: Number of customers to generate

`-o ../outputs`: path to store the outputs

`"01-01-2022"`: Start date

`"10-01-2022"`: End date

Thus, this command uses a Python script to generate synthetic credit card transaction data for the period from `01-01-2022` to `10-01-2022` and saves the results in the `../outputs` directory.

To see the full list of options, use:

@@ -77,18 +72,14 @@ To see the full list of options, use:
python datagen.py -h
```
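The invocation above can also be driven from Python; a small sketch that assembles the argv for `datagen.py` (the argument values are the ones used in this example, and actually running it requires the Sparkov repo and its dependencies):

```python
import subprocess

def datagen_command(n_customers: int, output_dir: str,
                    start_date: str, end_date: str) -> list:
    """Assemble the argv for Sparkov's datagen.py, matching the
    parameters described above."""
    return [
        "python3", "datagen.py",
        "-n", str(n_customers),   # number of customers to generate
        "-o", output_dir,         # where the generated data is written
        start_date, end_date,     # positional start/end dates
    ]

cmd = datagen_command(1000, "../outputs", "01-01-2022", "10-01-2022")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # run it from inside Sparkov_Data_Generation
```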

## 3. Containerize Script using Docker

:::info
You can skip this entirely and directly go to running on Bacalhau.
:::

If you want any additional dependencies to be installed along with the data generator, you need to build your own container.

To build your own docker container, create a `Dockerfile`, which contains instructions to build your image:


```
FROM python:3.8
RUN apt update && apt install git
@@ -100,31 +91,33 @@ WORKDIR /Sparkov_Data_Generation/
RUN pip3 install -r requirements.txt
```

These commands specify how the image will be built, and what extra requirements will be included. We use `python:3.8` as the base image, install `git`, clone the `Sparkov_Data_Generation` repository from GitHub, set the working directory inside the container to `/Sparkov_Data_Generation/`, and install Python dependencies listed in the `requirements.txt` file.

:::info
See more information on how to containerize your script/app [here](https://docs.docker.com/get-started/02_our_app/)
:::


### Build the container

We will run the `docker build` command to build the container:

```
docker build -t <hub-user>/<repo-name>:<tag> .
```

Before running the command, replace:

**`hub-user`** with your Docker Hub username. If you don’t have a Docker Hub account, [follow these instructions to create one](https://docs.docker.com/docker-id/), and use the username of the account you created

**`repo-name`** with the name of the container; you can name it anything you want

**`tag`** this is not required, but you can use the `latest` tag

In our case:

```
docker build -t jsacex/sparkov-data-generation .
```
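The `<hub-user>/<repo-name>:<tag>` reference and the build command can be composed programmatically; a hedged Python sketch (the helper names are illustrative, not part of any API):

```python
def image_reference(hub_user: str, repo_name: str, tag: str = "latest") -> str:
    """Build a Docker Hub image reference of the form hub-user/repo-name:tag."""
    return f"{hub_user}/{repo_name}:{tag}"

def docker_build_command(reference: str, context: str = ".") -> list:
    # Note the trailing build context (`.`) -- omitting it is a common
    # cause of `docker build` failing with an argument error.
    return ["docker", "build", "-t", reference, context]

ref = image_reference("jsacex", "sparkov-data-generation")
print(" ".join(docker_build_command(ref)))
```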

### Push the container
@@ -144,55 +137,50 @@ docker push jsacex/sparkov-data-generation

After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau.

## 4. Running a Bacalhau Job


Now we're ready to run a Bacalhau job:


```bash
%%bash --out job_id
bacalhau docker run \
--id-only \
--wait \
jsacex/sparkov-data-generation \
-- python3 datagen.py -n 1000 -o ../outputs "01-01-2022" "10-01-2022"
```

### Structure of the command

Let's look closely at the command above:

`bacalhau docker run`: call to Bacalhau

`jsacex/sparkov-data-generation`: the name of the docker image we are using

`-- python3 datagen.py -n 1000 -o ../outputs "01-01-2022" "10-01-2022"`: the arguments passed into the container, specifying the execution of the Python script `datagen.py` with specific parameters, such as the amount of data, output path, and time range.

When a job is submitted, Bacalhau prints out the related `job_id`. We store that in an environment variable so that we can reuse it later on:


```python
%env JOB_ID={job_id}
```
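Outside a notebook, the same capture-and-reuse pattern can be sketched with `subprocess` (a sketch; it assumes the `bacalhau` CLI is installed and that `--id-only` makes stdout contain just the job ID, and the sample ID below is purely illustrative):

```python
import os
import subprocess

def capture_job_id(raw_stdout: str) -> str:
    """Normalize the stdout of `bacalhau docker run --id-only`
    into a clean job ID string."""
    return raw_stdout.strip()

# Hypothetical run; requires a reachable Bacalhau network:
# result = subprocess.run(
#     ["bacalhau", "docker", "run", "--id-only", "--wait",
#      "jsacex/sparkov-data-generation", "--",
#      "python3", "datagen.py", "-n", "1000", "-o", "../outputs",
#      "01-01-2022", "10-01-2022"],
#     capture_output=True, text=True, check=True)
# job_id = capture_job_id(result.stdout)

job_id = capture_job_id("  87056d3b-6b21-4e14-9c0a-0bb4a2fa5e61\n")  # sample value
os.environ["JOB_ID"] = job_id  # equivalent of `%env JOB_ID={job_id}`
print(os.environ["JOB_ID"])
```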

## 5. Checking the State of your Jobs

**Job status**: You can check the status of the job using `bacalhau list`.


```bash
%%bash
bacalhau list --id-filter ${JOB_ID}
```

When it says `Published` or `Completed`, that means the job is done, and we can get the results.
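A minimal polling sketch of this check (assumes the `bacalhau` CLI is installed; the terminal states tested are the ones mentioned above):

```python
import subprocess
import time

DONE_STATES = {"Published", "Completed"}

def is_done(state: str) -> bool:
    """Return True once a job has reached a terminal, successful state."""
    return state in DONE_STATES

# Hypothetical polling loop; `bacalhau list --id-filter` prints the job's row:
# while True:
#     out = subprocess.run(["bacalhau", "list", "--id-filter", job_id],
#                          capture_output=True, text=True).stdout
#     if any(is_done(token) for token in out.split()):
#         break
#     time.sleep(2)

print(is_done("Completed"), is_done("Queued"))
```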

**Job information**: You can find out more information about your job by using `bacalhau describe`.



@@ -201,23 +189,25 @@

```bash
bacalhau describe ${JOB_ID}
```

**Job download**: You can download your job results directly by using `bacalhau get`. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (`results`) and downloaded our job output to be stored in that directory.


```bash
%%bash
rm -rf results && mkdir -p results
bacalhau get ${JOB_ID} --output-dir results
```

After the download has finished, you should see the following contents in the results directory.

## 6. Viewing your Job Output

To view the contents of the current directory, run the following command:


```bash
%%bash
ls results/outputs
```

## Support
If you have questions or need support or guidance, please reach out to the [Bacalhau team via Slack](https://bacalhauproject.slack.com/ssb/redirect) (**#general** channel).
@@ -5,16 +5,13 @@ description: How to use the Bacalhau Docker image
---
# Bacalhau Docker Image



This documentation explains how to use the Bacalhau Docker image to run tasks and manage them using the Bacalhau client.

## Prerequisites

To get started, you need to install the Bacalhau client (see more information [here](../../../getting-started/installation.md)) and Docker.

## 1. Pull the Bacalhau Docker image

The first step is to pull the Bacalhau Docker image from the [Github container registry](https://github.com/orgs/bacalhau-project/packages/container/package/bacalhau).

@@ -55,27 +52,32 @@ v1.2.0 v1.2.0

## 3. Running a Bacalhau Job

In the example below, an Ubuntu-based job runs to print the message 'Hello from Docker Bacalhau!':

```shell
docker run -t ghcr.io/bacalhau-project/bacalhau:latest \
  docker run \
  --id-only \
  --wait \
  ubuntu:latest \
  -- sh -c 'uname -a && echo "Hello from Docker Bacalhau!"'
```


### Structure of the command

`ghcr.io/bacalhau-project/bacalhau:latest`: Name of the Bacalhau Docker image

`--id-only`: Output only the job id

`--wait`: Wait for the job to finish

`ubuntu:latest`: Ubuntu container

`--`: Separate Bacalhau parameters from the command to be executed inside the container

`sh -c 'uname -a && echo "Hello from Docker Bacalhau!"'`: The command executed inside the container
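Putting the pieces above together, the full invocation can be composed programmatically; a hedged Python sketch (the helper name is illustrative, and the image and message are the ones from this example):

```python
def bacalhau_in_docker_command(client_image: str, job_image: str, job_args: list) -> list:
    """Compose `docker run <bacalhau image> docker run ... <job image> -- <args>`:
    the outer `docker run` starts the Bacalhau client container, and everything
    after the client image name is handed to the Bacalhau CLI inside it."""
    return (["docker", "run", "-t", client_image,
             "docker", "run", "--id-only", "--wait", job_image, "--"]
            + job_args)

cmd = bacalhau_in_docker_command(
    "ghcr.io/bacalhau-project/bacalhau:latest",
    "ubuntu:latest",
    ["sh", "-c", 'uname -a && echo "Hello from Docker Bacalhau!"'],
)
print(cmd)
```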
