
Commit: Updating CUDA programs, Rust programs, Sparkov Data Generation, Custom Containers, Python Script, Scripting Bacalhau with Python
nataliyagaranovich committed Jan 26, 2024
1 parent 36bb1fc commit 324af3e
Showing 7 changed files with 282 additions and 166 deletions.
12 changes: 5 additions & 7 deletions docs/docs/setting-up/workload-onboarding/CUDA/index.md
@@ -5,8 +5,6 @@ sidebar_position: 10
# Run CUDA programs on Bacalhau



### What is CUDA

In this tutorial, we will look at how to run CUDA programs on Bacalhau. CUDA (Compute Unified Device Architecture) is an extension of C/C++ programming. It is a parallel computing platform and programming model created by NVIDIA. It helps developers speed up their applications by harnessing the power of GPU accelerators.
@@ -51,10 +49,8 @@ wget -P inputs https://raw.githubusercontent.com/tristanpenman/cuda-examples/mas
1. **`00-hello-world.cu`**:

```bash
# View the contents of the standard C++ program
!cat inputs/00-hello-world.cu

# Measure the time it takes to compile and run the program
%%timeit
@@ -66,8 +62,6 @@ This example represents a standard C++ program that inefficiently utilizes GPU resources.
2. **`02-cuda-hello-world-faster.cu`**:

```bash
# View the contents of the CUDA program with vector addition
!cat inputs/02-cuda-hello-world-faster.cu

@@ -116,6 +110,10 @@ Note that there is `;` between the commands:
`./outputs/hello`: execute the `hello` binary.
You can combine compilation and execution commands.

:::info
Note that the CUDA version will need to be compatible with the graphics card on the host machine.
:::
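The combine-compile-and-run pattern with `;` can also be scripted; below is a hedged Python sketch that only assembles the shell line (it assumes `nvcc` and the source file exist on the machine that eventually runs it):

```python
import shlex
import subprocess

def compile_and_run(source: str, binary: str) -> str:
    """Build a single shell line that compiles a CUDA source file with nvcc
    and then executes the resulting binary, mirroring the `compile ; run`
    pattern shown above."""
    compile_cmd = f"nvcc {shlex.quote(source)} -o {shlex.quote(binary)}"
    run_cmd = shlex.quote(binary)
    # `;` runs the second command regardless of whether the first succeeded;
    # use `&&` instead if the binary should only run after a successful compile.
    return f"{compile_cmd} ; {run_cmd}"

line = compile_and_run("inputs/02-cuda-hello-world-faster.cu", "./outputs/hello")
print(line)
# To actually execute it (requires the CUDA toolkit):
# subprocess.run(line, shell=True, check=False)
```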

When a job is submitted, Bacalhau prints out the related `job_id`. We store that in an environment variable so that we can reuse it later on:

```python
%env JOB_ID={job_id}
```
@@ -5,26 +5,21 @@ sidebar_position: 11
# Generate Synthetic Data using Sparkov Data Generation technique



## Introduction

A synthetic dataset is generated by algorithms or simulations which has similar characteristics to real-world data. Collecting real-world data, especially data that contains sensitive user data like credit card information, is not always possible due to security and privacy concerns. If a data scientist needs to train a model to detect credit fraud, they can use synthetically generated data instead of using real data without compromising the privacy of users.

The advantage of using Bacalhau is that you can generate terabytes of synthetic data without having to install any dependencies or store the data locally.

In this example, we will learn how to run Bacalhau on a synthetic dataset. We will generate synthetic credit card transaction data using the Sparkov program and store the results in IPFS.

### Prerequisite

To get started, you need to install the Bacalhau client; see more information [here](../../../getting-started/installation.md).

## 1. Running Sparkov Locally

To run Sparkov locally, you'll need to clone the repo and install dependencies:



@@ -34,40 +29,40 @@ git clone https://github.com/js-ts/Sparkov_Data_Generation/
pip3 install -r Sparkov_Data_Generation/requirements.txt
```

Go to the `Sparkov_Data_Generation` directory:


```python
%cd Sparkov_Data_Generation
```

Create a temporary directory (`outputs`) to store the outputs:


```bash
%%bash
mkdir ../outputs
```

## 2. Running the script

```bash
%%bash
python3 datagen.py -n 1000 -o ../outputs "01-01-2022" "10-01-2022"
```

The command above executes the Python script `datagen.py`, passing the following arguments to it:

`-n 1000`: Number of customers to generate

`-o ../outputs`: path to store the outputs

`"01-01-2022"`: Start date

`"10-01-2022"`: End date

Thus, this command uses a Python script to generate synthetic credit card transaction data for the period from `01-01-2022` to `10-01-2022` and saves the results in the `../outputs` directory.

To see the full list of options, use:

@@ -77,18 +72,14 @@ To see the full list of options, use:
python datagen.py -h
```
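The invocation above can also be driven from Python; a small sketch that assembles the argv for `datagen.py` (the argument values are the ones used in this example, and actually running it requires the Sparkov repo and its dependencies):

```python
import subprocess

def datagen_command(n_customers: int, output_dir: str,
                    start_date: str, end_date: str) -> list:
    """Assemble the argv for Sparkov's datagen.py, matching the
    parameters described above."""
    return [
        "python3", "datagen.py",
        "-n", str(n_customers),   # number of customers to generate
        "-o", output_dir,         # where the generated data is written
        start_date, end_date,     # positional start/end dates
    ]

cmd = datagen_command(1000, "../outputs", "01-01-2022", "10-01-2022")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # run it from inside Sparkov_Data_Generation
```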

## 3. Containerize Script using Docker

:::info
You can skip this entirely and directly go to running on Bacalhau.
:::

If you want any additional dependencies to be installed along with the data generator, you need to build your own container.

To build your own docker container, create a `Dockerfile`, which contains instructions to build your image:


```
FROM python:3.8
RUN apt update && apt install git
@@ -100,31 +91,33 @@ WORKDIR /Sparkov_Data_Generation/
RUN pip3 install -r requirements.txt
```

These commands specify how the image will be built, and what extra requirements will be included. We use `python:3.8` as the base image, install `git`, clone the `Sparkov_Data_Generation` repository from GitHub, set the working directory inside the container to `/Sparkov_Data_Generation/`, and install Python dependencies listed in the `requirements.txt` file.

:::info
See more information on how to containerize your script/app [here](https://docs.docker.com/get-started/02_our_app/)
:::


### Build the container

We will run the `docker build` command to build the container:

```
docker build -t <hub-user>/<repo-name>:<tag> .
```

Before running the command, replace:

**`hub-user`** with your Docker Hub username. If you don’t have a Docker Hub account, [follow these instructions to create one](https://docs.docker.com/docker-id/), and use the username of the account you created

**`repo-name`** with the name of the container; you can name it anything you want

**`tag`** this is not required, but you can use the `latest` tag

In our case:

```
docker build -t jsacex/sparkov-data-generation .
```
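The `<hub-user>/<repo-name>:<tag>` reference and the build command can be composed programmatically; a hedged Python sketch (the helper names are illustrative, not part of any API):

```python
def image_reference(hub_user: str, repo_name: str, tag: str = "latest") -> str:
    """Build a Docker Hub image reference of the form hub-user/repo-name:tag."""
    return f"{hub_user}/{repo_name}:{tag}"

def docker_build_command(reference: str, context: str = ".") -> list:
    # Note the trailing build context (`.`) -- omitting it is a common
    # cause of `docker build` failing with an argument error.
    return ["docker", "build", "-t", reference, context]

ref = image_reference("jsacex", "sparkov-data-generation")
print(" ".join(docker_build_command(ref)))
```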

### Push the container
@@ -144,55 +137,50 @@ docker push jsacex/sparkov-data-generation

After the repo image has been pushed to Docker Hub, we can now use the container for running on Bacalhau.

## 4. Running a Bacalhau Job


Now we're ready to run a Bacalhau job:


```bash
%%bash --out job_id
bacalhau docker run \
--id-only \
--wait \
jsacex/sparkov-data-generation \
-- python3 datagen.py -n 1000 -o ../outputs "01-01-2022" "10-01-2022"
```

### Structure of the command

Let's look closely at the command above:

`bacalhau docker run`: call to Bacalhau

`jsacex/sparkov-data-generation`: the name of the docker image we are using

`-- python3 datagen.py -n 1000 -o ../outputs "01-01-2022" "10-01-2022"`: the arguments passed into the container, specifying the execution of the Python script `datagen.py` with specific parameters, such as the amount of data, output path, and time range.

When a job is submitted, Bacalhau prints out the related `job_id`. We store that in an environment variable so that we can reuse it later on:


```python
%env JOB_ID={job_id}
```
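Outside a notebook, the same capture-and-reuse pattern can be sketched with `subprocess` (a sketch; it assumes the `bacalhau` CLI is installed and that `--id-only` makes stdout contain just the job ID, and the sample ID below is purely illustrative):

```python
import os
import subprocess

def capture_job_id(raw_stdout: str) -> str:
    """Normalize the stdout of `bacalhau docker run --id-only`
    into a clean job ID string."""
    return raw_stdout.strip()

# Hypothetical run; requires a reachable Bacalhau network:
# result = subprocess.run(
#     ["bacalhau", "docker", "run", "--id-only", "--wait",
#      "jsacex/sparkov-data-generation", "--",
#      "python3", "datagen.py", "-n", "1000", "-o", "../outputs",
#      "01-01-2022", "10-01-2022"],
#     capture_output=True, text=True, check=True)
# job_id = capture_job_id(result.stdout)

job_id = capture_job_id("  87056d3b-6b21-4e14-9c0a-0bb4a2fa5e61\n")  # sample value
os.environ["JOB_ID"] = job_id  # equivalent of `%env JOB_ID={job_id}`
print(os.environ["JOB_ID"])
```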

## 5. Checking the State of your Jobs

**Job status**: You can check the status of the job using `bacalhau list`.


```bash
%%bash
bacalhau list --id-filter ${JOB_ID}
```

When it says `Published` or `Completed`, that means the job is done, and we can get the results.
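A minimal polling sketch of this check (assumes the `bacalhau` CLI is installed; the terminal states tested are the ones mentioned above):

```python
import subprocess
import time

DONE_STATES = {"Published", "Completed"}

def is_done(state: str) -> bool:
    """Return True once a job has reached a terminal, successful state."""
    return state in DONE_STATES

# Hypothetical polling loop; `bacalhau list --id-filter` prints the job's row:
# while True:
#     out = subprocess.run(["bacalhau", "list", "--id-filter", job_id],
#                          capture_output=True, text=True).stdout
#     if any(is_done(token) for token in out.split()):
#         break
#     time.sleep(2)

print(is_done("Completed"), is_done("Queued"))
```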

**Job information**: You can find out more information about your job by using `bacalhau describe`.



@@ -201,23 +189,25 @@

```bash
bacalhau describe ${JOB_ID}
```

**Job download**: You can download your job results directly by using `bacalhau get`. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory (`results`) and downloaded our job output to be stored in that directory.


```bash
%%bash
rm -rf results && mkdir -p results
bacalhau get ${JOB_ID} --output-dir results
```

After the download has finished, you should see the following contents in the results directory.

## 6. Viewing your Job Output

To view the contents of the current directory, run the following command:


```bash
%%bash
ls results/outputs
```

## Support
If you have questions or need support or guidance, please reach out to the [Bacalhau team via Slack](https://bacalhauproject.slack.com/ssb/redirect) (**#general** channel).
@@ -5,16 +5,13 @@ description: How to use the Bacalhau Docker image
---
# Bacalhau Docker Image



This documentation explains how to use the Bacalhau Docker image to run tasks and manage them using the Bacalhau client.

## Prerequisites

To get started, you need to install the Bacalhau client (see more information [here](../../../getting-started/installation.md)) and Docker.

## 1. Pull the Bacalhau Docker image

The first step is to pull the Bacalhau Docker image from the [Github container registry](https://github.com/orgs/bacalhau-project/packages/container/package/bacalhau).

@@ -55,27 +52,32 @@ v1.2.0 v1.2.0

## 3. Running a Bacalhau Job

In the example below, an Ubuntu-based job runs to print the message 'Hello from Docker Bacalhau!':

```shell
docker run -t ghcr.io/bacalhau-project/bacalhau:latest \
  docker run \
  --id-only \
  --wait \
  ubuntu:latest \
  -- sh -c 'uname -a && echo "Hello from Docker Bacalhau!"'
```


### Structure of the command

`ghcr.io/bacalhau-project/bacalhau:latest`: Name of the Bacalhau Docker image

`--id-only`: Output only the job id

`--wait`: Wait for the job to finish

`ubuntu:latest`: Ubuntu container

`--`: Separate Bacalhau parameters from the command to be executed inside the container

`sh -c 'uname -a && echo "Hello from Docker Bacalhau!"'`: The command executed inside the container
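Putting the pieces above together, the full invocation can be composed programmatically; a hedged Python sketch (the helper name is illustrative, and the image and message are the ones from this example):

```python
def bacalhau_in_docker_command(client_image: str, job_image: str, job_args: list) -> list:
    """Compose `docker run <bacalhau image> docker run ... <job image> -- <args>`:
    the outer `docker run` starts the Bacalhau client container, and everything
    after the client image name is handed to the Bacalhau CLI inside it."""
    return (["docker", "run", "-t", client_image,
             "docker", "run", "--id-only", "--wait", job_image, "--"]
            + job_args)

cmd = bacalhau_in_docker_command(
    "ghcr.io/bacalhau-project/bacalhau:latest",
    "ubuntu:latest",
    ["sh", "-c", 'uname -a && echo "Hello from Docker Bacalhau!"'],
)
print(cmd)
```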
