
Commit

updated readme & reqs, prep for public, fix code style
Cornul11 committed Apr 8, 2024
1 parent 5947182 commit f6c854f
Showing 15 changed files with 244 additions and 284 deletions.
10 changes: 9 additions & 1 deletion .gitignore
@@ -1,3 +1,11 @@
# local files
mariadb_data/
mongodb_data/
projects/
projects_metadata/
evaluation/
paths/

*.env
*.properties
*.cnf
@@ -71,4 +79,4 @@ target/
*.tar.gz
*.rar

node_modules
104 changes: 53 additions & 51 deletions README.md
@@ -1,88 +1,90 @@
# JarSift

## Setup

This project requires a functioning MariaDB database. Connection details for this database should be provided in
a `config.properties` file, located at the root of the project. It's essential that an empty database exists prior to
initiating the process (this can be achieved by running the database initialisation procedure).
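
As a quick sanity check that the server is reachable before going further, you can connect with the standard client — the host, port, and user below are illustrative and should match your own setup:

```bash
# Should connect and list the server's databases (use the credentials you configured)
mysql --host 127.0.0.1 --port 3306 --user root -p -e "SHOW DATABASES;"
```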

The `config.properties` file should be based on the `config.properties.example` template found in the project root.
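
For illustration only, a minimal `config.properties` sketch — the key names here are assumptions in the spirit of a typical JDBC setup, so copy the exact names from `config.properties.example`:

```properties
# Hypothetical keys -- use the names from config.properties.example
database.url=jdbc:mariadb://127.0.0.1:3306/corpus
database.username=root
database.password=changeme
```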

Similarly, rename the `.env.example` file to `.env` and populate it with the respective values.

Lastly, rename the `my-custom.cnf.example` file to `my-custom.cnf` and fill in the appropriate details fitting your
environment.
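
As a reference, a minimal `my-custom.cnf` sketch — the option names are standard MariaDB settings and the values are purely illustrative, not project requirements:

```ini
[mysqld]
# Illustrative values only; size them to the resources of your host
innodb_buffer_pool_size=2G
max_connections=200
max_allowed_packet=256M
```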

## Execution

There are two key processes in the execution of the project: Corpus Creation and Inference. A separate Evaluation workflow is described at the end of this document.


### Corpus Creation

The following command creates the paths files that are used to seed the database:

```bash
find /path/to/your/local/.m2/repo \( -name "*.jar" -fprint jar_files.txt \) -o \( -name "*.pom" -fprint pom_files.txt \)
```

After the paths files have been created, follow the steps below to seed the database:

1. Run `docker compose up db`.
2. Wait for the internal database initialisation to complete.
3. Once completed, you can terminate the container.
4. Fill in the `PATHS_FILE` environment variable in the `docker-compose.yml` file or the `.env` file with the path to
   the `jar_files.txt` file created earlier (see the `.env` sketch below).
5. Proceed by running `docker compose up`.

It's crucial to follow this sequence. Prematurely running `docker compose up` may result in the application failing due
to an unprepared database connection.

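A minimal `.env` sketch for the seeding run — only `PATHS_FILE` is named in the steps above; the path is an example, and any further entries should follow whatever `.env.example` lists:

```properties
# Path to the paths file produced by the find command above
PATHS_FILE=/absolute/path/to/jar_files.txt
# Other entries from .env.example (e.g. database credentials) go here as well
```
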
### Inference

To execute the inference segment, you need a running MongoDB instance seeded with the necessary data. The data can be
found in the `data` directory. To seed the MongoDB database:

```bash
# Create the MongoDB container
docker compose up mongodb

# You may use the existing all.zip file, or retrieve the latest data by running the following command (ensure you have gsutil installed)
gsutil cp gs://osv-vulnerabilities/Maven/all.zip .

# preferably in a venv
cd util
pip install -r requirements.txt
python import.py all.zip extracted
```
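
To sanity-check the import, you can list the databases from inside the container — this assumes the service is named `mongodb` in `docker-compose.yml` (as in the command above) and that the image ships the `mongosh` shell:

```bash
# Lists databases so you can confirm the imported data is present
docker compose exec mongodb mongosh --quiet --eval "db.adminCommand('listDatabases')"
```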

When executing the inference segment, ensure:

1. The corpus database is operational and seeded with the necessary data.
2. The MongoDB instance is operational, accessible, and seeded with the necessary data.
3. Appropriate connection credentials are set in `config.properties`.

For verification, execute the following command from the project root:

```bash
sh run_inference.sh <path_to_jar>
```

To export the corpus database to an SQL file for use in SQLite:

```bash
mysqldump \
--host 127.0.0.1 \
--user=root --password \
--skip-create-options \
--compatible=ansi \
--skip-extended-insert \
--compact \
--single-transaction \
--no-create-db \
--no-create-info \
--hex-blob \
--skip-quote-names corpus \
| grep -a "^INSERT INTO" | grep -a -v "__diesel_schema_migrations" \
| sed 's#\\"#"#gm' \
| sed -sE "s#,0x([^,]*)#,X'\L\1'#gm" \
> mysql-to-sqlite.sql
```

To import the SQL file into SQLite:

```bash
sqlite3 corpus.db
> CREATE TABLE IF NOT EXISTS libraries (id INTEGER PRIMARY KEY AUTOINCREMENT, group_id TEXT NOT NULL, artifact_id TEXT NOT NULL, version TEXT NOT NULL, jar_hash INTEGER NOT NULL, jar_crc INTEGER NOT NULL, is_uber_jar INTEGER NOT NULL, disk_size INTEGER NOT NULL, total_class_files INTEGER NOT NULL, unique_signatures INTEGER NOT NULL);
> CREATE TABLE IF NOT EXISTS signatures (id INTEGER PRIMARY KEY AUTOINCREMENT, library_id INTEGER NOT NULL, class_hash TEXT NOT NULL, class_crc INTEGER NOT NULL);
> PRAGMA synchronous = OFF;
> PRAGMA journal_mode = MEMORY;
> PRAGMA auto_vacuum=OFF;
> PRAGMA index_journal=OFF;
> PRAGMA temp_store=MEMORY;
> PRAGMA cache_size=-256000;
```

## Evaluation

For the evaluation segment, you must ensure that the corpus database is operational and seeded with the necessary data.

To generate the evaluation data, execute the following command from the project root:

```bash
sh run_generator.sh <jars per config> <max dependencies per jar>
```

This generates the Uber JARs and their respective metadata, runs the evaluation process, and outputs the results to the
`evaluation` directory.

If you have already generated the evaluation data and wish to re-run the evaluation process, execute the following
command from the project root:

```bash
sh run_evaluation.sh <evaluation data directory>
```
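
For example, a small illustrative run — the numbers and the directory name are placeholders, not recommended settings:

```bash
# Generate 10 Uber JARs per configuration, each with at most 5 dependencies
sh run_generator.sh 10 5

# Re-run only the evaluation step on previously generated data
sh run_evaluation.sh evaluation/
```
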
82 changes: 0 additions & 82 deletions script.sh

This file was deleted.

Binary file modified util/all.zip
Binary file not shown.
7 changes: 6 additions & 1 deletion util/build_table.py
@@ -13,7 +13,12 @@ def generate_latex_table(folder_path):
latex_table.append(r"Configuration & Threshold & Precision & Recall & F1 Score \\")
latex_table.append(r"\midrule")

for configuration in ["Relocation Disabled", "Relocation Enabled", "Minimize Jar Disabled", "Minimize Jar Enabled"]:
for configuration in [
"Relocation Disabled",
"Relocation Enabled",
"Minimize Jar Disabled",
"Minimize Jar Enabled",
]:
latex_table.append(f"{configuration} & & & & \\\\")
for threshold in thresholds:
filename = os.path.join(folder_path, f"stats_{threshold}.json")
21 changes: 13 additions & 8 deletions util/clean_json_files.py
@@ -3,26 +3,28 @@
import re
import sys


def clean_group_id(text):
# remove ANSI escape codes
text = re.sub(r"\x1B[@-_][0-?]*[ -/]*[@-~]", "", text)
# remove additional non-alphanumeric characters and text
text = re.sub(r"\[INFO\]\s+", "", text)
return text.strip()


def clean_json_file(file_path, output_path):
with open(file_path, "r") as file:
data = json.load(file)

if "effectiveDependencies" in data:
for dep in data["effectiveDependencies"]:
dep["groupId"] = clean_group_id(dep["groupId"])

with open(output_path, "w") as file:
json.dump(data, file, indent=2)
print(f"Cleaned and saved to {output_path}")


def main(directory_path):
if not os.path.exists(directory_path):
print("The provided directory does not exist.")
@@ -31,11 +33,14 @@ def main(directory_path):
for file_name in os.listdir(directory_path):
if file_name.endswith(".json"):
file_path = os.path.join(directory_path, file_name)
output_path = os.path.join(
directory_path, file_name.replace(".json", "_cleaned.json")
)
clean_json_file(file_path, output_path)


if __name__ == "__main__":
if len(sys.argv) != 2:
print("Usage: " + sys.argv[0] + " <directory path>")
else:
main(sys.argv[1])
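
A possible invocation from the project root, matching the usage string printed by the script (the directory name is just an example):

```bash
python util/clean_json_files.py projects_metadata/
```
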
2 changes: 0 additions & 2 deletions util/collect_most_popular_pkgs.py
@@ -36,5 +36,3 @@ def get_top_libraries(n):
libraries = get_top_libraries(int(sys.argv[1]))
with open("libraries.json", "w") as file:
json.dump(libraries, file, indent=4)
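
Judging from the snippet above, the script takes the number of libraries to fetch as its only argument and writes the result to `libraries.json`; a sketch of a run with an arbitrary count:

```bash
python util/collect_most_popular_pkgs.py 100
```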


