
Commit

updated readme & reqs, prep for public, fix code style
Cornul11 committed Apr 8, 2024
1 parent 5947182 commit f6c854f
Showing 15 changed files with 244 additions and 284 deletions.
10 changes: 9 additions & 1 deletion .gitignore
@@ -1,3 +1,11 @@
# local files
mariadb_data/
mongodb_data/
projects/
projects_metadata/
evaluation/
paths/

*.env
*.properties
*.cnf
@@ -71,4 +79,4 @@ target/
*.tar.gz
*.rar

node_modules
104 changes: 53 additions & 51 deletions README.md
@@ -1,88 +1,90 @@
# JarSift

## Setup

This project requires a functioning MariaDB database. Connection details for this database should be provided in
a `config.properties` file, located at the root of the project. It's essential that an empty database exists prior to
initiating the process (this can be achieved by running the database initialisation procedure).
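
As a quick sanity check that the server is reachable before going further, you can connect with the standard client — the host, port, and user below are illustrative and should match your own setup:

```bash
# Should connect and list the server's databases (use the credentials you configured)
mysql --host 127.0.0.1 --port 3306 --user root -p -e "SHOW DATABASES;"
```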

The `config.properties` file should be based on the `config.properties.example` template found in the project root.
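
For illustration only, a minimal `config.properties` sketch — the key names here are assumptions in the spirit of a typical JDBC setup, so copy the exact names from `config.properties.example`:

```properties
# Hypothetical keys -- use the names from config.properties.example
database.url=jdbc:mariadb://127.0.0.1:3306/corpus
database.username=root
database.password=changeme
```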

Similarly, rename the `.env.example` file to `.env` and populate it with the respective values.

Lastly, rename the `my-custom.cnf.example` file to `my-custom.cnf` and fill in the appropriate details fitting your
environment.
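
As a reference, a minimal `my-custom.cnf` sketch — the option names are standard MariaDB settings and the values are purely illustrative, not project requirements:

```ini
[mysqld]
# Illustrative values only; size them to the resources of your host
innodb_buffer_pool_size=2G
max_connections=200
max_allowed_packet=256M
```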

## Execution

There are two key processes in the execution of the project: Corpus Creation and Inference. A separate Evaluation workflow is described at the end of this document.


### Corpus Creation

The following command creates the paths files that are used to seed the database:

```bash
find /path/to/your/local/.m2/repo \( -name "*.jar" -fprint jar_files.txt \) -o \( -name "*.pom" -fprint pom_files.txt \)
```

After the paths files have been created, follow the steps below to seed the database:

1. Run `docker compose up db`.
2. Wait for the internal database initialisation to complete.
3. Once completed, you can terminate the container.
4. Fill in the `PATHS_FILE` environment variable in the `docker-compose.yml` file or the `.env` file with the path to
   the `jar_files.txt` file created earlier (see the `.env` sketch below).
5. Proceed by running `docker compose up`.

It's crucial to follow this sequence. Prematurely running `docker compose up` may result in the application failing due
to an unprepared database connection.

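A minimal `.env` sketch for the seeding run — only `PATHS_FILE` is named in the steps above; the path is an example, and any further entries should follow whatever `.env.example` lists:

```properties
# Path to the paths file produced by the find command above
PATHS_FILE=/absolute/path/to/jar_files.txt
# Other entries from .env.example (e.g. database credentials) go here as well
```
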
### Inference

To execute the inference segment, you need a running MongoDB instance seeded with the necessary data. The data can be
found in the `data` directory. To seed the MongoDB database:

```bash
# Create the MongoDB container
docker compose up mongodb

# You may use the existing all.zip file, or retrieve the latest data by running the following command (ensure you have gsutil installed)
gsutil cp gs://osv-vulnerabilities/Maven/all.zip .

# preferably in a venv
cd util
pip install -r requirements.txt
python import.py all.zip extracted
```
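
To sanity-check the import, you can list the databases from inside the container — this assumes the service is named `mongodb` in `docker-compose.yml` (as in the command above) and that the image ships the `mongosh` shell:

```bash
# Lists databases so you can confirm the imported data is present
docker compose exec mongodb mongosh --quiet --eval "db.adminCommand('listDatabases')"
```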

When executing the inference segment, ensure:

1. The corpus database is operational and seeded with the necessary data.
2. The MongoDB instance is operational, accessible, and seeded with the necessary data.
3. Appropriate connection credentials are set in `config.properties`.

For verification, execute the following command from the project root:

```bash
sh run_inference.sh <path_to_jar>
```

To export the corpus database to an SQL file for use in SQLite:

```bash
mysqldump \
--host 127.0.0.1 \
--user=root --password \
--skip-create-options \
--compatible=ansi \
--skip-extended-insert \
--compact \
--single-transaction \
--no-create-db \
--no-create-info \
--hex-blob \
--skip-quote-names corpus \
| grep -a "^INSERT INTO" | grep -a -v "__diesel_schema_migrations" \
| sed 's#\\"#"#gm' \
| sed -sE "s#,0x([^,]*)#,X'\L\1'#gm" \
> mysql-to-sqlite.sql
```

To import the SQL file into SQLite:

```bash
sqlite3 corpus.db
> CREATE TABLE IF NOT EXISTS libraries (id INTEGER PRIMARY KEY AUTOINCREMENT, group_id TEXT NOT NULL, artifact_id TEXT NOT NULL, version TEXT NOT NULL, jar_hash INTEGER NOT NULL, jar_crc INTEGER NOT NULL, is_uber_jar INTEGER NOT NULL, disk_size INTEGER NOT NULL, total_class_files INTEGER NOT NULL, unique_signatures INTEGER NOT NULL);
> CREATE TABLE IF NOT EXISTS signatures (id INTEGER PRIMARY KEY AUTOINCREMENT, library_id INTEGER NOT NULL, class_hash TEXT NOT NULL, class_crc INTEGER NOT NULL);
> PRAGMA synchronous = OFF;
> PRAGMA journal_mode = MEMORY;
> PRAGMA auto_vacuum=OFF;
> PRAGMA index_journal=OFF;
> PRAGMA temp_store=MEMORY;
> PRAGMA cache_size=-256000;
```

## Evaluation

For the evaluation segment, you must ensure that the corpus database is operational and seeded with the necessary data.

To generate the evaluation data, execute the following command from the project root:

```bash
sh run_generator.sh <jars per config> <max dependencies per jar>
```

This generates the Uber JARs and their respective metadata, runs the evaluation process, and outputs the results to the
`evaluation` directory.

If you have already generated the evaluation data and wish to re-run the evaluation process, execute the following
command from the project root:

```bash
sh run_evaluation.sh <evaluation data directory>
```
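
For example, a small illustrative run — the numbers and the directory name are placeholders, not recommended settings:

```bash
# Generate 10 Uber JARs per configuration, each with at most 5 dependencies
sh run_generator.sh 10 5

# Re-run only the evaluation step on previously generated data
sh run_evaluation.sh evaluation/
```
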
82 changes: 0 additions & 82 deletions script.sh

This file was deleted.

Binary file modified util/all.zip
Binary file not shown.
7 changes: 6 additions & 1 deletion util/build_table.py
@@ -13,7 +13,12 @@ def generate_latex_table(folder_path):
latex_table.append(r"Configuration & Threshold & Precision & Recall & F1 Score \\")
latex_table.append(r"\midrule")

for configuration in ["Relocation Disabled", "Relocation Enabled", "Minimize Jar Disabled", "Minimize Jar Enabled"]:
for configuration in [
"Relocation Disabled",
"Relocation Enabled",
"Minimize Jar Disabled",
"Minimize Jar Enabled",
]:
latex_table.append(f"{configuration} & & & & \\\\")
for threshold in thresholds:
filename = os.path.join(folder_path, f"stats_{threshold}.json")
21 changes: 13 additions & 8 deletions util/clean_json_files.py
@@ -3,26 +3,28 @@
import re
import sys


def clean_group_id(text):
# remove ANSI escape codes
text = re.sub(r"\x1B[@-_][0-?]*[ -/]*[@-~]", "", text)
# remove additional non-alphanumeric characters and text
text = re.sub(r"\[INFO\]\s+", "", text)
return text.strip()


def clean_json_file(file_path, output_path):
with open(file_path, "r") as file:
data = json.load(file)

if "effectiveDependencies" in data:
for dep in data["effectiveDependencies"]:
dep["groupId"] = clean_group_id(dep["groupId"])

with open(output_path, "w") as file:
json.dump(data, file, indent=2)
print(f"Cleaned and saved to {output_path}")


def main(directory_path):
if not os.path.exists(directory_path):
print("The provided directory does not exist.")
@@ -31,11 +33,14 @@ def main(directory_path):
for file_name in os.listdir(directory_path):
if file_name.endswith(".json"):
file_path = os.path.join(directory_path, file_name)
output_path = os.path.join(
directory_path, file_name.replace(".json", "_cleaned.json")
)
clean_json_file(file_path, output_path)


if __name__ == "__main__":
if len(sys.argv) != 2:
print("Usage: " + sys.argv[0] + " <directory path>")
else:
main(sys.argv[1])
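
A possible invocation from the project root, matching the usage string printed by the script (the directory name is just an example):

```bash
python util/clean_json_files.py projects_metadata/
```
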
2 changes: 0 additions & 2 deletions util/collect_most_popular_pkgs.py
@@ -36,5 +36,3 @@ def get_top_libraries(n):
libraries = get_top_libraries(int(sys.argv[1]))
with open("libraries.json", "w") as file:
json.dump(libraries, file, indent=4)
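
Judging from the snippet above, the script takes the number of libraries to fetch as its only argument and writes the result to `libraries.json`; a sketch of a run with an arbitrary count:

```bash
python util/collect_most_popular_pkgs.py 100
```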


