Merge pull request #259 from guy-singer/examples-updates
guy-singer authored Aug 29, 2023
2 parents 785667b + acf3dfb commit 64eecac
Showing 6 changed files with 146 additions and 62 deletions.
16 changes: 8 additions & 8 deletions examples/analyzing-hf-datasets.ipynb
@@ -18,13 +18,13 @@
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/analyzing-hf-datasets.ipynb)\n",
"[![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/analyzing-hf-datasets.ipynb)\n",
"\n",
"This notebook shows how you can use fastdup to analyze any datasets from [Hugging Face Datasets](https://huggingface.co/docs/datasets/index).\n",
"This notebook shows how you can use fastdup to analyze any dataset from [Hugging Face Datasets](https://huggingface.co/docs/datasets/index).\n",
"\n",
"We will analyze an image classification dataset for:\n",
"\n",
"+ Duplicates / near-duplicates.\n",
"+ Outliers.\n",
"+ Wrong labels."
"+ Duplicates / near-duplicates\n",
"+ Outliers\n",
"+ Wrong labels"
]
},
{
@@ -202,7 +202,7 @@
"id": "61b315c3",
"metadata": {},
"source": [
"## Get labels mapping\n",
"## Get Labels Mapping\n",
"\n",
"Tiny ImageNet follows the original ImageNet class names. Let's download the class mappings `classes.py`."
]
@@ -257,7 +257,7 @@
"id": "edb6463d",
"metadata": {},
"source": [
"Now we can get the class names by providing the class id. For example"
"Now we can get the class names by providing the class ID. For example:"
]
},
{
@@ -362,7 +362,7 @@
"source": [
"## Load Annotations\n",
"\n",
"To load the image labels into fastdup we need to prepare a DataFrame with the following column\n",
"To load the image labels into fastdup we need to prepare a DataFrame with the following column:\n",
"+ `filename`\n",
"+ `label`\n",
"+ `split`\n"
@@ -606,7 +606,7 @@
"id": "1017106b",
"metadata": {},
"source": [
"There are several methods we can use to inspect the issues found\n",
"There are several methods we can use to inspect the issues found:\n",
"\n",
"```python\n",
"fd.vis.duplicates_gallery() # create a visual gallery of duplicates\n",
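The labels-mapping step this notebook describes can be sketched in a few lines. The dictionary entries below are illustrative stand-ins for the contents of the downloaded `classes.py` file, not its exact contents:

```python
# Minimal sketch of mapping Tiny ImageNet class IDs (WordNet IDs) to
# human-readable names. The entries are illustrative samples; the real
# mapping comes from the downloaded classes file.
class_names = {
    "n01443537": "goldfish",
    "n01629819": "European fire salamander",
    "n01641577": "bullfrog",
}

def id_to_name(class_id: str) -> str:
    # Fall back to the raw ID if it is missing from the mapping.
    return class_names.get(class_id, class_id)

print(id_to_name("n01443537"))  # goldfish
```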
26 changes: 13 additions & 13 deletions examples/analyzing-image-classification-dataset.ipynb
@@ -23,13 +23,13 @@
"\n",
"This notebook shows how you can use [fastdup](https://github.com/visual-layer/fastdup) to analyze an image classification dataset for:\n",
"\n",
"+ Duplicates.\n",
"+ Outliers.\n",
"+ Wrong labels.\n",
"+ Image clusters.\n",
"+ Duplicates\n",
"+ Outliers\n",
"+ Wrong labels\n",
"+ Image clusters\n",
"\n",
"\n",
"> **Note** - No GPU needed! You can run on an instance with only CPU.\n",
"> **Note** - No GPU needed! You can run this notebook on a CPU-only instance.\n",
"\n"
]
},
@@ -214,7 +214,7 @@
"id": "8aba34e1",
"metadata": {},
"source": [
"Load the annotation provided with the dataset."
"Load the annotations provided with the dataset."
]
},
{
@@ -328,13 +328,13 @@
"id": "dfc957bf",
"metadata": {},
"source": [
"Transform the annotation to fastdup supported format.\n",
"Transform the annotations to fastdup supported format.\n",
"\n",
"fastdup expects an annotation `DataFrame` that contains the following column:\n",
"\n",
"+ filename - contains the path to the image file.\n",
"+ label - contains a label of the image.\n",
"+ split - whether the image is subset of the training, validation or test dataset."
"+ filename - contains the path to the image file\n",
"+ label - contains a label of the image\n",
"+ split - whether the image is subset of the training, validation or test dataset"
]
},
{
@@ -510,16 +510,16 @@
"source": [
"## Run fastdup\n",
"\n",
"With the images and annotations, we are now ready to run an analysis."
"With the images and annotations ready, we can proceed with running an analysis on the data."
]
},
{
"cell_type": "markdown",
"id": "0a39243e",
"metadata": {},
"source": [
"+ `input_dir` is the path to the downloaded images.\n",
"+ `work_dir` is the path to store the artifacts from the analysis. Optional."
"+ `input_dir` is the path to the downloaded images\n",
"+ `work_dir` is the path to store the artifacts from the analysis (optional)"
]
},
{
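Concretely, the annotation `DataFrame` described in this notebook can be sketched as follows. The file names and labels are placeholders, not entries from the actual dataset:

```python
import pandas as pd

# A minimal sketch of the annotation DataFrame fastdup expects:
# one row per image, with `filename`, `label`, and `split` columns.
annotations = pd.DataFrame(
    [
        {"filename": "images/cat_001.jpg", "label": "cat", "split": "train"},
        {"filename": "images/dog_042.jpg", "label": "dog", "split": "train"},
        {"filename": "images/cat_107.jpg", "label": "cat", "split": "valid"},
    ]
)

print(annotations["split"].value_counts().to_dict())  # {'train': 2, 'valid': 1}
```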
26 changes: 13 additions & 13 deletions examples/analyzing-kaggle-datasets.ipynb
@@ -18,7 +18,7 @@
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/analyzing-kaggle-datasets.ipynb)\n",
"[![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/analyzing-kaggle-datasets.ipynb)\n",
"\n",
"This notebook shows how you can use [fastdup](https://github.com/visual-layer/fastdup) to analyze any computer vision datasets from [Kaggle](https://kaggle.com)."
"This notebook shows how you can use [fastdup](https://github.com/visual-layer/fastdup) to analyze any computer vision dataset from [Kaggle](https://kaggle.com)."
]
},
{
@@ -28,7 +28,7 @@
"source": [
"## Install Kaggle API\n",
"\n",
"To load data programmatically from Kaggle we will need to install the [Kaggle API](https://github.com/Kaggle/kaggle-api). The API lets us pull data from Kaggle using Python.\n",
"To load data programmatically from Kaggle, we will need to install the [Kaggle API](https://github.com/Kaggle/kaggle-api). The API lets us pull data from Kaggle using Python.\n",
"\n",
"To install the API, run:"
]
@@ -48,21 +48,21 @@
"id": "eb3fd4c9-bdfb-4ba9-aef6-528d9811b588",
"metadata": {},
"source": [
"To use the Kaggle API, sign up for a Kaggle account at https://www.kaggle.com/ . \n",
"Note: to use the Kaggle API, you'll need to sign up for a Kaggle account at https://www.kaggle.com/ . \n",
"\n",
"Then, go to the 'Account' tab and select 'Create API Token'. This will trigger the download of `kaggle.json`, a file containing your API credentials. \n",
"Go to the 'Account' tab and select 'Create API Token'. This will trigger the download of `kaggle.json`, a file containing your API credentials. \n",
"\n",
"Place this file in the location `~/.kaggle/kaggle.json` (on Windows in the location `C:\\Users\\<Windows-username>\\.kaggle\\kaggle.json`\n",
"Place this file in the location `~/.kaggle/kaggle.json` (on Windows in the location `C:\\Users\\<Windows-username>\\.kaggle\\kaggle.json`)\n",
"\n",
"Read more [here](https://github.com/Kaggle/kaggle-api#api-credentials)."
"Fore more information on the Kaggle API, click [here](https://github.com/Kaggle/kaggle-api#api-credentials)."
]
},
{
"cell_type": "markdown",
"id": "4b6ae131-572a-4008-a9c1-6f49b21c029e",
"metadata": {},
"source": [
"If the set up is done correctly, you should be able to run the kaggle commands on your terminal. For instance, to list kaggle datasets that has the term \"computer vision\" , run:"
"If the set up is done correctly, you should be able to run the kaggle commands on your terminal. For instance, to list kaggle datasets that have the term \"computer vision\" , run:"
]
},
{
@@ -132,7 +132,7 @@
"id": "622ed625-8e11-4e39-85ed-ae2faf3320a8",
"metadata": {},
"source": [
"Let's say we're interested to analyze the [RVL-CDIP Test Dataset](https://www.kaggle.com/datasets/pdavpoojan/the-rvlcdip-dataset-test). You can head to the dataset page and click on Copy API command and paste it in your terminal.\n",
"Let's say we're interested to analyze the [RVL-CDIP Test Dataset](https://www.kaggle.com/datasets/pdavpoojan/the-rvlcdip-dataset-test). You can head to the dataset page and click on \"Copy API command\" and paste it in your terminal.\n",
"\n",
"![image.png](attachment:4ea6f203-55bd-4ca7-817d-ad2a16721ed0.png)"
]
@@ -142,7 +142,7 @@
"id": "dfa2560e-a945-4683-a841-48b9f29f5fa8",
"metadata": {},
"source": [
"Let's run the command here which would trigger a download of the RVL-CDIP test dataset into your current working directory."
"Let's run the command here, which will trigger a download of the RVL-CDIP test dataset into our current working directory."
]
},
{
@@ -162,7 +162,7 @@
"source": [
"Once done, we should have a `the-rvlcdip-dataset-test.zip` in the current directory.\n",
"\n",
"Let's unzip the file for further analysis with fastdup in the next section."
"Let's unzip the file to prepare it for further analysis with fastdup in the next section."
]
},
{
@@ -180,7 +180,7 @@
"id": "1f8d6b66-3f53-4afb-b040-c5d91a628608",
"metadata": {},
"source": [
"Once completed, we should have a folder with the name `test/` which contains all the images from the dataset."
"Once completed, we should have a folder with the name `test/`, which contains all the images from the dataset."
]
},
{
@@ -253,7 +253,7 @@
"id": "a10910f4-b772-400b-96b6-f44b62b97fe0",
"metadata": {},
"source": [
"To run fastdup, we only need to point `input_dir` to the folder containing images from the dataset."
"To run fastdup, we simply point `input_dir` to the folder containing the images from the dataset."
]
},
{
@@ -336,7 +336,7 @@
"id": "01b2cace-9e25-48bf-826b-907bad037df9",
"metadata": {},
"source": [
"From the summary above, we have 1 corrupted image. Let's get some more details with:"
"From the summary above, we have 1 corrupted image. Let's get some more details:"
]
},
{
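Before running any kaggle commands, a small sanity check like the following can confirm that the credentials file is where the client expects it. This is a sketch: the `KAGGLE_CONFIG_DIR` override is taken from the Kaggle API documentation, and the fallback is the default `~/.kaggle` location:

```python
import os
from pathlib import Path

def kaggle_credentials_path() -> Path:
    """Return where the Kaggle client looks for kaggle.json.

    The KAGGLE_CONFIG_DIR environment variable, when set, overrides
    the default ~/.kaggle directory (per the Kaggle API docs)."""
    config_dir = os.environ.get("KAGGLE_CONFIG_DIR")
    if config_dir:
        return Path(config_dir) / "kaggle.json"
    return Path(os.path.expanduser("~")) / ".kaggle" / "kaggle.json"

credentials = kaggle_credentials_path()
if not credentials.exists():
    print(f"Place your API token at: {credentials}")
```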
36 changes: 18 additions & 18 deletions examples/analyzing-object-detection-dataset.ipynb
@@ -19,7 +19,7 @@
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb)\n",
"[![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb)\n",
"\n",
"In this tutorial, we will analyze an object detection dataset with bounding boxes and identify potential issues. By the end of the notebook, you'll discover how to load a COCO format bounding box annotations into fastdup and inspect the dataset for issues at the bounding box level."
"In this tutorial, we will analyze an object detection dataset with bounding boxes and identify potential issues. By the end of the notebook, you'll discover how to load COCO-format bounding box annotations into fastdup and inspect the dataset for issues at the bounding box level."
]
},
{
@@ -100,7 +100,7 @@
},
"source": [
"## Download Dataset\n",
"We will be using the [COCO minitrain](https://github.com/giddyyupp/coco-minitrain) dataset for this tutorial. COCO minitrain is a curated mini training set with about 25,000 images or 20% of the original [COCO dataset](https://cocodataset.org/#home).\n",
"We will be using the [COCO minitrain](https://github.com/giddyyupp/coco-minitrain) dataset for this tutorial. COCO minitrain is a curated mini training set with about 25,000 images, or 20% of the original [COCO dataset](https://cocodataset.org/#home).\n",
"\n",
"Let's download the dataset into our local drive."
]
@@ -128,9 +128,9 @@
},
"source": [
"## Load Annotations\n",
"fastdup expects the annotations to be in a specific format.\n",
"fastdup requires the annotations to be in a specific format.\n",
"\n",
"We will use a simple converter to convert the COCO format JSON annotation file into the fastdup annotation dataframe. This converter is applicable to any dataset which uses COCO format."
"We will use a simple converter to convert the COCO format JSON annotation file into the fastdup annotation dataframe. This converter is applicable to any dataset that uses the COCO format."
]
},
{
@@ -272,7 +272,7 @@
"source": [
"## Run fastdup\n",
"\n",
"Run fastdup with annotations on the dataset. If you're running on a free Google Colab instance, specify the `num_images` to limit the run to fewer images."
"Run fastdup with annotations on the dataset. If you're running on a free Google Colab instance, you may want to specify `num_images` to limit the run to fewer images."
]
},
{
@@ -293,10 +293,10 @@
"id": "3b4f5823"
},
"source": [
"## Class distribution\n",
"The dataset contains 25k images and 183k objects, an average of 7.3 objects per image. \n",
"## Class Distribution\n",
"The dataset contains 25k images and 183k objects, for an average of 7.3 objects per image. \n",
"\n",
"Interestingly, we see a highly unbalanced class distribution, where all 80 coco classes are present here, but there is a strong balance towards the person class, that accounts for over 56k instances (30.6%). Car and Chair classes also contain over 8k instances each, while at the bottom of the list the toaster and hair drier classes contain as few as 40 instances. \n",
"Interestingly, we see a highly unbalanced class distribution. All 80 COCO classes are present here, but the distribution of classes is strongly skewed towards the person class, which accounts for over 56k instances (30.6%). Car and Chair classes also contain over 8k instances each, while the toaster and hair drier classes contain as few as 40 instances. \n",
"\n",
"Using `Plotly` we get a useful interactive histogram. "
]
@@ -184820,7 +184820,7 @@
},
"source": [
"## Outliers\n",
"Visualize outliers from the run."
"Using fastdup's gallery feature, we can visualize outliers from the run."
]
},
{
@@ -185963,8 +185963,8 @@
"id": "c0f1fade"
},
"source": [
"## Size and shape issues\n",
"Objects come in various shapes and sizes, and sometimes objects might be incorrectly labeled or too small to be useful. We will now find the smallest, narrowest and widest objects, and asses their usefulness. "
"## Size and Shape Issues\n",
"Objects come in various shapes and sizes, and sometimes objects might be incorrectly labeled, or too small to be useful. We will now find the smallest, narrowest and widest objects, and assess their usefulness. "
]
},
{
@@ -186382,7 +186382,7 @@
"id": "da5709c9-297e-47cd-9d13-599c4c76a883",
"metadata": {},
"source": [
"Let's visualize here how the top 3 smallest images look like.\n",
"Let's visualize what the 3 smallest images look like.\n",
"\n",
"The following image is labeled as a `person` in the dataset."
]
@@ -186475,7 +186475,7 @@
"id": "a5a1f0b1-7a85-46bd-a8c1-ebaad837c85f",
"metadata": {},
"source": [
"Considering the image size, we can hardly tell if the label is correct."
"Considering the image size, it is difficult to discern if the label is correct."
]
},
{
@@ -186789,7 +186789,7 @@
"id": "9af6979b"
},
"source": [
"Look at that! The slices reveal many items that are either tiny (10x10 pixels) or have extreme aspect ratios!"
"Using fastdup, we have discovered many items that are either tiny (10x10 pixels) or have extreme aspect ratios!"
]
},
{
@@ -186799,7 +186799,7 @@
"source": [
"## Bad Bounding Boxes\n",
"\n",
"Bounding boxes that are either too small or go beyond image boundaries are flagged as bad bounding box in fastdup.\n",
"Bounding boxes that are either too small or go beyond image boundaries are flagged as a bad bounding box in fastdup.\n",
"\n",
"We can get a list of bad bounding boxes by reading the `atrain_features.bad.csv` file."
]
@@ -186972,7 +186972,7 @@
"id": "6ea5ddca-4c6f-4f92-afd6-32467ed3a437",
"metadata": {},
"source": [
"We have found 18,592 (!) bounding boxes that are either too small or go beyond image boundaries. This is 10% of the data! Filtering them would both save us grusome debugging of training errors and failures and help up provide the model with useful size objects."
"We have found 18,592 (!) bounding boxes that are either too small or go beyond image boundaries. This is 10% of the data! Filtering them out would both save us gruesome debugging of training errors or failures, and help up provide the model with useful size objects."
]
},
{
@@ -186983,9 +186983,9 @@
},
"source": [
"## Possible Mislabels\n",
"The fastdup similarity search and gallery is a strong tool for finding objects that are possibly mislabeled. By finding each object's nearest neighbors and their classes, we can find objects with classes contradicting their neighbors' - a strong sign for mislabels.\n",
"The fastdup similarity search and similarity gallery are strong tools for finding objects that are possibly mislabeled. By finding each object's nearest neighbors and their classes, we can find objects with classes contradicting their neighbors' (a strong sign of mislabels).\n",
"\n",
"Running similarity gallery shows if an image has high similarity with two of its closest neighbors yet has different labels. This helps surface potential mislabeling in the dataset. "
"Running the similarity gallery shows if an image has high similarity with two of its closest neighbors, yet has different labels. This helps surface potential mislabeling in the dataset. "
]
},
{
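The COCO-to-fastdup converter this tutorial mentions can be sketched as below. The tiny `coco` dict stands in for the real annotation JSON, and the bounding-box column names (`col_x`, `row_y`, `width`, `height`) follow the tutorial's annotation layout but should be checked against the fastdup documentation:

```python
import pandas as pd

# A simplified sketch of converting COCO-format annotations into a
# per-bounding-box DataFrame. The `coco` dict below is a stand-in for
# json.load() on the real annotation file.
coco = {
    "images": [{"id": 1, "file_name": "000000001.jpg"}],
    "annotations": [
        # COCO bboxes are [x, y, width, height] in pixels.
        {"image_id": 1, "category_id": 18, "bbox": [10.0, 20.0, 50.0, 40.0]},
    ],
    "categories": [{"id": 18, "name": "dog"}],
}

# Lookup tables: image id -> file name, category id -> class name.
images = {img["id"]: img["file_name"] for img in coco["images"]}
names = {cat["id"]: cat["name"] for cat in coco["categories"]}

rows = []
for ann in coco["annotations"]:
    x, y, w, h = ann["bbox"]
    rows.append(
        {
            "filename": images[ann["image_id"]],
            "col_x": x,   # column names assumed from the tutorial,
            "row_y": y,   # not fastdup's authoritative schema
            "width": w,
            "height": h,
            "label": names[ann["category_id"]],
        }
    )

df = pd.DataFrame(rows)
print(df.iloc[0]["label"])  # dog
```

Because the lookups are built once, the loop stays linear in the number of annotations, which matters for the ~183k objects in COCO minitrain.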