Merge pull request #259 from guy-singer/examples-updates
guy-singer authored Aug 29, 2023
2 parents 785667b + acf3dfb commit 64eecac
Showing 6 changed files with 146 additions and 62 deletions.
16 changes: 8 additions & 8 deletions examples/analyzing-hf-datasets.ipynb
@@ -18,13 +18,13 @@
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/analyzing-hf-datasets.ipynb)\n",
"[![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/analyzing-hf-datasets.ipynb)\n",
"\n",
"This notebook shows how you can use fastdup to analyze any datasets from [Hugging Face Datasets](https://huggingface.co/docs/datasets/index).\n",
"This notebook shows how you can use fastdup to analyze any dataset from [Hugging Face Datasets](https://huggingface.co/docs/datasets/index).\n",
"\n",
"We will analyze an image classification dataset for:\n",
"\n",
"+ Duplicates / near-duplicates.\n",
"+ Outliers.\n",
"+ Wrong labels."
"+ Duplicates / near-duplicates\n",
"+ Outliers\n",
"+ Wrong labels"
]
},
{
@@ -202,7 +202,7 @@
"id": "61b315c3",
"metadata": {},
"source": [
"## Get labels mapping\n",
"## Get Labels Mapping\n",
"\n",
"Tiny ImageNet follows the original ImageNet class names. Let's download the class mappings `classes.py`."
]
@@ -257,7 +257,7 @@
"id": "edb6463d",
"metadata": {},
"source": [
"Now we can get the class names by providing the class id. For example"
"Now we can get the class names by providing the class ID. For example:"
]
},
{
@@ -362,7 +362,7 @@
"source": [
"## Load Annotations\n",
"\n",
"To load the image labels into fastdup we need to prepare a DataFrame with the following column\n",
"To load the image labels into fastdup we need to prepare a DataFrame with the following column:\n",
"+ `filename`\n",
"+ `label`\n",
"+ `split`\n"
@@ -606,7 +606,7 @@
"id": "1017106b",
"metadata": {},
"source": [
"There are several methods we can use to inspect the issues found\n",
"There are several methods we can use to inspect the issues found:\n",
"\n",
"```python\n",
"fd.vis.duplicates_gallery() # create a visual gallery of duplicates\n",
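The labels-mapping step this notebook describes can be sketched in a few lines. The dictionary entries below are illustrative stand-ins for the contents of the downloaded `classes.py` file, not its exact contents:

```python
# Minimal sketch of mapping Tiny ImageNet class IDs (WordNet IDs) to
# human-readable names. The entries are illustrative samples; the real
# mapping comes from the downloaded classes file.
class_names = {
    "n01443537": "goldfish",
    "n01629819": "European fire salamander",
    "n01641577": "bullfrog",
}

def id_to_name(class_id: str) -> str:
    # Fall back to the raw ID if it is missing from the mapping.
    return class_names.get(class_id, class_id)

print(id_to_name("n01443537"))  # goldfish
```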
26 changes: 13 additions & 13 deletions examples/analyzing-image-classification-dataset.ipynb
@@ -23,13 +23,13 @@
"\n",
"This notebook shows how you can use [fastdup](https://github.com/visual-layer/fastdup) to analyze an image classification dataset for:\n",
"\n",
"+ Duplicates.\n",
"+ Outliers.\n",
"+ Wrong labels.\n",
"+ Image clusters.\n",
"+ Duplicates\n",
"+ Outliers\n",
"+ Wrong labels\n",
"+ Image clusters\n",
"\n",
"\n",
"> **Note** - No GPU needed! You can run on an instance with only CPU.\n",
"> **Note** - No GPU needed! You can run this notebook on a CPU-only instance.\n",
"\n"
]
},
@@ -214,7 +214,7 @@
"id": "8aba34e1",
"metadata": {},
"source": [
"Load the annotation provided with the dataset."
"Load the annotations provided with the dataset."
]
},
{
@@ -328,13 +328,13 @@
"id": "dfc957bf",
"metadata": {},
"source": [
"Transform the annotation to fastdup supported format.\n",
"Transform the annotations to fastdup supported format.\n",
"\n",
"fastdup expects an annotation `DataFrame` that contains the following column:\n",
"\n",
"+ filename - contains the path to the image file.\n",
"+ label - contains a label of the image.\n",
"+ split - whether the image is subset of the training, validation or test dataset."
"+ filename - contains the path to the image file\n",
"+ label - contains a label of the image\n",
"+ split - whether the image is subset of the training, validation or test dataset"
]
},
{
@@ -510,16 +510,16 @@
"source": [
"## Run fastdup\n",
"\n",
"With the images and annotations, we are now ready to run an analysis."
"With the images and annotations ready, we can proceed with running an analysis on the data."
]
},
{
"cell_type": "markdown",
"id": "0a39243e",
"metadata": {},
"source": [
"+ `input_dir` is the path to the downloaded images.\n",
"+ `work_dir` is the path to store the artifacts from the analysis. Optional."
"+ `input_dir` is the path to the downloaded images\n",
"+ `work_dir` is the path to store the artifacts from the analysis (optional)"
]
},
{
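Concretely, the annotation `DataFrame` described in this notebook can be sketched as follows. The file names and labels are placeholders, not entries from the actual dataset:

```python
import pandas as pd

# A minimal sketch of the annotation DataFrame fastdup expects:
# one row per image, with `filename`, `label`, and `split` columns.
annotations = pd.DataFrame(
    [
        {"filename": "images/cat_001.jpg", "label": "cat", "split": "train"},
        {"filename": "images/dog_042.jpg", "label": "dog", "split": "train"},
        {"filename": "images/cat_107.jpg", "label": "cat", "split": "valid"},
    ]
)

print(annotations["split"].value_counts().to_dict())  # {'train': 2, 'valid': 1}
```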
26 changes: 13 additions & 13 deletions examples/analyzing-kaggle-datasets.ipynb
@@ -18,7 +18,7 @@
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/analyzing-kaggle-datasets.ipynb)\n",
"[![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/analyzing-kaggle-datasets.ipynb)\n",
"\n",
"This notebook shows how you can use [fastdup](https://github.com/visual-layer/fastdup) to analyze any computer vision datasets from [Kaggle](https://kaggle.com)."
"This notebook shows how you can use [fastdup](https://github.com/visual-layer/fastdup) to analyze any computer vision dataset from [Kaggle](https://kaggle.com)."
]
},
{
@@ -28,7 +28,7 @@
"source": [
"## Install Kaggle API\n",
"\n",
"To load data programmatically from Kaggle we will need to install the [Kaggle API](https://github.com/Kaggle/kaggle-api). The API lets us pull data from Kaggle using Python.\n",
"To load data programmatically from Kaggle, we will need to install the [Kaggle API](https://github.com/Kaggle/kaggle-api). The API lets us pull data from Kaggle using Python.\n",
"\n",
"To install the API, run:"
]
@@ -48,21 +48,21 @@
"id": "eb3fd4c9-bdfb-4ba9-aef6-528d9811b588",
"metadata": {},
"source": [
"To use the Kaggle API, sign up for a Kaggle account at https://www.kaggle.com/ . \n",
"Note: to use the Kaggle API, you'll need to sign up for a Kaggle account at https://www.kaggle.com/ . \n",
"\n",
"Then, go to the 'Account' tab and select 'Create API Token'. This will trigger the download of `kaggle.json`, a file containing your API credentials. \n",
"Go to the 'Account' tab and select 'Create API Token'. This will trigger the download of `kaggle.json`, a file containing your API credentials. \n",
"\n",
"Place this file in the location `~/.kaggle/kaggle.json` (on Windows in the location `C:\\Users\\<Windows-username>\\.kaggle\\kaggle.json`\n",
"Place this file in the location `~/.kaggle/kaggle.json` (on Windows in the location `C:\\Users\\<Windows-username>\\.kaggle\\kaggle.json`)\n",
"\n",
"Read more [here](https://github.com/Kaggle/kaggle-api#api-credentials)."
"Fore more information on the Kaggle API, click [here](https://github.com/Kaggle/kaggle-api#api-credentials)."
]
},
{
"cell_type": "markdown",
"id": "4b6ae131-572a-4008-a9c1-6f49b21c029e",
"metadata": {},
"source": [
"If the set up is done correctly, you should be able to run the kaggle commands on your terminal. For instance, to list kaggle datasets that has the term \"computer vision\" , run:"
"If the set up is done correctly, you should be able to run the kaggle commands on your terminal. For instance, to list kaggle datasets that have the term \"computer vision\" , run:"
]
},
{
@@ -132,7 +132,7 @@
"id": "622ed625-8e11-4e39-85ed-ae2faf3320a8",
"metadata": {},
"source": [
"Let's say we're interested to analyze the [RVL-CDIP Test Dataset](https://www.kaggle.com/datasets/pdavpoojan/the-rvlcdip-dataset-test). You can head to the dataset page and click on Copy API command and paste it in your terminal.\n",
"Let's say we're interested to analyze the [RVL-CDIP Test Dataset](https://www.kaggle.com/datasets/pdavpoojan/the-rvlcdip-dataset-test). You can head to the dataset page and click on \"Copy API command\" and paste it in your terminal.\n",
"\n",
"![image.png](attachment:4ea6f203-55bd-4ca7-817d-ad2a16721ed0.png)"
]
@@ -142,7 +142,7 @@
"id": "dfa2560e-a945-4683-a841-48b9f29f5fa8",
"metadata": {},
"source": [
"Let's run the command here which would trigger a download of the RVL-CDIP test dataset into your current working directory."
"Let's run the command here, which will trigger a download of the RVL-CDIP test dataset into our current working directory."
]
},
{
@@ -162,7 +162,7 @@
"source": [
"Once done, we should have a `the-rvlcdip-dataset-test.zip` in the current directory.\n",
"\n",
"Let's unzip the file for further analysis with fastdup in the next section."
"Let's unzip the file to prepare it for further analysis with fastdup in the next section."
]
},
{
@@ -180,7 +180,7 @@
"id": "1f8d6b66-3f53-4afb-b040-c5d91a628608",
"metadata": {},
"source": [
"Once completed, we should have a folder with the name `test/` which contains all the images from the dataset."
"Once completed, we should have a folder with the name `test/`, which contains all the images from the dataset."
]
},
{
@@ -253,7 +253,7 @@
"id": "a10910f4-b772-400b-96b6-f44b62b97fe0",
"metadata": {},
"source": [
"To run fastdup, we only need to point `input_dir` to the folder containing images from the dataset."
"To run fastdup, we simply point `input_dir` to the folder containing the images from the dataset."
]
},
{
@@ -336,7 +336,7 @@
"id": "01b2cace-9e25-48bf-826b-907bad037df9",
"metadata": {},
"source": [
"From the summary above, we have 1 corrupted image. Let's get some more details with:"
"From the summary above, we have 1 corrupted image. Let's get some more details:"
]
},
{
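Before running any kaggle commands, a small sanity check like the following can confirm that the credentials file is where the client expects it. This is a sketch: the `KAGGLE_CONFIG_DIR` override is taken from the Kaggle API documentation, and the fallback is the default `~/.kaggle` location:

```python
import os
from pathlib import Path

def kaggle_credentials_path() -> Path:
    """Return where the Kaggle client looks for kaggle.json.

    The KAGGLE_CONFIG_DIR environment variable, when set, overrides
    the default ~/.kaggle directory (per the Kaggle API docs)."""
    config_dir = os.environ.get("KAGGLE_CONFIG_DIR")
    if config_dir:
        return Path(config_dir) / "kaggle.json"
    return Path(os.path.expanduser("~")) / ".kaggle" / "kaggle.json"

credentials = kaggle_credentials_path()
if not credentials.exists():
    print(f"Place your API token at: {credentials}")
```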
36 changes: 18 additions & 18 deletions examples/analyzing-object-detection-dataset.ipynb
@@ -19,7 +19,7 @@
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb)\n",
"[![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb)\n",
"\n",
"In this tutorial, we will analyze an object detection dataset with bounding boxes and identify potential issues. By the end of the notebook, you'll discover how to load a COCO format bounding box annotations into fastdup and inspect the dataset for issues at the bounding box level."
"In this tutorial, we will analyze an object detection dataset with bounding boxes and identify potential issues. By the end of the notebook, you'll discover how to load COCO-format bounding box annotations into fastdup and inspect the dataset for issues at the bounding box level."
]
},
{
@@ -100,7 +100,7 @@
},
"source": [
"## Download Dataset\n",
"We will be using the [COCO minitrain](https://github.com/giddyyupp/coco-minitrain) dataset for this tutorial. COCO minitrain is a curated mini training set with about 25,000 images or 20% of the original [COCO dataset](https://cocodataset.org/#home).\n",
"We will be using the [COCO minitrain](https://github.com/giddyyupp/coco-minitrain) dataset for this tutorial. COCO minitrain is a curated mini training set with about 25,000 images, or 20% of the original [COCO dataset](https://cocodataset.org/#home).\n",
"\n",
"Let's download the dataset into our local drive."
]
@@ -128,9 +128,9 @@
},
"source": [
"## Load Annotations\n",
"fastdup expects the annotations to be in a specific format.\n",
"fastdup requires the annotations to be in a specific format.\n",
"\n",
"We will use a simple converter to convert the COCO format JSON annotation file into the fastdup annotation dataframe. This converter is applicable to any dataset which uses COCO format."
"We will use a simple converter to convert the COCO format JSON annotation file into the fastdup annotation dataframe. This converter is applicable to any dataset that uses the COCO format."
]
},
{
@@ -272,7 +272,7 @@
"source": [
"## Run fastdup\n",
"\n",
"Run fastdup with annotations on the dataset. If you're running on a free Google Colab instance, specify the `num_images` to limit the run to fewer images."
"Run fastdup with annotations on the dataset. If you're running on a free Google Colab instance, you may want to specify `num_images` to limit the run to fewer images."
]
},
{
@@ -293,10 +293,10 @@
"id": "3b4f5823"
},
"source": [
"## Class distribution\n",
"The dataset contains 25k images and 183k objects, an average of 7.3 objects per image. \n",
"## Class Distribution\n",
"The dataset contains 25k images and 183k objects, for an average of 7.3 objects per image. \n",
"\n",
"Interestingly, we see a highly unbalanced class distribution, where all 80 coco classes are present here, but there is a strong balance towards the person class, that accounts for over 56k instances (30.6%). Car and Chair classes also contain over 8k instances each, while at the bottom of the list the toaster and hair drier classes contain as few as 40 instances. \n",
"Interestingly, we see a highly unbalanced class distribution. All 80 COCO classes are present here, but the distribution of classes is strongly skewed towards the person class, which accounts for over 56k instances (30.6%). Car and Chair classes also contain over 8k instances each, while the toaster and hair drier classes contain as few as 40 instances. \n",
"\n",
"Using `Plotly` we get a useful interactive histogram. "
]
@@ -184820,7 +184820,7 @@
},
"source": [
"## Outliers\n",
"Visualize outliers from the run."
"Using fastdup's gallery feature, we can visualize outliers from the run."
]
},
{
@@ -185963,8 +185963,8 @@
"id": "c0f1fade"
},
"source": [
"## Size and shape issues\n",
"Objects come in various shapes and sizes, and sometimes objects might be incorrectly labeled or too small to be useful. We will now find the smallest, narrowest and widest objects, and asses their usefulness. "
"## Size and Shape Issues\n",
"Objects come in various shapes and sizes, and sometimes objects might be incorrectly labeled, or too small to be useful. We will now find the smallest, narrowest and widest objects, and assess their usefulness. "
]
},
{
@@ -186382,7 +186382,7 @@
"id": "da5709c9-297e-47cd-9d13-599c4c76a883",
"metadata": {},
"source": [
"Let's visualize here how the top 3 smallest images look like.\n",
"Let's visualize what the 3 smallest images look like.\n",
"\n",
"The following image is labeled as a `person` in the dataset."
]
@@ -186475,7 +186475,7 @@
"id": "a5a1f0b1-7a85-46bd-a8c1-ebaad837c85f",
"metadata": {},
"source": [
"Considering the image size, we can hardly tell if the label is correct."
"Considering the image size, it is difficult to discern if the label is correct."
]
},
{
@@ -186789,7 +186789,7 @@
"id": "9af6979b"
},
"source": [
"Look at that! The slices reveal many items that are either tiny (10x10 pixels) or have extreme aspect ratios!"
"Using fastdup, we have discovered many items that are either tiny (10x10 pixels) or have extreme aspect ratios!"
]
},
{
@@ -186799,7 +186799,7 @@
"source": [
"## Bad Bounding Boxes\n",
"\n",
"Bounding boxes that are either too small or go beyond image boundaries are flagged as bad bounding box in fastdup.\n",
"Bounding boxes that are either too small or go beyond image boundaries are flagged as a bad bounding box in fastdup.\n",
"\n",
"We can get a list of bad bounding boxes by reading the `atrain_features.bad.csv` file."
]
@@ -186972,7 +186972,7 @@
"id": "6ea5ddca-4c6f-4f92-afd6-32467ed3a437",
"metadata": {},
"source": [
"We have found 18,592 (!) bounding boxes that are either too small or go beyond image boundaries. This is 10% of the data! Filtering them would both save us grusome debugging of training errors and failures and help up provide the model with useful size objects."
"We have found 18,592 (!) bounding boxes that are either too small or go beyond image boundaries. This is 10% of the data! Filtering them out would both save us gruesome debugging of training errors or failures, and help up provide the model with useful size objects."
]
},
{
@@ -186983,9 +186983,9 @@
},
"source": [
"## Possible Mislabels\n",
"The fastdup similarity search and gallery is a strong tool for finding objects that are possibly mislabeled. By finding each object's nearest neighbors and their classes, we can find objects with classes contradicting their neighbors' - a strong sign for mislabels.\n",
"The fastdup similarity search and similarity gallery are strong tools for finding objects that are possibly mislabeled. By finding each object's nearest neighbors and their classes, we can find objects with classes contradicting their neighbors' (a strong sign of mislabels).\n",
"\n",
"Running similarity gallery shows if an image has high similarity with two of its closest neighbors yet has different labels. This helps surface potential mislabeling in the dataset. "
"Running the similarity gallery shows if an image has high similarity with two of its closest neighbors, yet has different labels. This helps surface potential mislabeling in the dataset. "
]
},
{
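The COCO-to-fastdup converter this tutorial mentions can be sketched as below. The tiny `coco` dict stands in for the real annotation JSON, and the bounding-box column names (`col_x`, `row_y`, `width`, `height`) follow the tutorial's annotation layout but should be checked against the fastdup documentation:

```python
import pandas as pd

# A simplified sketch of converting COCO-format annotations into a
# per-bounding-box DataFrame. The `coco` dict below is a stand-in for
# json.load() on the real annotation file.
coco = {
    "images": [{"id": 1, "file_name": "000000001.jpg"}],
    "annotations": [
        # COCO bboxes are [x, y, width, height] in pixels.
        {"image_id": 1, "category_id": 18, "bbox": [10.0, 20.0, 50.0, 40.0]},
    ],
    "categories": [{"id": 18, "name": "dog"}],
}

# Lookup tables: image id -> file name, category id -> class name.
images = {img["id"]: img["file_name"] for img in coco["images"]}
names = {cat["id"]: cat["name"] for cat in coco["categories"]}

rows = []
for ann in coco["annotations"]:
    x, y, w, h = ann["bbox"]
    rows.append(
        {
            "filename": images[ann["image_id"]],
            "col_x": x,   # column names assumed from the tutorial,
            "row_y": y,   # not fastdup's authoritative schema
            "width": w,
            "height": h,
            "label": names[ann["category_id"]],
        }
    )

df = pd.DataFrame(rows)
print(df.iloc[0]["label"])  # dog
```

Because the lookups are built once, the loop stays linear in the number of annotations, which matters for the ~183k objects in COCO minitrain.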