ArtiFact: A Large-Scale Dataset with Artificial and Factual Images for Generalizable and Robust Synthetic Image Detection [ICIP 2023]

Paper:

IEEE Xplore: https://ieeexplore.ieee.org/document/10222083
ArXiv: https://arxiv.org/abs/2302.11970

Abstract: Synthetic image generation has opened up new opportunities but has also created threats in regard to privacy, authenticity, and security. Detecting fake images is of paramount importance to prevent illegal activities, and previous research has shown that generative models leave unique patterns in their synthetic images that can be exploited to detect them. However, the fundamental problem of generalization remains, as even state-of-the-art detectors encounter difficulty when facing generators never seen during training. To assess the generalizability and robustness of synthetic image detectors in the face of real-world impairments, this paper presents a large-scale dataset named ArtiFact, comprising diverse generators, object categories, and real-world challenges. Moreover, the proposed multi-class classification scheme, combined with a filter stride reduction strategy addresses social platform impairments and effectively detects synthetic images from both seen and unseen generators. The proposed solution significantly outperforms other top teams by 8.34% on Test 1, 1.26% on Test 2, and 15.08% on Test 3 in the IEEE VIP Cup challenge at ICIP 2022, as measured by the accuracy metric.

Presentation:

Visual Summary:

Update

[22 June 2023] - The work has been accepted to the IEEE ICIP 2023 conference.

Result on IEEE VIP Cup at ICIP 2022

Accuracy (%) of the top 3 teams on the leaderboard:

Team Names	Test 1	Test 2	Test 3
Sherlock	87.70	77.52	73.45
FAU Erlangen-Nürnberg	87.14	81.74	75.52
Megatron (Ours)	96.04	83.00	90.60

Note: A small portion of the proposed ArtiFact dataset, totaling 222K images (71K real and 151K fake, from only 13 generators), was used in the IEEE VIP Cup. All test data was kept confidential from the participating teams. The generators used for the Test 1 data were known to all teams, whereas the generators for Test 2 and Test 3 were kept undisclosed.
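
The margins quoted in the abstract can be re-derived from the leaderboard table: on each test set, subtract the best accuracy among the other two teams from Megatron's accuracy. A minimal Python check is sketched below; the dictionary layout and variable names are illustrative only, and the differences are in percentage points.

```python
# Accuracy (%) per team on Test 1, Test 2, Test 3, copied from the leaderboard table.
scores = {
    "Sherlock": (87.70, 77.52, 73.45),
    "FAU Erlangen-Nürnberg": (87.14, 81.74, 75.52),
    "Megatron (Ours)": (96.04, 83.00, 90.60),
}

ours = scores["Megatron (Ours)"]
others = [acc for team, acc in scores.items() if team != "Megatron (Ours)"]

# Margin over the strongest competing team on each test set, in percentage points.
margins = [round(o - max(col), 2) for o, col in zip(ours, zip(*others))]
print(margins)  # [8.34, 1.26, 15.08]
```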

Dataset Description

Total number of images: $2,496,738$
Number of real images: $964,989$
Number of fake images: $1,531,749$
Number of generators used for fake images: $25$ (including $13$ GANs, $7$ Diffusion, and $5$ miscellaneous generators)
Number of sources used for real images: $8$
Categories included in the dataset: Human/Human Faces, Animal/Animal Faces, Places, Vehicles, Art, and other real-life objects
Image Resolution: $200 \times 200$
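
As a quick consistency check on these figures, the real and fake counts add up to the stated total, and the three generator families add up to 25. The short Python sketch below (variable names are illustrative) also derives the class balance, which may be worth knowing before training a detector on the full set.

```python
# Counts copied from the dataset description.
total, real, fake = 2_496_738, 964_989, 1_531_749
gans, diffusion, misc = 13, 7, 5

# The parts should add up to the stated totals.
assert real + fake == total
assert gans + diffusion + misc == 25

# Share of each class in the full dataset.
real_share = real / total
fake_share = fake / total
print(f"real: {real_share:.2%}, fake: {fake_share:.2%}")  # real: 38.65%, fake: 61.35%
```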

Data Distribution

Real

Fake

Download Dataset

The dataset is hosted on Kaggle. It can be downloaded either i) directly from the browser using the link below or ii) with the Kaggle API.

i) Directly from Browser

Link: ArtiFact Dataset

ii) Kaggle API

!kaggle datasets download -d awsaf49/artifact-dataset

How to Use

The dataset is organized into folders, each of which corresponds to a specific generator of synthetic images or source of real images. Each folder contains a metadata.csv file, which provides information about the images in that folder. It contains the following columns:

image_path : The relative path of the image file.
target : The label for the image, which is either 0 for real or 1 for fake.
category : The object category of the image (e.g., cat or dog).
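
A minimal sketch of how the per-folder metadata.csv files might be gathered into a single list of records, using only the Python standard library. The function name `load_metadata` and the extra `source` field are hypothetical, and the code assumes that `image_path` is relative to the folder that holds the metadata.csv file; adjust the path join if the paths turn out to be relative to the dataset root instead.

```python
import csv
from pathlib import Path


def load_metadata(root):
    """Collect rows from every <root>/<folder>/metadata.csv into one list."""
    rows = []
    for meta in sorted(Path(root).glob("*/metadata.csv")):
        with meta.open(newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                rows.append({
                    # Folder name, i.e. the generator or real-image source.
                    "source": meta.parent.name,
                    # Assumption: image_path is relative to the folder
                    # containing metadata.csv.
                    "image_path": meta.parent / row["image_path"],
                    # 0 = real, 1 = fake.
                    "target": int(row["target"]),
                    "category": row["category"],
                })
    return rows
```

With the rows loaded, something like `collections.Counter(r["target"] for r in rows)` gives the real/fake counts, and filtering on `source` makes it possible to hold out specific generators when testing generalization to unseen ones.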

Data Generation

Images are randomly sampled from different methods and then transformed using impairments. The methods are listed below:

Methods


Method	ImageNet	COCO	LSUN	AFHQ	FFHQ	Metfaces	CelebAHQ	Landscape	Glide	StyleGAN2	StyleGAN3	Generative Inpainting	Taming Transformer	MAT	LaMa	Stable Diffusion	VQ Diffusion	Palette	StyleGAN1	Latent Diffusion	CIPS	StarGAN	BigGAN	GANformer	ProjectedGAN	SFHQ	FaceSynthetics	Denoising Diffusion GAN	DDPM	DiffusionGAN	GauGAN	ProGAN	CycleGAN
Reference	link	link	link	link	link	link	link	link	link	link	link	link	link	link	link	link	link	link	link	link	link	link	link	link	link	link	link	link	link	link	link	link	link

All images went through RandomCrop and random impairments (JPEG compression and downscaling). To apply these transformations, use data/transform.py. Each image is cropped, resized to $200 \times 200$ pixels, and then compressed using JPEG at a random quality level.

!python data/transform.py <input directory> <output directory> <seed>
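
The pipeline described above (random crop, resize to 200 × 200, JPEG compression at a random quality) can be sketched roughly as follows. This is not the contents of data/transform.py: the function names, the minimum crop side of 160 pixels, the quality range of 65 to 100, and the bicubic resampling are all assumptions chosen for illustration. Parameter sampling is kept separate from image I/O so that a fixed seed reproduces the same crops and quality levels; Pillow is only needed when an image is actually written.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Impairment:
    box: tuple[int, int, int, int]  # (left, upper, right, lower) crop box
    quality: int                    # JPEG quality level


def sample_impairment(rng, width, height, min_side=160, quality_range=(65, 100)):
    """Draw a random square crop and a random JPEG quality (assumed ranges)."""
    short = min(width, height)
    # The crop side cannot exceed the shorter image edge.
    side = rng.randint(min(min_side, short), short)
    left = rng.randint(0, width - side)
    top = rng.randint(0, height - side)
    quality = rng.randint(*quality_range)
    return Impairment((left, top, left + side, top + side), quality)


def apply_impairment(src_path, dst_path, imp, size=200):
    """Crop, resize to size x size, and save as JPEG at the sampled quality."""
    from PIL import Image  # imported lazily; requires the Pillow package

    with Image.open(src_path) as img:
        out = img.convert("RGB").crop(imp.box).resize((size, size), Image.BICUBIC)
        out.save(dst_path, format="JPEG", quality=imp.quality)
```

A caller would create `rng = random.Random(seed)` once, then for each image read its width and height, call `sample_impairment`, and pass the result to `apply_impairment`.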

Citation

@INPROCEEDINGS{artifact,
  author={Rahman, Md Awsafur and Paul, Bishmoy and Sarker, Najibul Haque and Hakim, Zaber Ibn Abdul and Fattah, Shaikh Anowarul},
  booktitle={2023 IEEE International Conference on Image Processing (ICIP)}, 
  title={Artifact: A Large-Scale Dataset With Artificial And Factual Images For Generalizable And Robust Synthetic Image Detection}, 
  year={2023},
  volume={},
  number={},
  pages={2200-2204},
  doi={10.1109/ICIP49359.2023.10222083}}

License

The ArtiFact dataset draws on data from multiple methods, so different parts of the dataset come with different licenses. The methods and their associated licenses are listed in the table below:

Data License

Method	License
ImageNet	Non Commercial
COCO	Creative Commons Attribution 4.0 License
LSUN	Unknown
AFHQ	Creative Commons Attribution-NonCommercial 4.0 International Public
FFHQ	Creative Commons BY-NC-SA 4.0 license
Metfaces	Creative Commons BY-NC 2.0
CelebAHQ	Creative Commons Attribution-NonCommercial 4.0 International Public
Landscape	MIT license
Glide	MIT license
StyleGAN2	Nvidia Source Code License
StyleGAN3	Nvidia Source Code License
Generative Inpainting	Creative Commons Public Licenses
Taming Transformer	MIT License
MAT	Creative Commons Public Licenses
LaMa	Apache-2.0 License
Stable Diffusion	Apache-2.0 License
VQ Diffusion	MIT License
Palette	MIT License
StyleGAN1	Creative Commons Public Licenses
Latent Diffusion	MIT License
CIPS	MIT License
StarGAN	MIT License
BigGAN	MIT License
GANformer	MIT License
ProjectedGAN	MIT License
SFHQ	MIT License
FaceSynthetics	Research Use of Data Agreement v1.0
Denoising Diffusion GAN	NVIDIA License
DDPM	Unknown
DiffusionGAN	MIT License
GauGAN	Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License
ProGAN	Attribution-NonCommercial 4.0 International
CycleGAN	BSD

Acknowledgment

The authors would like to express their gratitude to the IEEE Signal Processing Society, GRIP of the University Federico II of Naples (Italy), and NVIDIA (USA) for hosting the IEEE Video and Image Processing (VIP) Cup competition at ICIP 2022. This competition provided a platform for the authors to showcase their work and motivated them to push their boundaries to deliver a state-of-the-art solution.

The authors would also like to express their gratitude to the authors of the methods used to create the ArtiFact dataset. The methods and their references are listed below:

Data Reference

Method	Reference
ImageNet	link
COCO	link
LSUN	link
AFHQ	link
FFHQ	link
Metfaces	link
CelebAHQ	link
Landscape	link
Glide	link
StyleGAN2	link
StyleGAN3	link
Generative Inpainting	link
Taming Transformer	link
MAT	link
LaMa	link
Stable Diffusion	link
VQ Diffusion	link
Palette	link
StyleGAN1	link
Latent Diffusion	link
CIPS	link
StarGAN	link
BigGAN	link
GANformer	link
ProjectedGAN	link
SFHQ	link
FaceSynthetics	link
Denoising Diffusion GAN	link
DDPM	link
DiffusionGAN	link
GauGAN	link
ProGAN	link
CycleGAN	link
