We recently noticed that someone crawling from idris.fr was impersonating CCBot in their user-agent string.
We contacted IDRIS and they said they would stop doing it.
It would be good if the documentation for img2dataset said:
- Use your own user agent; don't impersonate anyone else
- Respect robots.txt and the `X-Robots-Tag` header
- Respect the robots.txt and `X-Robots-Tag` rules addressed to CCBot, in addition to the rules for your own bot
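
To illustrate what the last two points could look like in practice, here is a minimal sketch using only the Python standard library. It is not img2dataset's implementation: the agent name `my-dataset-bot`, the helper names, and the list of blocking directives are all assumptions made for the example, and the `X-Robots-Tag` parsing is deliberately simplified.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical token for your own crawler; pick your own, never someone else's.
MY_AGENT = "my-dataset-bot"
# Check the rules for your own bot and, additionally, the rules aimed at CCBot.
AGENTS = (MY_AGENT, "CCBot")
# Assumed set of directives treated as "do not use"; adjust to your policy.
BLOCKING = ("noindex", "noimageindex", "noai", "noimageai")


def allowed_by_robots(robots_txt, url, agents=AGENTS):
    """Return True only if every listed agent may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return all(parser.can_fetch(agent, url) for agent in agents)


def allowed_by_x_robots(header_values, agents=AGENTS, blocking=BLOCKING):
    """Return False if any X-Robots-Tag value blocks one of `agents`.

    Simplified parsing: a value is either "directive, directive" (applies to
    everyone) or "agent: directive, directive" (applies to that agent only).
    """
    wanted = {a.lower() for a in agents}
    for value in header_values:
        token, sep, rest = value.partition(":")
        if sep:
            # Scoped to one agent; skip it unless it names one of ours.
            if token.strip().lower() not in wanted:
                continue
            directives = rest
        else:
            directives = value
        if any(d.strip().lower() in blocking for d in directives.split(",")):
            return False
    return True


SAMPLE_ROBOTS = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    url = "https://example.com/img.jpg"
    # Our own bot alone would be allowed, but the CCBot rule forbids the fetch.
    print(allowed_by_robots(SAMPLE_ROBOTS, url, agents=(MY_AGENT,)))
    print(allowed_by_robots(SAMPLE_ROBOTS, url))
    print(allowed_by_x_robots(["CCBot: noindex"]))
```

In this sketch a URL is skipped as soon as either your own bot or CCBot is disallowed, which is the conservative reading of "in addition to the rules for your bot".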