@donatj donatj commented May 14, 2025

No description provided.

@donatj donatj changed the title Lots more Bots Lots More Bots May 15, 2025

donatj commented May 15, 2025

My hesitance in merging this is twofold:

  • It benchmarks a fair bit slower per UA.
  • It might cause unexpected behaviour for people who vary their HTML by detected UA: a UA that was previously reported as a browser would suddenly, and correctly, be reported as a bot.
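
For reference, the per-UA cost could be measured with something like the sketch below. This is only an illustration, not the benchmark used for this PR: `meanNanosPerUa` is a hypothetical helper, and `$parser` is assumed to be any callable that accepts a user agent string.

```php
<?php
// Hypothetical helper: mean wall-clock time per user agent, in nanoseconds.
// $parser stands in for whatever parsing function is being measured.
function meanNanosPerUa(callable $parser, array $userAgents, int $rounds = 100): float
{
    // Nothing to measure: avoid a division by zero below.
    if ($userAgents === [] || $rounds < 1) {
        return 0.0;
    }

    $start = hrtime(true); // monotonic clock, in nanoseconds
    for ($i = 0; $i < $rounds; $i++) {
        foreach ($userAgents as $ua) {
            $parser($ua);
        }
    }
    $elapsed = hrtime(true) - $start;

    // Average over every individual parse call.
    return $elapsed / ($rounds * count($userAgents));
}
```

Running the same script against the same UA list on both branches would give a rough before/after comparison; absolute numbers will vary by machine.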

@donatj donatj requested a review from Copilot October 31, 2025 09:46

Copilot AI left a comment

Pull Request Overview

This PR enhances bot detection in the user agent parser by adding support for numerous web crawlers and bots. The main changes implement a new pattern-based bot detection mechanism that can identify bots with URL references in their user agent strings, while maintaining backward compatibility for existing bot detection.

  • Added a new regex-based bot detection system that identifies bots by their characteristic (name/version; +http://...) pattern
  • Removed hardcoded bot names from the main browser regex and moved them to the new bot detection logic
  • Added constants for 11 new bot/crawler types that are commonly used
  • Added 116 new test cases for various bot user agents
  • Added browser constant exclusions for lesser-known bots to keep the API surface manageable
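
To illustrate the (name/version; +http://...) shape described above, here is a minimal sketch. It is not the regex from this PR: the pattern, the function name `detectUrlBot`, and the returned array keys are all assumptions made for the example.

```php
<?php
// Hypothetical sketch: find a "Name/Version; +http(s)://" token in a UA string.
// The pattern is a guess at the general shape, not the PR's actual regex.
function detectUrlBot(string $ua): ?array
{
    $pattern = '~(?P<name>[A-Za-z0-9._-]+)/(?P<version>[0-9][0-9A-Za-z._-]*);\s*\+https?://~';

    if (preg_match($pattern, $ua, $m) !== 1) {
        return null; // no bot-style token with a URL reference
    }

    return ['browser' => $m['name'], 'version' => $m['version']];
}

// Example: a crawler UA that carries a URL reference.
$bot = detectUrlBot('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)');
// $bot is ['browser' => 'Googlebot', 'version' => '2.1']
```

An ordinary browser UA has no `; +http://` token after a name/version pair, so this sketch returns null for it.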

Reviewed Changes

Copilot reviewed 4 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/UserAgentParser.php | Refactored bot detection logic with new regex pattern and added prev capture group to main browser regex |
| src/UserAgent/Browsers.php | Added constants for 11 new commonly-used bot types (archive.org_bot, Discordbot, Google-Read-Aloud, Google-Safety, Pinterestbot, WellKnownBot, wpbot, YandexImages, YandexMobileBot, YandexRCA, YandexUserproxy) |
| bin/constant_generator.php | Added exclusion list for 30 bot types to prevent cluttering the Browsers interface |
| bin/user_agent_sorter.php | Added error handling for JSON decode failures |
| tests/user_agents.dist.json | Added 116 new bot user agent test cases covering various crawlers and bots |
| README.md | Updated documentation with new browser constants |

if(
    (
        empty($result[BROWSER][0])
        || ($result['prev'][0] !== '')

Copilot AI Oct 31, 2025

This condition can raise an 'Undefined index' notice (an 'Undefined array key' warning on PHP 8 and later) if $result['prev'][0] doesn't exist. The regex on line 99 uses (?P<prev>.)? with a ? quantifier, so the capture group may not be set. The condition should check for existence with isset() or !empty() before accessing the array element. Suggested fix: || (isset($result['prev'][0]) && $result['prev'][0] !== '')

Suggested change
- || ($result['prev'][0] !== '')
+ || (isset($result['prev'][0]) && $result['prev'][0] !== '')
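
The guard in the suggestion can be tried in isolation with a small sketch. The helper name `hasPrevChar` and the toy pattern are assumptions for illustration only; this is not the parser's regex, and whether the key can actually be missing there depends on how the real pattern is matched.

```php
<?php
// Hypothetical helper mirroring the suggested guard: true only when the
// optional "prev" capture exists and is a non-empty string.
function hasPrevChar(array $result): bool
{
    return isset($result['prev'][0]) && $result['prev'][0] !== '';
}

// Toy pattern (NOT the parser's regex): an optional lowercase char before "Bot".
preg_match_all('/(?P<prev>[a-z])?Bot/', 'xBot', $withPrev);
preg_match_all('/(?P<prev>[a-z])?Bot/', 'Bot', $withoutPrev);

var_dump(hasPrevChar($withPrev));    // bool(true): "x" was captured
var_dump(hasPrevChar($withoutPrev)); // bool(false): nothing captured
var_dump(hasPrevChar([]));           // bool(false): key missing, no warning raised
```

The isset() check costs very little and makes the condition safe whether or not the key is present.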
