feat: BrightData WebSearch #9719

RafaelJohn9 · 2025-08-18T11:29:13Z

Related Issues

fixes Bright Data web search component #9595
Will add release note once the component is fully complete.

Proposed Changes:

Adds a BrightData web search component.

How did you test it?

unit tests (coming soon)

Notes for the reviewer

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
I documented my code
I ran pre-commit hooks and fixed any issue

Signed-off-by: rafaeljohn9 <[email protected]>

coveralls · 2025-08-18T11:33:10Z

Pull Request Test Coverage Report for Build 17039275902

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-0.3%) to 91.917%

Totals
Change from base Build 17037038734:	-0.3%
Covered Lines:	12895
Relevant Lines:	14029

💛 - Coveralls

RafaelJohn9 · 2025-08-18T12:58:42Z

Design Challenge with BrightData WebSearch Component - Seeking Architectural Guidance

Hi @sjrl ,

I'm currently developing a BrightData WebSearch component for Haystack and have encountered a significant design dilemma that I'd like to discuss with you and @meirk-brd who originally requested this feature.

The Challenge:

For standard WebSearch components, we expect a clean return format:

dict[str, Union[list[Document], list[str]]]

However, BrightData's API behaves differently: it returns the search page much as a browser would see it, not a parsed list of results.

These are some of the approaches I was considering:

1. Use JSON Support Across Search Engines

Google: Can return structured JSON using brd_json=1 (e.g., https://www.google.com/search?q=pizza&brd_json=1)

Limitation:

Other engines (Bing, Yahoo, DuckDuckGo): Require different formatting approaches
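As a rough illustration of the Google case above, the sketch below builds a search URL that carries the brd_json=1 flag. The function name and the as_json switch are hypothetical and only for illustration; sending the request through BrightData (credentials, zone, proxy settings) is left out on purpose.

```python
from urllib.parse import urlencode


def google_search_url(query: str, as_json: bool = True) -> str:
    """Build a Google search URL, optionally asking for parsed JSON via brd_json=1.

    Hypothetical helper: only the brd_json=1 flag comes from the discussion above.
    """
    params = {"q": query}
    if as_json:
        params["brd_json"] = "1"
    # urlencode escapes the query and joins the parameters with "&".
    return "https://www.google.com/search?" + urlencode(params)


print(google_search_url("pizza"))
# https://www.google.com/search?q=pizza&brd_json=1
```

Other engines would still need their own handling, since this flag is only described for Google here.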

2. Data Processing

Would need BeautifulSoup as an additional dependency for HTML parsing

Limitation:

Requires engine-specific parsing logic that's heavily tailored to BrightData
Introduces inconsistent data structures depending on the search provider used

My Core Concern:

I believe we're trying to force BrightData into the same box as other web search APIs, when BrightData is fundamentally different and much more capable. Unlike simple search APIs, BrightData handles:

Advanced proxy management
Bypassing geo-restrictions and blocked sites
Anti-bot detection circumvention
Complex data extraction from protected sources
CAPTCHA solving
Rate limiting and session management

The Question:

Should we continue trying to conform BrightData to the standard WebSearch component interface, or would it be more appropriate to create a distinct component type that better leverages BrightData's unique capabilities?

Adding BeautifulSoup parsing and engine-specific logic feels like we're compromising the elegance of Haystack's component architecture for a tool that deserves its own specialized approach.

@meirk-brd, I'd love to hear your thoughts on this as well, since you have the most context on BrightData's intended use cases.

I'd appreciate both of your thoughts on the best path forward - whether to persist with the current approach or explore a more BrightData-native component design.

meirk-brd · 2025-08-18T13:20:14Z

Hi @RafaelJohn9,

Thank you VERY much for your work on this PR. To answer your questions: we have an option to return a search page as Markdown:
https://docs.brightdata.com/scraping-automation/web-unlocker/features#scrape-as-markdown

You can use this to scrape search results as well. For Google, you can use brd_json=1 and output only the meaningful, relevant fields; for the other engines, you can return the results as Markdown so they are LLM-friendly (and won't blow up the context window with HTML tags).

A good example is our MCP server: https://github.com/brightdata/brightdata-mcp. While the implementation is in JS, you can see how we return the results as Markdown there.

Let me know if you have any additional questions on this matter!
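To make the "output only meaningful fields" idea concrete, here is a minimal sketch that trims an already-parsed JSON payload down to a few fields. The key names ("organic", "title", "link", "description") are assumptions for illustration and may not match the real brd_json=1 response; check an actual payload before relying on them.

```python
from typing import Any

# Assumed field names; the real payload may use different keys.
MEANINGFUL_FIELDS = ("title", "link", "description")


def trim_results(payload: dict[str, Any], results_key: str = "organic") -> list[dict[str, str]]:
    """Keep only the assumed meaningful fields of each search result."""
    trimmed = []
    for item in payload.get(results_key, []):
        # Copy only the fields that are present and non-empty.
        entry = {field: item[field] for field in MEANINGFUL_FIELDS if item.get(field)}
        # A result without a link is not useful as a search hit, so skip it.
        if "link" in entry:
            trimmed.append(entry)
    return trimmed
```

Each trimmed entry could then be mapped onto one Document, which keeps the per-result shape the web search interface expects.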

RafaelJohn9 · 2025-08-22T09:52:04Z

Hey @anakin87, I have a dilemma. Normally, when using BrightData for scraping, you get results similar to what you’d see in a browser. As @meirk-brd mentioned above, we can use a markdown format, which is definitely LLM-friendly (see: https://gist.github.com/RafaelJohn9/239fda2045db70704075d9e080c7cfd0).

However, the main issue lies in parsing. The websearch component model returns dict[str, Union[list[Document], list[str]]], which means we expect a parsed response where a Document refers to a single search result.

I could add the markdown library for parsing, but that would require implementing different methods to handle results from various search engines (Google, Bing, Yandex, etc.).
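One engine-agnostic fallback, sketched below under stated assumptions, is to skip a full Markdown parser and pull out inline links with a regular expression, treating each unique link as one result. SearchResult is a hypothetical stand-in for Haystack's Document, the "documents" and "links" keys are assumed to mirror the other web search components, and real result pages will contain navigation links that this naive approach does not filter out.

```python
import re
from dataclasses import dataclass


@dataclass
class SearchResult:
    """Hypothetical stand-in for a Document holding one search result."""
    title: str
    link: str


# Matches [text](http...) but not images, which start with "![".
LINK_PATTERN = re.compile(r"(?<!!)\[([^\]]+)\]\((https?://[^\s)]+)\)")


def parse_markdown_results(markdown: str) -> dict[str, list]:
    """Extract unique inline links from Markdown, one result per link."""
    results = []
    seen = set()
    for title, link in LINK_PATTERN.findall(markdown):
        if link in seen:
            continue  # the same URL often appears more than once on a page
        seen.add(link)
        results.append(SearchResult(title=title.strip(), link=link))
    return {"documents": results, "links": [r.link for r in results]}
```

This avoids per-engine parsing methods at the cost of precision, so it is a starting point for discussion, not a replacement for structured output.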

Should I go for it?

sjrl · 2025-08-22T11:19:13Z

Hey @RafaelJohn9, thanks for your work on this! A few questions and comments:

Is there a reason we aren't using the BrightData Python SDK (https://github.com/brightdata/bright-data-sdk-python)? It seems like it could reduce boilerplate code and help with some of the issues you are facing.
I believe this component should probably live as its own integration in https://github.com/deepset-ai/haystack-core-integrations. You can find some recent PRs showing how new integrations are added here, and check out our contribution guidelines here.

Update: I just saw this comment from @julian-risch, which explains why you went with requests. In that case, I think we should assess whether the SDK can solve this parsing dilemma gracefully without us needing to build a custom parsing solution.

julian-risch

If using the BrightData Python SDK makes things easier, I agree with @sjrl that a new Haystack integration would be best. I'm sorry if my comment in the other thread pointed in a more complicated direction. With an additional dependency, the new BrightData web search component would be a perfect fit for https://github.com/deepset-ai/haystack-core-integrations

meirk-brd · 2025-08-27T06:08:50Z

If using the BrightData Python SDK makes things easier

It will indeed be much easier for you to use it, and we are also working on improving it right now, so it should be a great fit for this PR!

feat: initial version of bright_data websearch

b0f6b9e

Signed-off-by: rafaeljohn9 <[email protected]>

github-actions bot added the type:documentation Improvements on the docs label Aug 18, 2025

RafaelJohn9 changed the title ~~feat: initial version of bright_data websearch~~ feat: BrightData WebSearch Aug 18, 2025

sjrl requested a review from julian-risch August 22, 2025 11:21

julian-risch reviewed Aug 26, 2025
