RafaelJohn9
Contributor

@RafaelJohn9 RafaelJohn9 commented Aug 18, 2025

Related Issues

Proposed Changes:

  • Adds a BrightData web search component.

How did you test it?

unit tests (coming soon)

Notes for the reviewer

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@github-actions github-actions bot added the type:documentation Improvements on the docs label Aug 18, 2025
@coveralls
Collaborator

Pull Request Test Coverage Report for Build 17039275902

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.3%) to 91.917%

Totals Coverage Status
Change from base Build 17037038734: -0.3%
Covered Lines: 12895
Relevant Lines: 14029

💛 - Coveralls

@RafaelJohn9 RafaelJohn9 changed the title feat: initial version of bright_data websearch feat: BrightData WebSearch Aug 18, 2025
@RafaelJohn9
Contributor Author

Design Challenge with BrightData WebSearch Component - Seeking Architectural Guidance

Hi @sjrl,

I'm currently developing a BrightData WebSearch component for Haystack and have encountered a significant design dilemma that I'd like to discuss with you and @meirk-brd who originally requested this feature.

The Challenge:

For standard WebSearch components, we expect a clean return format:

dict[str, Union[list[Document], list[str]]]

However, BrightData's API works differently: it returns the search results page much as a browser would see it, not a parsed list of results.

These are some of the approaches I was considering:

1. Use JSON Support Across Search Engines

  • Google: Can return structured JSON using brd_json=1 (e.g., https://www.google.com/search?q=pizza&brd_json=1)

Limitation:

  • Other engines (Bing, Yahoo, DuckDuckGo): Require different formatting approaches
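To make option 1 concrete, here is a minimal sketch of building the search URL per engine. Only the Google URL shape and the brd_json=1 parameter come from this thread; the `SEARCH_ENDPOINTS` mapping and the `build_search_url` helper are hypothetical names used for illustration, and the sketch assumes the other engines get no JSON flag.

```python
from urllib.parse import urlencode

# Hypothetical mapping for illustration. Only the Google URL shape and
# the brd_json=1 flag are taken from the discussion above.
SEARCH_ENDPOINTS = {
    "google": "https://www.google.com/search",
    "bing": "https://www.bing.com/search",
}


def build_search_url(engine: str, query: str) -> str:
    """Build a search URL, asking for structured JSON only where it is known to work."""
    params = {"q": query}
    if engine == "google":
        # Assumption from the thread: brd_json=1 yields parsed JSON for Google.
        params["brd_json"] = "1"
    return f"{SEARCH_ENDPOINTS[engine]}?{urlencode(params)}"


print(build_search_url("google", "pizza"))
print(build_search_url("bing", "pizza"))
```

Every engine without a JSON flag falls through to the plain URL, which is where the different formatting approaches would have to come in.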

2. Data Processing

  • Would need BeautifulSoup as an additional dependency for HTML parsing

Limitation:

  • Requires engine-specific parsing logic that's heavily tailored to BrightData
  • Introduces inconsistent data structures depending on the search provider used
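To illustrate the parsing concern, here is a rough sketch that uses the standard library's html.parser as a stand-in for BeautifulSoup. It is not the PR's implementation; `LinkExtractor` and `extract_links` are hypothetical names. It collects every absolute link on a page, which is the easy part. Telling real search results apart from navigation and ads is the part that would need selectors tuned to each engine's markup.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> tag with an absolute http(s) href."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[tuple[str, str]] = []
        self._href = None  # href of the <a> currently being read, if any
        self._text: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.startswith(("http://", "https://")):
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only accumulate text while inside a link we decided to keep.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html: str) -> list[tuple[str, str]]:
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

A generic extractor like this returns the same structure for any engine, but without engine-specific filtering it cannot say which links are actual results.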

My Core Concern:

I believe we're trying to force BrightData into the same box as other web search APIs, when BrightData is fundamentally different and much more capable. Unlike simple search APIs, BrightData handles:

  • Advanced proxy management
  • Bypassing geo-restrictions and blocked sites
  • Anti-bot detection circumvention
  • Complex data extraction from protected sources
  • CAPTCHA solving
  • Rate limiting and session management

The Question:

Should we continue trying to conform BrightData to the standard WebSearch component interface, or would it be more appropriate to create a distinct component type that better leverages BrightData's unique capabilities?

Adding BeautifulSoup parsing and engine-specific logic feels like we're compromising the elegance of Haystack's component architecture for a tool that deserves its own specialized approach.

@meirk-brd, I'd love to hear your thoughts on this as well, since you have the most context on BrightData's intended use cases.

I'd appreciate both of your thoughts on the best path forward - whether to persist with the current approach or explore a more BrightData-native component design.

@meirk-brd

Hi @RafaelJohn9,

Thank you VERY much for your work on this PR. To answer your questions: we have an option to return a search page as Markdown:
https://docs.brightdata.com/scraping-automation/web-unlocker/features#scrape-as-markdown

You can use this to scrape search results as well. For Google, you can use brd_json=1 and output only the meaningful fields that are relevant; for the other engines, you can return the results as Markdown so they are LLM-friendly (and won't blow up the context window with HTML tags).
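As a rough sketch of that split (JSON for Google, Markdown for everything else), the request body could be assembled as below. The helper name `build_unlocker_payload` is hypothetical, and the field names `zone`, `format`, and `data_format` are assumptions about the request body that should be checked against the docs linked above before use.

```python
from urllib.parse import urlencode


def build_unlocker_payload(engine: str, base_url: str, query: str, zone: str) -> dict[str, str]:
    """Assemble a request body: parsed JSON for Google, Markdown for other engines."""
    params = {"q": query}
    # Assumed field names; verify against the Web Unlocker documentation.
    payload = {"zone": zone, "format": "raw"}
    if engine == "google":
        # From this thread: brd_json=1 makes Google results come back as JSON.
        params["brd_json"] = "1"
    else:
        # Assumed flag for the scrape-as-Markdown feature mentioned above.
        payload["data_format"] = "markdown"
    payload["url"] = f"{base_url}?{urlencode(params)}"
    return payload


print(build_unlocker_payload("google", "https://www.google.com/search", "pizza", "my_zone"))
print(build_unlocker_payload("bing", "https://www.bing.com/search", "pizza", "my_zone"))
```

The function only builds the dictionary and performs no network call, so it can be unit-tested without credentials.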

A good example is our MCP server: https://github.com/brightdata/brightdata-mcp. While the implementation is in JS, you can see how we return the results as Markdown there.

Let me know if you have any additional questions on this matter!

@RafaelJohn9
Contributor Author

Hey @anakin87, I have a dilemma. Normally, when using BrightData for scraping, you get results similar to what you’d see in a browser. As @meirk-brd mentioned above, we can use a markdown format, which is definitely LLM-friendly (see: https://gist.github.com/RafaelJohn9/239fda2045db70704075d9e080c7cfd0).

However, the main issue lies in parsing. The websearch component model returns dict[str, Union[list[Document], list[str]]], which means we expect a parsed response where a Document refers to a single search result.

I could add the markdown library for parsing, but that would require implementing different methods to handle results from various search engines (Google, Bing, Yandex, etc.).
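For reference, an engine-agnostic fallback could look roughly like the sketch below: treat every Markdown link as one result and skip images and repeated URLs. This is a heuristic under stated assumptions, not what the PR implements. The `Document` dataclass is a dependency-free stand-in for Haystack's Document, the `markdown_to_results` helper is hypothetical, and the "documents" and "links" keys are assumed to match what other web search components return.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for Haystack's Document so the sketch has no dependencies."""
    content: str
    meta: dict = field(default_factory=dict)


# Matches [title](http...) but not image syntax ![alt](http...).
LINK_PATTERN = re.compile(r"(?<!!)\[([^\]]+)\]\((https?://[^\s)]+)\)")


def markdown_to_results(markdown: str) -> dict[str, list]:
    """Turn each distinct Markdown link into one Document plus one link string."""
    documents: list[Document] = []
    links: list[str] = []
    seen: set[str] = set()
    for title, url in LINK_PATTERN.findall(markdown):
        if url in seen:
            continue  # the same URL often appears more than once on a results page
        seen.add(url)
        documents.append(Document(content=title.strip(), meta={"link": url}))
        links.append(url)
    return {"documents": documents, "links": links}
```

This avoids per-engine methods, at the cost of also picking up navigation links and losing result snippets, so it would still need filtering before it is useful.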

Should I go for it?

@sjrl
Contributor

sjrl commented Aug 22, 2025

Hey @RafaelJohn9, thanks for your work on this! A few questions and comments:

Update: I just saw this comment from @julian-risch, which explains why you went with requests. In that case, I think we should assess whether the SDK can solve this parsing dilemma gracefully, without us needing to create a custom parsing solution.

@sjrl sjrl requested a review from julian-risch August 22, 2025 11:21
Member

@julian-risch julian-risch left a comment

If using the BrightData Python SDK makes things easier, I agree with @sjrl that a new Haystack integration would be best. I'm sorry if my comment in the other thread led in a more complicated direction. With an additional dependency, the new BrightData web search component would be a perfect fit for https://github.com/deepset-ai/haystack-core-integrations

@meirk-brd

If using the BrightData Python SDK makes things easier

It will indeed be much easier for you to use it, and we are also working on improving it right now, so it should be a great fit for this PR!

Labels
type:documentation Improvements on the docs
Development

Successfully merging this pull request may close these issues.

Bright Data web search component
5 participants