feat: BrightData WebSearch #9719
base: main
Signed-off-by: rafaeljohn9 <[email protected]>
Pull Request Test Coverage Report for Build 17039275902 (Coveralls)
Design Challenge with BrightData WebSearch Component - Seeking Architectural Guidance

Hi @sjrl, I'm currently developing a BrightData WebSearch component for Haystack and have encountered a significant design dilemma that I'd like to discuss with you and @meirk-brd, who originally requested this feature.

The Challenge: For standard WebSearch components, we expect a clean return format: `dict[str, Union[list[Document], list[str]]]`. However, BrightData's API is a bit different. These are some of the approaches I was considering:

1. Use JSON Support Across Search Engines
Limitation:
2. Data Processing
Limitation:
My Core Concern: I believe we're trying to force BrightData into the same box as other web search APIs, when BrightData is fundamentally different and much more capable. Unlike simple search APIs, BrightData handles:
The Question: Should we continue trying to conform BrightData to the standard WebSearch component interface, or would it be more appropriate to create a distinct component type that better leverages BrightData's unique capabilities? Adding BeautifulSoup parsing and engine-specific logic feels like we're compromising the elegance of Haystack's component architecture for a tool that deserves its own specialized approach.

@meirk-brd, I'd love to hear your thoughts on this as well, since you have the most context on BrightData's intended use cases. I'd appreciate both of your thoughts on the best path forward: whether to persist with the current approach or explore a more BrightData-native component design.
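For reference, the standard return format mentioned in the comment above can be sketched as follows. This is a minimal illustration, not the PR's implementation: `Document` here is a stand-in dataclass so the snippet runs without Haystack installed, `to_websearch_output` is a hypothetical helper name, and the `title`/`link`/`snippet` keys of the input are assumed rather than taken from BrightData's response schema.

```python
from dataclasses import dataclass, field
from typing import Any, Union


@dataclass
class Document:
    """Stand-in for haystack.Document so the sketch runs without Haystack."""

    content: str
    meta: dict[str, Any] = field(default_factory=dict)


def to_websearch_output(
    results: list[dict[str, str]],
) -> dict[str, Union[list[Document], list[str]]]:
    """Convert already-parsed search hits into the {"documents", "links"} shape.

    Each hit is assumed (hypothetically) to carry "title", "link" and "snippet".
    """
    documents: list[Document] = []
    links: list[str] = []
    for item in results:
        link = item.get("link", "")
        documents.append(
            Document(
                content=item.get("snippet", ""),
                meta={"title": item.get("title", ""), "link": link},
            )
        )
        if link:  # only keep non-empty URLs in the links list
            links.append(link)
    return {"documents": documents, "links": links}
```

The difficulty raised in this thread is everything that happens before this step: getting a uniform list of hits out of engine-specific HTML or markdown.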
Hi @RafaelJohn9, thank you VERY much for your work on this PR. To answer your questions: we have an option to return a search page as markdown. You can use this to scrape search results as well, so for Google you can use brd_json=1 to output only the meaningful, relevant fields, and for the rest you can just return their results as markdown so they are LLM friendly (and won't blow up the context window with HTML tags). A good example is our MCP server (https://github.com/brightdata/brightdata-mcp): while the implementation is in JS, you can see how we are returning the results as markdown there. Let me know if you have any additional questions on this matter!
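The split suggested in this reply (structured JSON for Google, markdown for every other engine) could be expressed as a small option selector. This is only a sketch: `brd_json=1` comes from the comment above, while the `data_format` key used for markdown output and the function name `build_serp_options` are assumptions that should be checked against BrightData's documentation.

```python
def build_serp_options(engine: str) -> dict[str, str]:
    """Pick output options per search engine (hypothetical helper).

    - Google: ask for parsed JSON via brd_json=1, as suggested in the thread.
    - Anything else: ask for markdown. The "data_format" key is an ASSUMED
      parameter name, not confirmed by this thread.
    """
    if engine.strip().lower() == "google":
        return {"brd_json": "1"}
    return {"data_format": "markdown"}  # assumed name; verify before use
```

Keeping the decision in one place means the component's `run` method would not need engine-specific branches beyond this lookup.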
Hey @anakin87, I have a dilemma. Normally, when using BrightData for scraping, you get results similar to what you’d see in a browser. As @meirk-brd mentioned above, we can use a However, the main issue lies in parsing. The I could add the markdown library for parsing, but that would require implementing different methods to handle results from various search engines (Google, Bing, Yandex, etc.). Should I go for it?
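One way to avoid per-engine parsers, sketched here under the assumption that results arrive as markdown containing ordinary inline links, is to extract `[title](url)` pairs with a single regular expression. The function name `extract_links` is hypothetical, and this approach ignores snippets and ranking, so it is a starting point rather than a full parser.

```python
import re

# Matches [title](http...) but skips images, which start with "![".
_MD_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\((https?://[^\s)]+)\)")


def extract_links(markdown: str) -> list[tuple[str, str]]:
    """Return (title, url) pairs in order of first appearance, without duplicates."""
    seen: set[str] = set()
    results: list[tuple[str, str]] = []
    for title, url in _MD_LINK.findall(markdown):
        if url not in seen:
            seen.add(url)
            results.append((title.strip(), url))
    return results
```

URLs that themselves contain a closing parenthesis are cut short by this pattern, and navigation links on the results page would need to be filtered out separately.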
Hey @RafaelJohn9, thanks for your work on this! A few questions and comments:
Update: I just saw this comment from @julian-risch which is why I see you went for using
If using the BrightData Python SDK makes things easier, I agree with @sjrl that a new Haystack integration would be best. I'm sorry if my comment in the other thread was leading in a more complicated direction. With an additional dependency, the new BrightData web search component would be a perfect fit for https://github.com/deepset-ai/haystack-core-integrations
It will indeed be much easier for you to use it, and we are also working on improving it right now, so it should be a great fit for this PR!
Related Issues
Proposed Changes:
BrightData web search component.
How did you test it?
unit tests (coming soon)
Notes for the reviewer
Checklist
`fix:`, `feat:`, `build:`, `chore:`, `ci:`, `docs:`, `style:`, `refactor:`, `perf:`, `test:` and added `!` in case the PR includes breaking changes.