
Datafusion Native llm.txt #13501

Open
ChristianCasazza opened this issue Nov 20, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@ChristianCasazza

Is your feature request related to a problem or challenge?

LLMs provide a fantastic way to learn and use a new codebase. Given the documentation, they can create a custom guide that teaches new users how to use a library or API, answer specific questions, and help address bugs with custom implementations of the library. One trend emerging from this paradigm is to maintain two versions of the documentation. The first is the traditional version, built for humans, which separates the codebase into distinct sections that are easy to browse. The other version is optimized for LLMs: a single markdown file that strips the human-oriented formatting and puts all of the information in one central file that can easily be copied and pasted into an LLM chat, which can then teach the user the library.

DuckDB recently released their version of an llm.txt here. It is essentially one huge markdown file that includes all of their documentation, totaling about 700k tokens. Caleb Fahlgren from Hugging Face extracted the data into an organized version here.

I would like to propose making a DataFusion version of these LLM docs.

Describe the solution you'd like

I would propose making a few versions of the llm.txt for DataFusion. We should make one version that includes all of the docs in one large md file. This is a strong start; however, it will likely be so large that it doesn't fit within common chat LLMs' context windows. Therefore, I would also suggest making smaller versions for the different sections of the docs, such as architecture, API, etc. The individual sections would make it easier for a user to selectively provide the context for the specific part of DataFusion they are working with, so as not to overload the LLM's context.
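
As a rough sketch of how the single-file and per-section versions could be generated (the directory layout and output names here are illustrative assumptions, not the actual DataFusion repo structure):

```python
from pathlib import Path


def build_llm_txt(docs_root: str, out_path: str) -> None:
    """Concatenate every Markdown file under docs_root into one llm.txt,
    tagging each chunk with its source path so the LLM keeps provenance."""
    parts = []
    for md_file in sorted(Path(docs_root).rglob("*.md")):
        rel = md_file.relative_to(docs_root)
        parts.append(f"\n\n<!-- Source: {rel} -->\n\n")
        parts.append(md_file.read_text(encoding="utf-8"))
    Path(out_path).write_text("".join(parts), encoding="utf-8")


def build_section_files(docs_root: str, out_dir: str) -> None:
    """Emit one smaller llm-<section>.txt per top-level docs subdirectory,
    so users can paste only the section relevant to their task."""
    root = Path(docs_root)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for section in sorted(p for p in root.iterdir() if p.is_dir()):
        parts = []
        for md_file in sorted(section.rglob("*.md")):
            parts.append(f"\n\n<!-- Source: {md_file.relative_to(root)} -->\n\n")
            parts.append(md_file.read_text(encoding="utf-8"))
        if parts:
            out = Path(out_dir, f"llm-{section.name}.txt")
            out.write_text("".join(parts), encoding="utf-8")
```

A script like this could run in CI so the generated files track the latest released docs rather than going stale.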

After some trial and error, I would also suggest creating fully LLM-optimized versions. These would include a mixture of conceptual explanations of DataFusion, example code snippets, and the raw API interface. The goal for the final versions would be simple templates that can be copied and pasted into a chat, priming the LLM with the context of the latest version of DataFusion along with knowledge of working examples.
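
To decide which version of the docs a user should paste, a crude size check can help. The common rule of thumb of roughly 4 characters per token for English text is only a heuristic, and the 128k-token window and 8k reserve below are illustrative assumptions, not any particular model's limits:

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English text.
    Real tokenizer counts vary by model and content."""
    return len(text) // 4


def fits_context(text: str, context_window: int = 128_000, reserve: int = 8_000) -> bool:
    """Check whether the docs likely fit in the window, reserving room
    for the user's actual conversation on top of the pasted context."""
    return estimate_tokens(text) <= context_window - reserve
```

By this estimate, DuckDB's ~700k-token file would need to be split several ways before it fits most chat interfaces, which supports the per-section approach above.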

Describe alternatives you've considered

One popular alternative to LLM docs is for companies to simply build their own chatbot that has the context of their docs. While this is useful, I think it misses the point. I believe we can assume that, in the future, developers will already be paying for their own LLM, whether through chat (ChatGPT, Claude), IDEs (Cursor), or their own setup built on the LLM APIs.

Therefore, I don't think it is the best model for each company to run its own chatbot. It becomes difficult for a user to combine context across the different libraries they use together, and they must iterate in a company's chosen interface instead of the LLM interface they are already comfortable with.

Instead, it would be better to provide the raw context and allow users to bring it into the LLM interface they are already using.

Additional context

I think LLM-paired development is the future of data engineering, and having first-class LLM support is vital for the adoption of DataFusion. As an example, consider pandas and Polars. Even though Polars offers massive improvements over pandas, there is an order of magnitude more public example code for pandas than for Polars. As a result, LLMs will often suggest pandas code first and often produce better working code than they do for Polars. Even though Polars is the better underlying library, I think many new developers will simply use whatever works best with LLMs. I believe this is part of why DuckDB has been so popular, since LLMs are already good at writing SQL compared to dataframe code.

By creating first-class LLM support for DataFusion, I think it can be positioned to gain developer mindshare as modern Arrow-based engines become the common-sense choice.

@ChristianCasazza added the enhancement label on Nov 20, 2024
@timsaucer
Contributor

This sounds great. I suspect we would want to do something also in the datafusion-python repository.

@tbar4

tbar4 commented Nov 20, 2024

Thirded for Ballista Docs

@2010YOUY01
Contributor

This is a great idea.

I would propose making a few versions of the llm.txt for DataFusion. We should make one version that includes all of the docs in one large md file. This is a strong start; however, it will likely be so large that it doesn't fit within common chat LLMs' context windows.

I've been using Cursor + Claude for a while, and its code generation for DataFusion is shockingly good.
It uses RAG to index everything, so context length is likely not a problem for Cursor. I'm curious whether adding more indexable documents can make it better.
For example, this very high-quality reading list: https://datafusion.apache.org/user-guide/concepts-readings-events.html
