The table 'dataset_keyword_tables' is too large, resulting in insufficient server storage #8399

Open
starweavingdream opened this issue Sep 14, 2024 · 5 comments
Labels: 🐞 bug Something isn't working

@starweavingdream

Self Checks

  • This is only for bug reports; if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

0.8.0

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Hello, I am using a PG database, and it contains a table named 'dataset_keyword_tables' with a large amount of data. Through testing, uploading a file of about 1 MB can grow this table by as much as 300 MB, which is strange. My table has now reached 300 GB, and I would like to know what this table is actually used for. Can I clean it up? Otherwise, as more and more knowledge base documents are uploaded, this table will keep growing.
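
As a side note for reproducing this, the on-disk footprint can be measured with standard PostgreSQL size functions; the sketch below runs them through SQLAlchemy and assumes access to Dify's db session (extensions.ext_database), but any PostgreSQL client works just as well:

from sqlalchemy import text
from extensions.ext_database import db

# Total on-disk size of the table, including TOAST storage and indexes
size_sql = text("SELECT pg_size_pretty(pg_total_relation_size('dataset_keyword_tables'))")
print(db.session.execute(size_sql).scalar())

# Row count, to relate the size to the number of keyword tables
count_sql = text("SELECT count(*) FROM dataset_keyword_tables")
print(db.session.execute(count_sql).scalar())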

✔️ Expected Behavior

No response

❌ Actual Behavior

No response


dosubot bot commented Sep 14, 2024

Hey @starweavingdream! I'm here to help you with any bugs, questions, or contributions you have for the project. Let me know how I can assist you!

The dataset_keyword_tables table in the PG database is used to store keyword tables associated with datasets. Each entry in this table includes a unique identifier, the dataset ID it is associated with, the keyword table itself (stored as text), and the data source type.
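
For orientation, those columns map roughly onto a model like the following. This is an illustrative sketch based on the description above, not the exact Dify model definition:

from extensions.ext_database import db

class DatasetKeywordTable(db.Model):  # illustrative sketch, not the actual Dify model
    __tablename__ = 'dataset_keyword_tables'

    id = db.Column(db.String, primary_key=True)            # unique identifier
    dataset_id = db.Column(db.String, nullable=False)      # dataset this keyword table belongs to
    keyword_table = db.Column(db.Text, nullable=False)     # serialized keyword index; this text blob is what makes rows large
    data_source_type = db.Column(db.String, nullable=False)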

To clean up the dataset_keyword_tables table and prevent it from becoming excessively large, you can remove entries that are no longer needed. This can be done by identifying and deleting rows that are associated with datasets that are no longer in use or have been deleted.

Here is an example of how you might perform such a cleanup operation:

from extensions.ext_database import db
from models.dataset import Dataset, DatasetKeywordTable

def cleanup_dataset_keyword_tables():
    # Find all dataset IDs that are still in use (IDs only, not full rows)
    active_dataset_ids = {dataset_id for (dataset_id,) in db.session.query(Dataset.id)}

    # Walk the keyword tables by primary key and dataset_id only, so the large
    # keyword_table text column is never loaded into memory
    stale_ids = [
        row_id
        for row_id, dataset_id in db.session.query(
            DatasetKeywordTable.id, DatasetKeywordTable.dataset_id
        )
        if dataset_id not in active_dataset_ids
    ]

    # Delete entries that are not associated with active datasets
    if stale_ids:
        db.session.query(DatasetKeywordTable).filter(
            DatasetKeywordTable.id.in_(stale_ids)
        ).delete(synchronize_session=False)

    # Commit the changes to the database
    db.session.commit()

# Call the cleanup function
cleanup_dataset_keyword_tables()

Additionally, Dify has an automated cleanup mechanism in the form of a scheduled task named clean_unused_datasets_task that manages the size of the 'dataset_keyword_tables' table. This task removes unused datasets and their associated indexes if they haven't been updated within a specified number of days [1].

For more details on the structure and indexing of the table, you can refer to the initial database setup in the Dify platform [2][3].

If you need to limit the size of the 'dataset_keyword_tables' table, you can use the max_keywords_per_chunk setting in the KeywordTableConfig class, which controls the number of keywords extracted per chunk of text [4].
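
If KeywordTableConfig is a plain pydantic-style settings object, lowering that value would look roughly like the following (an illustrative sketch; check the actual class and where it is instantiated in the codebase before relying on this):

from pydantic import BaseModel

class KeywordTableConfig(BaseModel):  # stand-in for the class referenced above
    max_keywords_per_chunk: int = 10

# Fewer keywords per chunk means a smaller serialized keyword table per document
config = KeywordTableConfig(max_keywords_per_chunk=5)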

To continue talking to Dosu, mention @dosu.

@dosubot dosubot bot added the 🐞 bug Something isn't working label Sep 14, 2024
@Scallions

Same problem. Did you find any solution? I found that this table is only used by keyword search. When indexing documents, this table slows down the process.
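
Before deleting anything, it can help to see which datasets account for most of the bloat; the keyword_table text column can be measured per row with standard PostgreSQL functions (only the table and column names come from this thread):

from sqlalchemy import text
from extensions.ext_database import db

# Largest keyword tables first; octet_length returns the stored text size in bytes
top_sql = text("""
    SELECT dataset_id, pg_size_pretty(octet_length(keyword_table)::bigint) AS size
    FROM dataset_keyword_tables
    ORDER BY octet_length(keyword_table) DESC
    LIMIT 20
""")
for dataset_id, size in db.session.execute(top_sql):
    print(dataset_id, size)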

@starweavingdream
Author

starweavingdream commented Oct 15, 2024

@Scallions I tried the solution given by the robot, and there were no particularly obvious problems after I cleared the data. If you want to try my method, I suggest you back up your data first.
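
One lightweight way to follow that backup advice is to copy just this table inside PostgreSQL before running the cleanup; the backup table name below is only an example:

from sqlalchemy import text
from extensions.ext_database import db

# Keep a copy of the rows so they can be restored if the cleanup removes too much
db.session.execute(text(
    "CREATE TABLE dataset_keyword_tables_backup AS TABLE dataset_keyword_tables"
))
db.session.commit()

Alternatively, pg_dump -t dataset_keyword_tables writes the same data to a file outside the database.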

@glacierck

@starweavingdream I need you! I'm going nuts!
[screenshot]

@luckylhb90
Contributor

Same problem here...
So difficult...
The system has a lot that needs to change...
