
[Dataset] Question about the 20newsgroups dataset used for text classification #1

HoAnhKhoaVN opened this issue on Jun 25, 2024

After a thorough review of your article and some preliminary experimentation with the 20newsgroups dataset, I have encountered a few areas of uncertainty that I hope you can clarify to improve the precision and scientific rigor of my work:

  1. The dataset sizes D_1 = 11,314 and D_2 = 7,532 reported in your paper match the sizes of the standard 20newsgroups train and test splits exactly (a quick check of these sizes follows this list). Am I correct in understanding that you used the official training and test sets to segment the data into D_1 and D_2?
  2. In the GitHub repository (https://github.com/Crisp-Unimib/ContrXT), under the directory tests/test_data, there are two CSV files: df_time_1.csv with 8,486 rows and df_time_2.csv with 4,533 rows. Could you elaborate on how these specific subsets were generated from the original 20newsgroups data?
  3. I am keen to understand which specific subsets of the data were used to train the text classification model that generated the reported experimental results.
  4. Your example on GitHub trains two independent models on datasets D_1 and D_2, respectively. I am trying to understand whether this method effectively captures the change in feature importance between the two training phases, t_1 and t_2. From my understanding, training on D_1 at t_1 and then continuing from those weights on D_2 at t_2 might better reflect the evolution of the features (see the warm-start sketch after this list). Could you share your insights on this approach?
  5. The training data used in your models appears to retain metadata such as headers, footers, and quotes, which is typically removed when preprocessing 20newsgroups data, as recommended by scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html); a sketch of that cleaning step also follows this list. Would more meticulous data cleaning make the model's explanations more meaningful?
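
Regarding question 1, the two reported sizes line up exactly with scikit-learn's canonical splits, which is easy to verify (a minimal check, assuming the default `fetch_20newsgroups` subsets):

```python
from sklearn.datasets import fetch_20newsgroups

# The canonical scikit-learn splits: 11,314 training documents and
# 7,532 test documents, matching the D_1 and D_2 sizes in the paper.
train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")
print(len(train.data), len(test.data))  # 11314 7532
```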
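
To make question 4 concrete, here is a rough sketch of the warm-start alternative I have in mind, using `SGDClassifier.partial_fit` with a shared `HashingVectorizer` feature space. This is only an illustration of the idea, not your method; the choice of model and vectorizer is my own assumption:

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# A fixed hashing space keeps the feature weights comparable across phases.
vec = HashingVectorizer(n_features=2**18)

d1 = fetch_20newsgroups(subset="train")  # stand-in for D_1 at t_1
d2 = fetch_20newsgroups(subset="test")   # stand-in for D_2 at t_2

clf = SGDClassifier(random_state=0)

# Phase t_1: fit on D_1 (all classes must be declared on the first call).
clf.partial_fit(vec.transform(d1.data), d1.target, classes=np.arange(20))
w_t1 = clf.coef_.copy()

# Phase t_2: continue from the t_1 weights instead of retraining from scratch.
clf.partial_fit(vec.transform(d2.data), d2.target)
w_t2 = clf.coef_

drift = w_t2 - w_t1  # per-feature change in weight between t_1 and t_2
```

Comparing `w_t1` and `w_t2` would then measure how each feature's weight evolved across the two phases, rather than comparing two models trained from independent initializations.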
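
And for question 5, the cleaning recommended in the scikit-learn documentation is a one-line change to the loading call; the sketch below shows only that step, with the rest of the pipeline left as in your example:

```python
from sklearn.datasets import fetch_20newsgroups

# Strip newsgroup headers, signature footers, and quoted replies, as
# recommended in the scikit-learn docs, so the model cannot latch onto
# metadata artifacts instead of the message text.
clean_train = fetch_20newsgroups(
    subset="train",
    remove=("headers", "footers", "quotes"),
)
```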

I am grateful for your pioneering work in this field and eagerly anticipate your guidance to refine my research further.
