
[Dataset] Question about the 20newsgroups dataset used for text classification #1

HoAnhKhoaVN opened this issue on Jun 25, 2024

After a thorough review of your article and some preliminary experimentation with the 20newsgroups dataset, I have encountered a few areas of uncertainty that I hope you can clarify to improve the precision and scientific rigor of my work:

  1. The dataset sizes D_1 = 11,314 and D_2 = 7,532 reported in your paper match the sizes of the standard 20newsgroups train and test splits exactly (a quick check of these sizes follows this list). Am I correct in understanding that you used the official training and test sets to segment the data into D_1 and D_2?
  2. In the GitHub repository (https://github.com/Crisp-Unimib/ContrXT), under the directory tests/test_data, there are two CSV files: df_time_1.csv with 8,486 rows and df_time_2.csv with 4,533 rows. Could you elaborate on how these specific subsets were generated from the original 20newsgroups data?
  3. I am keen to understand which specific subsets of the data were used to train the text classification model that generated the reported experimental results.
  4. Your example on GitHub trains two independent models on datasets D_1 and D_2, respectively. I am trying to understand whether this method effectively captures the change in feature importance between the two training phases, t_1 and t_2. From my understanding, training on D_1 at t_1 and then continuing from those weights on D_2 at t_2 might better reflect the evolution of the features (see the warm-start sketch after this list). Could you share your insights on this approach?
  5. The training data used in your models appears to retain metadata such as headers, footers, and quotes, which is typically removed when preprocessing 20newsgroups data, as recommended by scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html); a sketch of that cleaning step also follows this list. Would more meticulous data cleaning make the model's explanations more meaningful?
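
Regarding question 1, the two reported sizes line up exactly with scikit-learn's canonical splits, which is easy to verify (a minimal check, assuming the default `fetch_20newsgroups` subsets):

```python
from sklearn.datasets import fetch_20newsgroups

# The canonical scikit-learn splits: 11,314 training documents and
# 7,532 test documents, matching the D_1 and D_2 sizes in the paper.
train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")
print(len(train.data), len(test.data))  # 11314 7532
```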
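
To make question 4 concrete, here is a rough sketch of the warm-start alternative I have in mind, using `SGDClassifier.partial_fit` with a shared `HashingVectorizer` feature space. This is only an illustration of the idea, not your method; the choice of model and vectorizer is my own assumption:

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# A fixed hashing space keeps the feature weights comparable across phases.
vec = HashingVectorizer(n_features=2**18)

d1 = fetch_20newsgroups(subset="train")  # stand-in for D_1 at t_1
d2 = fetch_20newsgroups(subset="test")   # stand-in for D_2 at t_2

clf = SGDClassifier(random_state=0)

# Phase t_1: fit on D_1 (all classes must be declared on the first call).
clf.partial_fit(vec.transform(d1.data), d1.target, classes=np.arange(20))
w_t1 = clf.coef_.copy()

# Phase t_2: continue from the t_1 weights instead of retraining from scratch.
clf.partial_fit(vec.transform(d2.data), d2.target)
w_t2 = clf.coef_

drift = w_t2 - w_t1  # per-feature change in weight between t_1 and t_2
```

Comparing `w_t1` and `w_t2` would then measure how each feature's weight evolved across the two phases, rather than comparing two models trained from independent initializations.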
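
And for question 5, the cleaning recommended in the scikit-learn documentation is a one-line change to the loading call; the sketch below shows only that step, with the rest of the pipeline left as in your example:

```python
from sklearn.datasets import fetch_20newsgroups

# Strip newsgroup headers, signature footers, and quoted replies, as
# recommended in the scikit-learn docs, so the model cannot latch onto
# metadata artifacts instead of the message text.
clean_train = fetch_20newsgroups(
    subset="train",
    remove=("headers", "footers", "quotes"),
)
```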

I am grateful for your pioneering work in this field and eagerly anticipate your guidance to refine my research further.
