Is there a reason duplicates are eliminated? Say we are scraping triples from a large internet database. Some relations will be strongly reinforced because they appear many times, while others may appear only once or twice (e.g., on an incorrect website). Trained on the entire corpus, the model would learn to downweight the incorrect relations.

Concretely, suppose a relation appears about 100 times in the corpus: 99 websites got it right and 1 got it wrong. Trained on the full corpus, the correct information outweighs the wrong information in the embeddings. But if duplicates are eliminated, we are left with one correct and one incorrect triple, and the frequency signal that would let training prefer the correct one is gone.

I understand this is not a bug and the code is designed to work this way, but I am curious about your thoughts on this situation, and whether modifying the source code to prevent duplicate elimination could help produce meaningful embeddings in the scenario above. One alternative I was considering is sketched below.
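Instead of patching out the deduplication, one option might be to collapse exact duplicates into frequency counts and feed those counts in as per-triple loss weights, so a triple seen 99 times contributes 99x the gradient of a triple seen once. Here is a minimal sketch of that idea; the tab-separated file layout and the helper names (`load_triples_with_counts`, `triples_and_weights`) are just assumptions for illustration, not this repo's API:

```python
from collections import Counter

def load_triples_with_counts(path):
    """Read (head, relation, tail) triples from a TSV file, keeping
    the number of times each triple occurs instead of collapsing
    exact duplicates into a single entry.

    Hypothetical helper: the file format is assumed, not this repo's.
    """
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 3:
                counts[tuple(parts)] += 1
    return counts

def triples_and_weights(counts):
    """Turn frequency counts into a deduplicated triple list plus
    per-sample weights (relative frequency), which could then be
    multiplied into the per-triple loss during training."""
    triples, weights = [], []
    total = sum(counts.values())
    for triple, n in counts.items():
        triples.append(triple)
        weights.append(n / total)  # frequent triples get larger weight
    return triples, weights
```

This would keep the training set deduplicated (so the existing data pipeline is untouched) while still preserving the corpus frequency signal, which seems equivalent in effect to training on the raw corpus with duplicates intact.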