tutorial_french issues in the Narrative Model #95

Open
Candelaria-Retamal opened this issue Jul 26, 2023 · 1 comment

Candelaria-Retamal commented Jul 26, 2023

Hello! I have recently been using the code to analyze text, but I get an error when running the "tutorial_french" file. I have tried several versions of Python, but the error (see below) persists. After changing each of the NarrativeModel parameters in turn, I found that the problem lies in the clustering parameter. Reinstalling hdbscan did not help either. The tutorial_english file runs without any problem. I would appreciate your help with this issue.

Thank you in advance!

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[13], line 15
      1 from relatio.narrative_models import NarrativeModel
      3 m = NarrativeModel(
      4     clustering = 'hdbscan',
      5     PCA = False,
   (...)
     12     threshold = 0.3
     13 )    
---> 15 m.fit(postproc_roles, weight_by_frequency = True)

File ~/anaconda3/envs/adsml/lib/python3.8/site-packages/relatio/narrative_models.py:158, in NarrativeModel.fit(self, srl_res, pca_args, umap_args, cluster_args, weight_by_frequency, progress_bar)
    156     print("No fitting required, this model is deterministic!")
    157 if self.clustering in ["hdbscan", "kmeans"]:
--> 158     self.fit_static_clustering(
    159         srl_res,
    160         pca_args,
    161         umap_args,
    162         cluster_args,
    163         weight_by_frequency,
    164         progress_bar,
    165     )
    166 if self.clustering == "dynamic":
    167     pass

File ~/anaconda3/envs/adsml/lib/python3.8/site-packages/relatio/narrative_models.py:362, in NarrativeModel.fit_static_clustering(self, srl_res, pca_args, umap_args, cluster_args, weight_by_frequency, progress_bar)
    355     if k not in [
    356         "min_cluster_size",
    357         "min_samples",
    358         "cluster_selection_method",
    359     ]:
    360         args[k] = v
--> 362 hdb = hdbscan.HDBSCAN(**args).fit(self.training_vectors)
    363 models.append(hdb)
    364 score = hdbscan.validity.validity_index(
    365     self.training_vectors.astype(np.float64), hdb.labels_
    366 )

File ~/anaconda3/envs/adsml/lib/python3.8/site-packages/hdbscan/hdbscan_.py:1205, in HDBSCAN.fit(self, X, y)
   1195 kwargs.pop("prediction_data", None)
   1196 kwargs.update(self._metric_kwargs)
   1198 (
   1199     self.labels_,
   1200     self.probabilities_,
   1201     self.cluster_persistence_,
   1202     self._condensed_tree,
   1203     self._single_linkage_tree,
   1204     self._min_spanning_tree,
-> 1205 ) = hdbscan(clean_data, **kwargs)
   1207 if self.metric != "precomputed" and not self._all_finite:
   1208     # remap indices to align with original data in the case of non-finite entries.
   1209     self._condensed_tree = remap_condensed_tree(
   1210         self._condensed_tree, internal_to_raw, outliers
   1211     )

File ~/anaconda3/envs/adsml/lib/python3.8/site-packages/hdbscan/hdbscan_.py:824, in hdbscan(X, min_cluster_size, min_samples, alpha, cluster_selection_epsilon, max_cluster_size, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
    820 elif metric in KDTREE_VALID_METRICS:
    821     # TO DO: Need heuristic to decide when to go to boruvka;
    822     # still debugging for now
    823     if X.shape[1] > 60:
--> 824         (single_linkage_tree, result_min_span_tree) = memory.cache(
    825             _hdbscan_prims_kdtree
    826         )(
    827             X,
    828             min_samples,
    829             alpha,
    830             metric,
    831             p,
    832             leaf_size,
    833             gen_min_span_tree,
    834             **kwargs
    835         )
    836     else:
    837         (single_linkage_tree, result_min_span_tree) = memory.cache(
    838             _hdbscan_boruvka_kdtree
    839         )(
   (...)
    849             **kwargs
    850         )

File ~/anaconda3/envs/adsml/lib/python3.8/site-packages/joblib/memory.py:349, in NotMemorizedFunc.__call__(self, *args, **kwargs)
    348 def __call__(self, *args, **kwargs):
--> 349     return self.func(*args, **kwargs)

File ~/anaconda3/envs/adsml/lib/python3.8/site-packages/hdbscan/hdbscan_.py:265, in _hdbscan_prims_kdtree(X, min_samples, alpha, metric, p, leaf_size, gen_min_span_tree, **kwargs)
    260 core_distances = tree.query(
    261     X, k=min_samples + 1, dualtree=True, breadth_first=True
    262 )[0][:, -1].copy(order="C")
    264 # Mutual reachability distance is implicit in mst_linkage_core_vector
--> 265 min_spanning_tree = mst_linkage_core_vector(X, core_distances, dist_metric, alpha)
    267 # Sort edges of the min_spanning_tree by weight
    268 min_spanning_tree = min_spanning_tree[np.argsort(min_spanning_tree.T[2]), :]

File hdbscan/_hdbscan_linkage.pyx:55, in hdbscan._hdbscan_linkage.mst_linkage_core_vector()

File hdbscan/_hdbscan_linkage.pyx:165, in hdbscan._hdbscan_linkage.mst_linkage_core_vector()

TypeError: 'float' object cannot be interpreted as an integer
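
For readers unfamiliar with this message: it is the standard text Python produces whenever a float is handed to something that requires an integer index. The sketch below reproduces the same message in plain Python using only the standard library. It is an assumption that the compiled hdbscan code trips this same check internally; the helper name `index_error_message` is made up for illustration.

```python
import operator


def index_error_message(value):
    """Return the TypeError text raised when `value` is used as an integer index.

    Returns None if `value` is accepted as an integer.
    """
    try:
        # operator.index() applies the same integer check that range(),
        # slicing, and many C/Cython-level integer arguments rely on.
        operator.index(value)
    except TypeError as exc:
        return str(exc)
    return None


print(index_error_message(2.0))  # a float is rejected
print(index_error_message(2))    # an int is accepted, so this prints None
```

So the traceback does not point to a problem with the French text itself, only to a float reaching a place where an integer is expected.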

PinchOfData (Collaborator) commented

Hi there,
This is a general bug on the hdbscan side (the scikit-learn-contrib package), related to Cython versions.
See the issue here: scikit-learn-contrib/hdbscan#600
We will have to let them figure this out on their side before we can do anything about it.
In the meantime, I would suggest relying on the KMeans algorithm.
Sorry for the inconvenience!
Best,
Germain
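
A minimal sketch of the workaround suggested above, for anyone who wants to decide automatically between the two clustering options. It runs a small hdbscan smoke test on random vectors with more than 60 columns (the branch shown in the traceback) and falls back to `'kmeans'` if anything goes wrong. The helper name `pick_clustering`, the sample sizes, and the `min_cluster_size` value are assumptions made for illustration; they are not part of relatio.

```python
def pick_clustering(n_samples=200, n_features=64, seed=0):
    """Return 'hdbscan' if a small smoke test passes, otherwise 'kmeans'."""
    try:
        import numpy as np
        import hdbscan

        # Random vectors with more than 60 columns, so that hdbscan takes
        # the same code path (X.shape[1] > 60) as in the traceback.
        rng = np.random.default_rng(seed)
        X = rng.normal(size=(n_samples, n_features))
        hdbscan.HDBSCAN(min_cluster_size=5).fit(X)
    except Exception as exc:
        # Covers a missing package, the TypeError reported in this issue,
        # and binary incompatibilities raised at import time.
        print(f"hdbscan is not usable here ({type(exc).__name__}: {exc})")
        return "kmeans"
    return "hdbscan"


clustering = pick_clustering()
print(clustering)
```

The result can then be passed as `clustering = clustering` when building the `NarrativeModel`, with the remaining arguments as in the tutorial. Other arguments may need adjusting for KMeans, so it is worth checking the relatio documentation before relying on this.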
