tutorial_french issues in the Narrative Model #95

Candelaria-Retamal · 2023-07-26T13:34:58Z

Hello! I have been using the code recently to analyze text, but I encounter an error when executing the "tutorial_french" file. I have already tried different versions of Python, but the error (see below) persists. I tried changing each of the parameters in the NarrativeModel and found that the issue is the clustering parameter. I tried reinstalling hdbscan, but the error persists. With tutorial_english I did not have any problem. I appreciate your help with this issue.

Thank you in advance!

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[13], line 15
      1 from relatio.narrative_models import NarrativeModel
      3 m = NarrativeModel(
      4     clustering = 'hdbscan',
      5     PCA = False,
   (...)
     12     threshold = 0.3
     13 )    
---> 15 m.fit(postproc_roles, weight_by_frequency = True)

File ~/anaconda3/envs/adsml/lib/python3.8/site-packages/relatio/narrative_models.py:158, in NarrativeModel.fit(self, srl_res, pca_args, umap_args, cluster_args, weight_by_frequency, progress_bar)
    156     print("No fitting required, this model is deterministic!")
    157 if self.clustering in ["hdbscan", "kmeans"]:
--> 158     self.fit_static_clustering(
    159         srl_res,
    160         pca_args,
    161         umap_args,
    162         cluster_args,
    163         weight_by_frequency,
    164         progress_bar,
    165     )
    166 if self.clustering == "dynamic":
    167     pass

File ~/anaconda3/envs/adsml/lib/python3.8/site-packages/relatio/narrative_models.py:362, in NarrativeModel.fit_static_clustering(self, srl_res, pca_args, umap_args, cluster_args, weight_by_frequency, progress_bar)
    355     if k not in [
    356         "min_cluster_size",
    357         "min_samples",
    358         "cluster_selection_method",
    359     ]:
    360         args[k] = v
--> 362 hdb = hdbscan.HDBSCAN(**args).fit(self.training_vectors)
    363 models.append(hdb)
    364 score = hdbscan.validity.validity_index(
    365     self.training_vectors.astype(np.float64), hdb.labels_
    366 )

File ~/anaconda3/envs/adsml/lib/python3.8/site-packages/hdbscan/hdbscan_.py:1205, in HDBSCAN.fit(self, X, y)
   1195 kwargs.pop("prediction_data", None)
   1196 kwargs.update(self._metric_kwargs)
   1198 (
   1199     self.labels_,
   1200     self.probabilities_,
   1201     self.cluster_persistence_,
   1202     self._condensed_tree,
   1203     self._single_linkage_tree,
   1204     self._min_spanning_tree,
-> 1205 ) = hdbscan(clean_data, **kwargs)
   1207 if self.metric != "precomputed" and not self._all_finite:
   1208     # remap indices to align with original data in the case of non-finite entries.
   1209     self._condensed_tree = remap_condensed_tree(
   1210         self._condensed_tree, internal_to_raw, outliers
   1211     )

File ~/anaconda3/envs/adsml/lib/python3.8/site-packages/hdbscan/hdbscan_.py:824, in hdbscan(X, min_cluster_size, min_samples, alpha, cluster_selection_epsilon, max_cluster_size, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
    820 elif metric in KDTREE_VALID_METRICS:
    821     # TO DO: Need heuristic to decide when to go to boruvka;
    822     # still debugging for now
    823     if X.shape[1] > 60:
--> 824         (single_linkage_tree, result_min_span_tree) = memory.cache(
    825             _hdbscan_prims_kdtree
    826         )(
    827             X,
    828             min_samples,
    829             alpha,
    830             metric,
    831             p,
    832             leaf_size,
    833             gen_min_span_tree,
    834             **kwargs
    835         )
    836     else:
    837         (single_linkage_tree, result_min_span_tree) = memory.cache(
    838             _hdbscan_boruvka_kdtree
    839         )(
   (...)
    849             **kwargs
    850         )

File ~/anaconda3/envs/adsml/lib/python3.8/site-packages/joblib/memory.py:349, in NotMemorizedFunc.__call__(self, *args, **kwargs)
    348 def __call__(self, *args, **kwargs):
--> 349     return self.func(*args, **kwargs)

File ~/anaconda3/envs/adsml/lib/python3.8/site-packages/hdbscan/hdbscan_.py:265, in _hdbscan_prims_kdtree(X, min_samples, alpha, metric, p, leaf_size, gen_min_span_tree, **kwargs)
    260 core_distances = tree.query(
    261     X, k=min_samples + 1, dualtree=True, breadth_first=True
    262 )[0][:, -1].copy(order="C")
    264 # Mutual reachability distance is implicit in mst_linkage_core_vector
--> 265 min_spanning_tree = mst_linkage_core_vector(X, core_distances, dist_metric, alpha)
    267 # Sort edges of the min_spanning_tree by weight
    268 min_spanning_tree = min_spanning_tree[np.argsort(min_spanning_tree.T[2]), :]

File hdbscan/_hdbscan_linkage.pyx:55, in hdbscan._hdbscan_linkage.mst_linkage_core_vector()

File hdbscan/_hdbscan_linkage.pyx:165, in hdbscan._hdbscan_linkage.mst_linkage_core_vector()

TypeError: 'float' object cannot be interpreted as an integer

PinchOfData · 2023-07-27T13:44:02Z

Hi there,
This is a general bug on the hdbscan side (scikit-learn-contrib), related to Cython versions.
See the issue here: scikit-learn-contrib/hdbscan#600
We will have to let them figure this out on their side before we can do anything about it.
In the meantime, I would suggest relying on the KMeans algorithm.
Sorry for the inconvenience!
Best,
Germain
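
For anyone landing here, the KMeans suggestion above could be sketched as a small fallback wrapper. This is only an illustration, not part of relatio: `fit_with_fallback` and `build_model` are hypothetical names, and the wrapper assumes nothing beyond what the traceback shows, namely that the model takes a `clustering` argument ("hdbscan" or "kmeans") and has a `fit` method. It retries with KMeans only when the specific TypeError from this issue is raised, and re-raises anything else.

```python
def fit_with_fallback(build_model, data, preferred="hdbscan",
                      fallback="kmeans", **fit_kwargs):
    """Fit with the preferred clustering; retry with the fallback on this bug.

    build_model is a hypothetical callable that takes a clustering name and
    returns an unfitted model exposing .fit(data, **fit_kwargs).
    Returns (fitted_model, name_of_clustering_used).
    """
    try:
        model = build_model(preferred)
        model.fit(data, **fit_kwargs)
        return model, preferred
    except TypeError as exc:
        # Only swallow the error reported in this issue; re-raise the rest.
        if "cannot be interpreted as an integer" not in str(exc):
            raise
        model = build_model(fallback)
        model.fit(data, **fit_kwargs)
        return model, fallback
```

With relatio this might be called as `fit_with_fallback(lambda c: NarrativeModel(clustering=c, ...), postproc_roles, weight_by_frequency=True)`, keeping the remaining NarrativeModel arguments as in the tutorial. Note that KMeans may need cluster-related settings that hdbscan does not, so the arguments may have to be adjusted.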
