
Commit 2bb1d77

tabergma and sanchariGr authored
Replace pickle with safer alternatives (#13067)
* Update slack release notification step
* [ENG-1424] Use `pickle` alternatives (#1453)
* use json.dump and json.load in count_vectors_featurizer and lexical_syntactic_featurizer instead of pickle
* update load and persist in sklearn intent classifier
* update persist and load in dietclassifier
* update load and persist in sklearn intent classifier
* use json.dump and json.load in tracker featurizers
* update persist and load of TEDPolicy
* updated unexpected intent policy persist and load of model utilities.
* save and load fake features
* rename patterns.pkl to patterns.json
* update poetry.lock
* ruff formatting
* move skops import
* add comments
* clean up save_features and load_features
* WIP: update model data saving and loading
* add tests for save and load features
* update tests for test_tracker_featurizer
* update tests for test_tracker_featurizer
* WIP: serialization of feature arrays.
* update serialization and deserialization for feature array
* remove not needed tests/utils/tensorflow/test_model_data_storage.py
* start writing tests for feature array
* update feature array tests
* update tests
* fix linting
* add changelog
* add new dependencies to .github/dependabot.yml
* fix some tests
* fix loading and saving of unexpected intent ted policy
* fix linting issue
* fix converting of features in cvf and lsf
* fix lint issues
* convert vocab in cvf
* fix linting
* update crf entity extractor
* fix to_dict of crf_token
* addressed type issues
* ruff formatting
* fix typing and lint issues
* remove cloudpickle dependency
* update logistic_regression_classifier and remove joblib as dependency
* update formatting of pyproject.toml
* next try: update formatting of pyproject.toml
* update logging
* update poetry.lock
* refactor loading of lexical_syntactic_featurizer
* rename FeatureMetadata.type -> FeatureMetadata.data_type
* clean up tests test_features.py and test_crf_entity_extractor.py
* update test_feature_array.py
* check for type when loading tracker featurizer.
* update changelog
* fix line too long
* move import of skops
* Prepared release of version 3.10.9.dev1 (#1496)
* prepared release of version 3.10.9.dev1
* update minimum model version
* Check for 'step_id' and 'active_flow' keys in the metadata when adding 'ActionExecuted' event to flows paths stack.
* fix parsing of commands
* improve logging
* formatting
* add changelog
* fix parse commands for multi step
* [ATO-2985] - Windows model loading test (#1537)
* Add test for model loading on windows
* Improve the error message logged when handling the user message
* Add a changelog
* Fix Code Quality - line too long
* Rasa-sdk-update (#1546)
* all rasa-sdk micro updates
* update poetry lock
* update rasa-sdk in lock file
* Remove trailing white space
* Prepared release of version 3.10.11 (#1570)
* prepared release of version 3.10.11
* add comments again in pyproject.toml
* update poetry.lock
* revert changes in github workflows
* undo changes in pyproject.toml
* update changelog
* revert changes in github workflows
* update poetry.lock
* update poetry.lock
* update pyproject.toml
* update poetry.lock
* update setuptools = '>=65.5.1,<75.6.0'
* update setuptools = '~75.3.0'
* reformat code
* undo deleting of ping_slack_about_package_release.sh
* fix formatting and type issues
* downgrade setuptools to 70.3.0
* fixing logging issues (?)

---------

Co-authored-by: sancharigr <[email protected]>
1 parent 66296b2 commit 2bb1d77

27 files changed (+1942 −616 lines)

.github/workflows/continous-integration.yml (−1 line)

@@ -1307,7 +1307,6 @@ jobs:
       with:
         args: "💥 New *Rasa Open Source * version `${{ github.ref_name }}` has been released!"

-
  send_slack_notification_for_release_on_failure:
    name: Notify Slack & Publish Release Notes
    runs-on: ubuntu-24.04

changelog/1424.bugfix.md (+19 lines)

@@ -0,0 +1,19 @@
+Replace `pickle` and `joblib` with safer alternatives, e.g. `json`, `safetensors`, and `skops`, for
+serializing components.
+
+**Note**: This is a model breaking change. Please retrain your model.
+
+If you have a custom component that inherits from one of the components listed below and modified the `persist` or
+`load` method, make sure to update your code. Please contact us in case you encounter any problems.
+
+Affected components:
+
+- `CountVectorFeaturizer`
+- `LexicalSyntacticFeaturizer`
+- `LogisticRegressionClassifier`
+- `SklearnIntentClassifier`
+- `DIETClassifier`
+- `CRFEntityExtractor`
+- `TrackerFeaturizer`
+- `TEDPolicy`
+- `UnexpectedIntentTEDPolicy`
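
For custom components that override `persist`/`load`, the migration typically amounts to replacing the pickle round trip with a plain-data round trip. The sketch below is illustrative only: `MyFeaturizer`, its `vocabulary` attribute, and the file name are hypothetical and not part of the Rasa API.

```python
import json
from pathlib import Path


class MyFeaturizer:
    """Hypothetical custom component, used only to illustrate the migration."""

    def __init__(self, vocabulary=None):
        self.vocabulary = vocabulary or {}

    # Before: pickle.dump(...) in persist() and pickle.load(...) in load().
    def persist(self, model_dir: Path) -> None:
        # Persist only plain data (dicts, lists, strings, numbers) as JSON.
        with open(model_dir / "my_featurizer.json", "w", encoding="utf-8") as f:
            json.dump({"vocabulary": self.vocabulary}, f)

    @classmethod
    def load(cls, model_dir: Path) -> "MyFeaturizer":
        # Parsing JSON cannot execute arbitrary code, unlike unpickling.
        with open(model_dir / "my_featurizer.json", encoding="utf-8") as f:
            data = json.load(f)
        return cls(vocabulary=data["vocabulary"])
```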

poetry.lock (+342 −112)

Generated file; the diff is not rendered by default.

pyproject.toml (+5 −4)

@@ -120,7 +120,6 @@ sanic-cors = "~2.0.0"
 sanic-jwt = "^1.6.0"
 sanic-routing = "^0.7.2"
 websockets = ">=10.0,<11.0"
-cloudpickle = ">=1.2,<2.3"
 aiohttp = ">=3.9.0,<3.10"
 questionary = ">=1.5.1,<1.11.0"
 prompt-toolkit = "^3.0,<3.0.29"
@@ -133,10 +132,9 @@ psycopg2-binary = ">=2.8.2,<2.10.0"
 python-dateutil = "~2.8"
 protobuf = ">=4.23.3,< 4.23.4"
 tensorflow_hub = "^0.13.0"
-setuptools = ">=65.5.1"
+setuptools = "~70.3.0"
 ujson = ">=1.35,<6.0"
 regex = ">=2020.6,<2022.11"
-joblib = ">=0.15.1,<1.3.0"
 sentry-sdk = ">=0.17.0,<1.15.0"
 aio-pika = ">=6.7.1,<8.2.4"
 aiogram = "<2.26"
@@ -156,6 +154,9 @@ dnspython = "2.3.0"
 wheel = ">=0.38.1"
 certifi = ">=2023.7.22"
 cryptography = ">=41.0.7"
+skops = "0.9.0"
+safetensors = "~0.4.5"
+
 [[tool.poetry.dependencies.tensorflow-io-gcs-filesystem]]
 version = "==0.31"
 markers = "sys_platform == 'win32'"
@@ -285,7 +286,7 @@ version = "~3.2.0"
 optional = true

 [tool.poetry.dependencies.transformers]
-version = ">=4.13.0, <=4.26.0"
+version = "~4.36.2"
 optional = true

 [tool.poetry.dependencies.sentencepiece]
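
The two new dependencies cover what `pickle` and `joblib` used to do: `skops` persists scikit-learn estimators behind an auditable allow-list, and `safetensors` stores plain numeric arrays. A minimal usage sketch of the libraries themselves (generic library calls with placeholder file names, not Rasa's internal persistence code):

```python
import numpy as np
from safetensors.numpy import load_file, save_file
from skops.io import dump, get_untrusted_types, load
from sklearn.linear_model import LogisticRegression

# skops: persist a scikit-learn estimator without pickle.
clf = LogisticRegression().fit([[0.0], [1.0]], [0, 1])
dump(clf, "classifier.skops")
untrusted = get_untrusted_types(file="classifier.skops")  # audit before loading
clf_restored = load("classifier.skops", trusted=untrusted)

# safetensors: persist raw numeric arrays (e.g. feature arrays or weights).
save_file({"weights": np.ones((2, 3), dtype=np.float32)}, "features.safetensors")
tensors = load_file("features.safetensors")
```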

rasa/core/featurizers/single_state_featurizer.py (+22 −1)

@@ -1,7 +1,8 @@
 import logging
+from typing import List, Optional, Dict, Text, Set, Any
+
 import numpy as np
 import scipy.sparse
-from typing import List, Optional, Dict, Text, Set, Any

 from rasa.core.featurizers.precomputation import MessageContainerForCoreFeaturization
 from rasa.nlu.extractors.extractor import EntityTagSpec
@@ -362,6 +363,26 @@ def encode_all_labels(
             for action in domain.action_names_or_texts
         ]

+    def to_dict(self) -> Dict[str, Any]:
+        return {
+            "action_texts": self.action_texts,
+            "entity_tag_specs": self.entity_tag_specs,
+            "feature_states": self._default_feature_states,
+        }
+
+    @classmethod
+    def create_from_dict(
+        cls, data: Dict[str, Any]
+    ) -> Optional["SingleStateFeaturizer"]:
+        if not data:
+            return None
+
+        featurizer = SingleStateFeaturizer()
+        featurizer.action_texts = data["action_texts"]
+        featurizer._default_feature_states = data["feature_states"]
+        featurizer.entity_tag_specs = data["entity_tag_specs"]
+        return featurizer
+

 class IntentTokenizerSingleStateFeaturizer(SingleStateFeaturizer):
     """A SingleStateFeaturizer for use with policies that predict intent labels."""

rasa/core/featurizers/tracker_featurizers.py (+115 −18)

@@ -1,11 +1,9 @@
 from __future__ import annotations
-from pathlib import Path
-from collections import defaultdict
-from abc import abstractmethod
-import jsonpickle
-import logging

-from tqdm import tqdm
+import logging
+from abc import abstractmethod
+from collections import defaultdict
+from pathlib import Path
 from typing import (
     Tuple,
     List,
@@ -18,25 +16,30 @@
     Set,
     DefaultDict,
     cast,
+    Type,
+    Callable,
+    ClassVar,
 )
+
 import numpy as np
+from tqdm import tqdm

-from rasa.core.featurizers.single_state_featurizer import SingleStateFeaturizer
-from rasa.core.featurizers.precomputation import MessageContainerForCoreFeaturization
-from rasa.core.exceptions import InvalidTrackerFeaturizerUsageError
 import rasa.shared.core.trackers
 import rasa.shared.utils.io
-from rasa.shared.nlu.constants import TEXT, INTENT, ENTITIES, ACTION_NAME
-from rasa.shared.nlu.training_data.features import Features
-from rasa.shared.core.trackers import DialogueStateTracker
-from rasa.shared.core.domain import State, Domain
-from rasa.shared.core.events import Event, ActionExecuted, UserUttered
+from rasa.core.exceptions import InvalidTrackerFeaturizerUsageError
+from rasa.core.featurizers.precomputation import MessageContainerForCoreFeaturization
+from rasa.core.featurizers.single_state_featurizer import SingleStateFeaturizer
 from rasa.shared.core.constants import (
     USER,
     ACTION_UNLIKELY_INTENT_NAME,
     PREVIOUS_ACTION,
 )
+from rasa.shared.core.domain import State, Domain
+from rasa.shared.core.events import Event, ActionExecuted, UserUttered
+from rasa.shared.core.trackers import DialogueStateTracker
 from rasa.shared.exceptions import RasaException
+from rasa.shared.nlu.constants import TEXT, INTENT, ENTITIES, ACTION_NAME
+from rasa.shared.nlu.training_data.features import Features
 from rasa.utils.tensorflow.constants import LABEL_PAD_ID
 from rasa.utils.tensorflow.model_data import ragged_array_to_ndarray

@@ -64,6 +67,10 @@ def __str__(self) -> Text:
 class TrackerFeaturizer:
     """Base class for actual tracker featurizers."""

+    # Class registry to store all subclasses
+    _registry: ClassVar[Dict[str, Type["TrackerFeaturizer"]]] = {}
+    _featurizer_type: str = "TrackerFeaturizer"
+
     def __init__(
         self, state_featurizer: Optional[SingleStateFeaturizer] = None
     ) -> None:
@@ -74,6 +81,36 @@ def __init__(
         """
         self.state_featurizer = state_featurizer

+    @classmethod
+    def register(cls, featurizer_type: str) -> Callable:
+        """Decorator to register featurizer subclasses."""
+
+        def wrapper(subclass: Type["TrackerFeaturizer"]) -> Type["TrackerFeaturizer"]:
+            cls._registry[featurizer_type] = subclass
+            # Store the type identifier in the class for serialization
+            subclass._featurizer_type = featurizer_type
+            return subclass
+
+        return wrapper
+
+    @classmethod
+    def from_dict(cls, data: Dict[str, Any]) -> "TrackerFeaturizer":
+        """Create featurizer instance from dictionary."""
+        featurizer_type = data.pop("type")
+
+        if featurizer_type not in cls._registry:
+            raise ValueError(f"Unknown featurizer type: {featurizer_type}")
+
+        # Get the correct subclass and instantiate it
+        subclass = cls._registry[featurizer_type]
+        return subclass.create_from_dict(data)
+
+    @classmethod
+    @abstractmethod
+    def create_from_dict(cls, data: Dict[str, Any]) -> "TrackerFeaturizer":
+        """Each subclass must implement its own creation from dict method."""
+        pass
+
     @staticmethod
     def _create_states(
         tracker: DialogueStateTracker,
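
The registry replaces `jsonpickle`'s implicit class lookup with an explicit allow-list that maps a string `type` to a known subclass, so only registered featurizers can ever be instantiated from persisted data. The same pattern in isolation (the class names here are generic placeholders, not Rasa's):

```python
from typing import Any, Callable, ClassVar, Dict, Type


class Serializable:
    """Generic version of the registry pattern used by TrackerFeaturizer."""

    _registry: ClassVar[Dict[str, Type["Serializable"]]] = {}

    @classmethod
    def register(cls, type_name: str) -> Callable:
        def wrapper(subclass: Type["Serializable"]) -> Type["Serializable"]:
            cls._registry[type_name] = subclass  # explicit allow-list entry
            subclass._type_name = type_name
            return subclass

        return wrapper

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "Serializable":
        # Only registered classes can be built here; pickle/jsonpickle, by
        # contrast, will import and reconstruct whatever the payload names.
        return cls._registry[data.pop("type")].create_from_dict(data)


@Serializable.register("Greeter")
class Greeter(Serializable):
    def __init__(self, name: str) -> None:
        self.name = name

    @classmethod
    def create_from_dict(cls, data: Dict[str, Any]) -> "Greeter":
        return cls(name=data["name"])


restored = Serializable.from_dict({"type": "Greeter", "name": "world"})
assert isinstance(restored, Greeter) and restored.name == "world"
```
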
@@ -465,9 +502,7 @@ def persist(self, path: Union[Text, Path]) -> None:
             self.state_featurizer.entity_tag_specs = []

         # noinspection PyTypeChecker
-        rasa.shared.utils.io.write_text_file(
-            str(jsonpickle.encode(self)), featurizer_file
-        )
+        rasa.shared.utils.io.dump_obj_as_json_to_file(featurizer_file, self.to_dict())

     @staticmethod
     def load(path: Union[Text, Path]) -> Optional[TrackerFeaturizer]:
@@ -481,7 +516,17 @@ def load(path: Union[Text, Path]) -> Optional[TrackerFeaturizer]:
         """
         featurizer_file = Path(path) / FEATURIZER_FILE
         if featurizer_file.is_file():
-            return jsonpickle.decode(rasa.shared.utils.io.read_file(featurizer_file))
+            data = rasa.shared.utils.io.read_json_file(featurizer_file)
+
+            if "type" not in data:
+                logger.error(
+                    f"Couldn't load featurizer for policy. "
+                    f"File '{featurizer_file}' does not contain all "
+                    f"necessary information. 'type' is missing."
+                )
+                return None
+
+            return TrackerFeaturizer.from_dict(data)

         logger.error(
             f"Couldn't load featurizer for policy. "
@@ -508,7 +553,16 @@ def _remove_action_unlikely_intent_from_events(events: List[Event]
             )
         ]

+    def to_dict(self) -> Dict[str, Any]:
+        return {
+            "type": self.__class__._featurizer_type,
+            "state_featurizer": (
+                self.state_featurizer.to_dict() if self.state_featurizer else None
+            ),
+        }
+

+@TrackerFeaturizer.register("FullDialogueTrackerFeaturizer")
 class FullDialogueTrackerFeaturizer(TrackerFeaturizer):
     """Creates full dialogue training data for time distributed architectures.

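
With this change the persisted featurizer is a plain JSON document instead of a jsonpickle blob, and `load` rejects any file without a `type` key. A hedged sketch of that flow (the directory name is made up; `FEATURIZER_FILE` is the constant the module itself uses for the file name):

```python
import json
from pathlib import Path

from rasa.core.featurizers.tracker_featurizers import (
    FEATURIZER_FILE,
    MaxHistoryTrackerFeaturizer,
    TrackerFeaturizer,
)

policy_dir = Path("models/policy_0_TEDPolicy")   # hypothetical location
policy_dir.mkdir(parents=True, exist_ok=True)

# Roughly what persist() now writes: plain data plus a "type" discriminator.
payload = {
    "type": "MaxHistoryTrackerFeaturizer",
    "state_featurizer": None,
    "max_history": 5,
    "remove_duplicates": True,
}
(policy_dir / FEATURIZER_FILE).write_text(json.dumps(payload))

featurizer = TrackerFeaturizer.load(policy_dir)  # dispatches via from_dict
assert isinstance(featurizer, MaxHistoryTrackerFeaturizer)
```
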
@@ -646,7 +700,20 @@ def prediction_states(

         return trackers_as_states

+    def to_dict(self) -> Dict[str, Any]:
+        return super().to_dict()

+    @classmethod
+    def create_from_dict(cls, data: Dict[str, Any]) -> "FullDialogueTrackerFeaturizer":
+        state_featurizer = SingleStateFeaturizer.create_from_dict(
+            data["state_featurizer"]
+        )
+        return cls(
+            state_featurizer,
+        )
+
+
+@TrackerFeaturizer.register("MaxHistoryTrackerFeaturizer")
 class MaxHistoryTrackerFeaturizer(TrackerFeaturizer):
     """Truncates the tracker history into `max_history` long sequences.

@@ -887,7 +954,25 @@ def prediction_states(

         return trackers_as_states

+    def to_dict(self) -> Dict[str, Any]:
+        data = super().to_dict()
+        data.update(
+            {
+                "remove_duplicates": self.remove_duplicates,
+                "max_history": self.max_history,
+            }
+        )
+        return data
+
+    @classmethod
+    def create_from_dict(cls, data: Dict[str, Any]) -> "MaxHistoryTrackerFeaturizer":
+        state_featurizer = SingleStateFeaturizer.create_from_dict(
+            data["state_featurizer"]
+        )
+        return cls(state_featurizer, data["max_history"], data["remove_duplicates"])

+
+@TrackerFeaturizer.register("IntentMaxHistoryTrackerFeaturizer")
 class IntentMaxHistoryTrackerFeaturizer(MaxHistoryTrackerFeaturizer):
     """Truncates the tracker history into `max_history` long sequences.

@@ -1166,6 +1251,18 @@ def prediction_states(

         return trackers_as_states

+    def to_dict(self) -> Dict[str, Any]:
+        return super().to_dict()
+
+    @classmethod
+    def create_from_dict(
+        cls, data: Dict[str, Any]
+    ) -> "IntentMaxHistoryTrackerFeaturizer":
+        state_featurizer = SingleStateFeaturizer.create_from_dict(
+            data["state_featurizer"]
+        )
+        return cls(state_featurizer, data["max_history"], data["remove_duplicates"])
+

 def _is_prev_action_unlikely_intent_in_state(state: State) -> bool:
     prev_action_name = state.get(PREVIOUS_ACTION, {}).get(ACTION_NAME)
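
For anyone maintaining a custom tracker featurizer, the changelog's "update your code" note boils down to registering the subclass and implementing the `to_dict`/`create_from_dict` pair. A hedged sketch (the `MyWindowedFeaturizer` class and its `window` parameter are hypothetical):

```python
from typing import Any, Dict, Optional

from rasa.core.featurizers.single_state_featurizer import SingleStateFeaturizer
from rasa.core.featurizers.tracker_featurizers import TrackerFeaturizer


@TrackerFeaturizer.register("MyWindowedFeaturizer")
class MyWindowedFeaturizer(TrackerFeaturizer):
    """Hypothetical custom featurizer, shown only to illustrate the new hooks."""

    def __init__(
        self,
        state_featurizer: Optional[SingleStateFeaturizer] = None,
        window: int = 3,
    ) -> None:
        super().__init__(state_featurizer)
        self.window = window

    def to_dict(self) -> Dict[str, Any]:
        data = super().to_dict()  # contributes "type" and "state_featurizer"
        data["window"] = self.window
        return data

    @classmethod
    def create_from_dict(cls, data: Dict[str, Any]) -> "MyWindowedFeaturizer":
        state_featurizer = SingleStateFeaturizer.create_from_dict(
            data["state_featurizer"]
        )
        return cls(state_featurizer, window=data["window"])


# Round trip through the registry-backed dispatcher.
original = MyWindowedFeaturizer(window=7)
restored = TrackerFeaturizer.from_dict(original.to_dict())
assert isinstance(restored, MyWindowedFeaturizer) and restored.window == 7
```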
