feat(data-manager): Added a data manager class #81

talolard · 2020-04-03T12:10:16Z

Per #80 made a DataManager class.
Probably easiest to get a sense of it through the examples, but highlight is that users don't need to do bookeeping because they can just do

        for ix in indices_to_label:
            original_ix = manager.get_original_index_from_unlabeled_index(ix)
            y = original_labels_train[original_ix]
            label = (ix, y)
            labels.append(label)
        manager.add_labels(labels)
        # clear_output()
    learner.teach(manager.labeled, manager.labels)

And the manager keeps everything in sync.

I bet the API can be improved, but it's a solid and convenient start. Looking forward to feedback

cosmic-cortex · 2020-05-09T12:15:54Z

@talolard Hi! I am really sorry, but I forgot to review the PR :( In the last month I was working crazy hard to launch a new startup and a new product, so I set everything aside. If it is still fine by you, I'll review it now and add it.

cosmic-cortex

I have highlighted some issues which should be changed. Otherwise, I don't have any suggestions to improve the API at the moment, it feels like intuitive for me.

cosmic-cortex · 2020-05-09T12:38:00Z

modAL/datamanager.py

+        """
+        This is where the magic happens.
+        We take self.unlabeled_mask.nonzero()[0] which gives us an array of the indices that appear in the unlabeled
+        data. So if the original label was at position 0 we look up the "real index" in the unlabeled_indices array to
+        get it's true index
+        :param labels:
+        :return:
+        """


Can you rewrite the docstrings in google format? (https://www.sphinx-doc.org/en/master/usage/extensions/example_google.html) This is what is used throughout modAL, and it would be best to keep them consistent.

cosmic-cortex · 2020-05-09T12:49:16Z

modAL/datamanager.py

+    def _update_masks(self, labels: Union[LabelList, Label]):
+        for label in labels:
+            self.labeled_mask[label[0]] = True


This won't work if labels is a single Label. I suggest to always require a list of labels.

@talolard Hi! I am really sorry, but I forgot to review the PR :( In the last month I was working crazy hard to launch a new startup and a new product, so I set everything aside. If it is still fine by you, I'll review it now and add it.

Hi, no worries.
I'm moving this into it's own package and will update docs here with an example using it. Hope the launch went well

Cool! Let me know, when the package is available!

cosmic-cortex · 2020-05-09T12:58:28Z

modAL/datamanager.py

+            self._labels[label[0]] = label[1]
+
+    @property
+    def unlabeld(self):


This is probably a typo, should be unlabeled.

cosmic-cortex · 2020-05-09T13:00:31Z

modAL/datamanager.py

+            raise Exception(
+                "Malformed input. Please add either a tuple (ix,label) or a list [(ix,label),..]"
+            )


This should be a TypeError instead of a generic exception, since it perfectly fits the scope of TypeError.

cosmic-cortex · 2020-05-09T13:03:57Z

modAL/datamanager.py

+        correctLabels: LabelList = []
+        unlabeled_indices = self.unlabeled_mask.nonzero()[0]
+
+        for label in labels:
+            newIndex = unlabeled_indices[label[0]]
+            newLabel: Label = (newIndex, label[1])
+            correctLabels.append(newLabel)


Can you rename the variables from camelCase to use lower_and_under conventions?

cosmic-cortex · 2020-05-09T13:05:27Z

modAL/datamanager.py

+        for label in labels:
+            self.labeled_mask[label[0]] = True
+
+    def _offset_new_labes(self, labels: LabelList):


This is just a typo, can you change the function name to _offset_new_labels?

cosmic-cortex · 2020-05-09T13:17:37Z

modAL/datamanager.py

+    @property
+    def labels(self):
+        '''
+
+        Returns the labels indexed with respect to LD
+
+        '''
+        return self._labels[self.labeled_mask]


This can change the data type of labels, if labels_dtype is not explicitly given upon initialization. How about adding a new attribute to the class where we store the label_dtype? This can be None by default and set during the add_labels method, so this method can return self._labels[self.labeled_mask].astype(self.label_dtype). What do you think?

cosmic-cortex · 2020-05-09T13:31:26Z

modAL/datamanager.py

+        elif isinstance(labels, list):
+            pass
+        else:
+            raise Exception(
+                "Malformed input. Please add either a tuple (ix,label) or a list [(ix,label),..]"
+            )


I would write the following here:

elif not isinstance(labels, Container): raise TypeError("Malformed input. Please add either a tuple (ix,label) or a list [(ix,label),..]")

where Collections is from collections.abc.

cosmic-cortex · 2020-05-09T13:33:42Z

modAL/datamanager.py

+
+        :param features: An array of the features that will be used for AL.
+        :param labels: Any prexesiting labels. Each label is a tuple(idx,label)
+        :param source: A list of the original data


I don't really understand what sources is used for, can you explain it in more detail?

In any case, it is not used by any of the methods (except remaining_sources), so it seems to me that it can be removed.

cosmic-cortex · 2020-05-09T13:38:46Z

examples/data_manager_and_text_classification.py

+        for index in range(1):
+            # query the learner as usual, in this case we are using a batch learning strategy
+            # so indices_to_label is an array
+            indices_to_label, query_instance = learner.query(manager.unlabeld)
+            labels = []  # Hold a list of the new labels
+            for ix in indices_to_label:
+                """
+                Here is the tricky part that the manager solves. The indicies are indexed with respect to unlabeled data
+                but we want to work with them with respect to the original data. The manager makes this almost transparent
+                """
+                # Map the index that is with respect to unlabeled data back to an index with respect to the whole dataset
+                original_ix = manager.get_original_index_from_unlabeled_index(ix)
+                # print(manager.sources[original_ix]) #Show the original data so we can decide what to label
+                # Now we can lookup the label in the original set of labels without any bookkeeping
+                y = original_labels_train[original_ix]
+                # We create a Label instance, a tuple of index and label
+                # The index should be with respect to the unlabeled data, the add_labels function will automatically
+                # calculate the offsets
+                label = (ix, y)
+                # append the labels to a list
+                labels.append(label)
+            # Insert them all at once.
+            manager.add_labels(labels)
+            # Note that if you need to add labels with indicies that repsect the original dataset you can do
+            # manager.add_labels(labels,offset_to_unlabeled=False)


Why do you put this into a for block? the index variable is not used and the iterator which you loop through has a single element.

codecov-io · 2021-02-08T00:09:08Z

Codecov Report

Merging #81 (3281990) into dev (1e23453) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##              dev      #81   +/-   ##
=======================================
  Coverage   97.19%   97.19%           
=======================================
  Files          31       31           
  Lines        1641     1641           
=======================================
  Hits         1595     1595           
  Misses         46       46

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1e23453...3281990. Read the comment docs.

feat(data-manager): Added a data manager class

3281990

talolard closed this May 9, 2020

cosmic-cortex reopened this May 9, 2020

cosmic-cortex reviewed May 9, 2020

View reviewed changes

cosmic-cortex mentioned this pull request Aug 4, 2020

Why teach method fits the data? #95

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(data-manager): Added a data manager class #81

feat(data-manager): Added a data manager class #81

talolard commented Apr 3, 2020

cosmic-cortex commented May 9, 2020

cosmic-cortex left a comment

cosmic-cortex May 9, 2020

cosmic-cortex May 9, 2020

talolard May 10, 2020

cosmic-cortex May 14, 2020

cosmic-cortex May 9, 2020

cosmic-cortex May 9, 2020

cosmic-cortex May 9, 2020

cosmic-cortex May 9, 2020

cosmic-cortex May 9, 2020

cosmic-cortex May 9, 2020

cosmic-cortex May 9, 2020

cosmic-cortex May 9, 2020

codecov-io commented Feb 8, 2021

feat(data-manager): Added a data manager class #81

Are you sure you want to change the base?

feat(data-manager): Added a data manager class #81

Conversation

talolard commented Apr 3, 2020

cosmic-cortex commented May 9, 2020

cosmic-cortex left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov-io commented Feb 8, 2021

Codecov Report