
Add support for collecting DHCP metrics from Kea Control Agent #2937

Draft: wants to merge 77 commits into master

Conversation

@jorund1 (Collaborator) commented on Jul 1, 2024

Implements #2931

Uses Python 3.9 type hints.

                      HTTP                       IPC
KeaDhcpMetricSource <------> Kea Control Agent <=====> Kea DHCP4 server / Kea DHCP6 server

Defines the KeaDhcpMetricSource class and its superclass DhcpMetricSource, whose methods fetch_metrics and fetch_metrics_to_graphite can be used to fetch metrics from a Kea DHCP server controlled by a Kea Control Agent and send these metrics to a Graphite server. Example usage:

from nav.dhcp.kea_metrics import KeaDhcpMetricSource
import time

KEA_CTRL_AGENT_ADDR = "2001:db8::1"
KEA_CTRL_AGENT_PORT = 443

GRAPHITE_ADDR = "2001:db8::2"
GRAPHITE_PORT = 2003

# Collects metrics from the Kea DHCP4 server that the Kea Control
# Agent is configured to control
source = KeaDhcpMetricSource(
    address=KEA_CTRL_AGENT_ADDR,
    port=KEA_CTRL_AGENT_PORT,
    dhcp_version=4
)

while True:
    source.fetch_metrics_to_graphite(
        address=GRAPHITE_ADDR,
        port=GRAPHITE_PORT
    )
    time.sleep(600)

The other classes and functions defined here are helpers and are not really meant to be used by other parts of NAV.

Todo:

  • Fix all TODO code comments.
  • Add support in KeaDhcpConfig.from_json for storing the configuration hash supplied in the json obtained from config-get queries.
  • Full test coverage:
    • Test handling of config-hash-get queries in KeaDhcpMetricSource.fetch_and_set_dhcp_config and KeaDhcpMetricSource.fetch_dhcp_config_hash
    • Test handling of statistic-get queries in KeaDhcpMetricSource.fetch_metrics
    • Add more Kea DHCP server configuration example strings to test against, including DHCP6 configuration example strings.

@CLAassistant commented on Jul 1, 2024

CLA assistant check
All committers have signed the CLA.

codecov bot commented on Jul 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 56.65%. Comparing base (8d039e0) to head (d3faa81).
Report is 16 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2937      +/-   ##
==========================================
+ Coverage   56.58%   56.65%   +0.07%     
==========================================
  Files         602      604       +2     
  Lines       43729    43890     +161     
  Branches       48       48              
==========================================
+ Hits        24744    24868     +124     
- Misses      18973    19010      +37     
  Partials       12       12              


@lunkwill42 (Member) commented

The tests are failing on Python 3.7. I just merged #2901 to get rid of Python 3.7 from the default test matrix, since NAV 5.10 is out and 5.11 will drop support for Python 3.7.

You may have to rebase this on the latest master branch commit to get the tests working on GitHub, @jorund1

@lunkwill42 (Member) left a review

Thank you, @jorund1 !

I have not read and understood the code in full detail, but I have some generic feedback.

Since this is your first contribution, I'd like to mention that I am a proponent of the "step-down rule". Full quote:

We want the code to read like a top-down narrative. We want every function to be followed by those at the next level of abstraction so that we can read the program, descending one level of abstraction at a time as we read down the list of functions. I call this The Stepdown Rule.

To say this differently, we want to be able to read the program as though it were a set of TO paragraphs, each of which is describing the current level of abstraction and referencing subsequent TO paragraphs at the next level down.

To include the setups and teardowns, we include setups, then we include the test page content, and then we include the teardowns.

To include the setups, we include the suite setup if this is a suite, then we include the regular setup.

To include the suite setup, we search the parent hierarchy for the "SuiteSetUp" page and add an include statement with the path of that page.

To search the parent...

It turns out to be very difficult for programmers to learn to follow this rule and write functions that stay at a single level of abstraction. But learning this trick is also very important. It is the key to keeping functions short and making sure they do "one thing." Making the code read like a top-down set of TO paragraphs is an effective technique for keeping the abstraction level consistent.

In general, in kea_metrics_test.py, I would like to see the actual tests first in the code file, with fixtures and helpers further towards the bottom. The only exception is when the programming language makes this ordering impossible, which can happen in Python, since it's not a pre-compiled language.
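
For illustration, a minimal sketch of that layout (the test and fixture names here are made up, not taken from the PR); pytest resolves fixtures by name at run time, so defining them below the tests works fine:

import pytest


# Tests first...
def test_when_the_queue_is_empty_it_should_yield_no_metrics(empty_queue):
    assert list(empty_queue) == []


# ...fixtures and helpers towards the bottom.
@pytest.fixture
def empty_queue():
    return iter([])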

Other than that, I see you have written some code in order to validate JSON config data from Kea. At this point, we might consider pulling Pydantic into NAV as well; I think a lot of this code would disappear with it. We use Pydantic > 2 for validation of JSON data in several of the other projects our team maintains. It's not core to what you're really working on, but I at least urge you to have a look at the library :)
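
As a rough sketch of the idea (not code from this PR; the model names are invented and the fields are trimmed down), a Pydantic 2 model for parts of a Kea "config-get" response could look like this, with model_validate replacing much of the manual validation:

from pydantic import BaseModel, Field


class KeaSubnet4(BaseModel):
    id: int
    subnet: str  # e.g. "192.0.2.0/24"
    pools: list[dict] = Field(default_factory=list)


class KeaDhcp4Section(BaseModel):
    subnet4: list[KeaSubnet4] = Field(default_factory=list)


class KeaConfigGet(BaseModel):
    dhcp4: KeaDhcp4Section = Field(alias="Dhcp4")


# config = KeaConfigGet.model_validate(json.loads(raw_config_get_response))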

Comment on lines 45 to 48
fmt = str.maketrans({".": "_", "/": "_"})  # 192.0.2.0/24 --> 192_0_2_0_24
graphite_metrics = []
for metric in self.fetch_metrics():
    graphite_path = f"{self.graphite_prefix}.{str(metric.subnet_prefix).translate(fmt)}.{metric.key}"
lunkwill42 (Member):
Since this is already being based on the NAV codebase, you may be better off just using nav.metrics.names.escape_metric_name() rather than building your own translation table here.
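
Roughly what that suggestion amounts to (a sketch, not a tested diff):

from nav.metrics.names import escape_metric_name

graphite_path = f"{self.graphite_prefix}.{escape_metric_name(str(metric.subnet_prefix))}.{metric.key}"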

################################################################################


def test_success_responses_does_succeed(success_response, enqueue_post_response):
lunkwill42 (Member):
Suggested change
def test_success_responses_does_succeed(success_response, enqueue_post_response):
def test_success_responses_should_succeed(success_response, enqueue_post_response):

assert isinstance(response.service, str)


def test_error_responses_does_not_succeed(error_response, enqueue_post_response):
lunkwill42 (Member):
Suggested change
def test_error_responses_does_not_succeed(error_response, enqueue_post_response):
def test_error_responses_should_not_succeed(error_response, enqueue_post_response):

assert isinstance(response.service, str)


def test_invalid_json_responses_raises_jsonerror(
lunkwill42 (Member):
Suggested change
def test_invalid_json_responses_raises_jsonerror(
def test_invalid_json_responses_should_raise_jsonerror(

Comment on lines 305 to 307
################################################################################
# Testing KeaDhcpSubnet and KeaDhcpConfig instantiation from json #
################################################################################
lunkwill42 (Member):
You should potentially consider grouping tests for individual classes by using classes, e.g.

class KeaDhcpSubnetTest:
    def test_dhcp4_subnet4_config_should_be_parsed_correctly(self, dhcp4_config):
        ...

Comment on lines 310 to 331
def test_correct_subnet_from_dhcp4_config_json(dhcp4_config):
    j = json.loads(dhcp4_config)
    subnet = KeaDhcpSubnet.from_json(j["Dhcp4"]["subnet4"][0])
    assert subnet.id == 1
    assert subnet.prefix == IP("192.0.0.0/8")
    assert len(subnet.pools) == 2
    assert subnet.pools[0] == (IP("192.1.0.1"), IP("192.1.0.200"))
    assert subnet.pools[1] == (IP("192.3.0.1"), IP("192.3.0.200"))


def test_correct_config_from_dhcp4_config_json(dhcp4_config):
    j = json.loads(dhcp4_config)
    config = KeaDhcpConfig.from_json(j)
    assert len(config.subnets) == 1
    subnet = config.subnets[0]
    assert subnet.id == 1
    assert subnet.prefix == IP("192.0.0.0/8")
    assert len(subnet.pools) == 2
    assert subnet.pools[0] == (IP("192.1.0.1"), IP("192.1.0.200"))
    assert subnet.pools[1] == (IP("192.3.0.1"), IP("192.3.0.200"))
    assert config.dhcp_version == 4
    assert config.config_hash is None
lunkwill42 (Member):
Not entirely sure what these tests are really testing for, so I'm having a hard time suggesting alternate names here:

I generally prefer that tests are named very explicitly so that when I'm looking at a test report and see a failing test, I won't have to dig deeply into the test code itself to figure out what requirement it was that actually failed.

Names along the lines of "when x happens y should do z" (or, if grouping tests in classes to test individual functions or classes in multiple ways, "when x happens it should do z")
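
Purely as an illustration of that pattern (these names are invented, not suggestions for specific tests in the PR):

class TestKeaDhcpSubnet:
    def test_when_config_has_two_pools_it_should_parse_both(self, dhcp4_config):
        ...

    def test_when_subnet_id_is_missing_it_should_raise_an_error(self):
        ...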

@jorund1 (Collaborator, Author) replied on Jul 2, 2024:

Thank you for the helpful input. Make sure to continue pointing out any conventions, guidelines or modules I've missed out on!

Regarding the naming of test functions: I agree that the function names are unclear. I've not followed any convention, but in those places where I'm testing a single function, the test function's name has usually been some combination of fail|correct|invalid, the function name, and the function input. I'll begin to name my test functions in such a way that the name becomes a summary of the test's intention.

Regarding Pydantic: there's definitely a use case for Pydantic here. Validation takes up big portions of the code and is not fun to read. Some manual processing must alas be done as long as we use config-get instead of the more tailored, but not universally available, subnet4-list command of the API to fetch subnet information.

@lunkwill42 requested a review from stveit on July 2, 2024 at 13:15
def fetch_metrics_to_graphite(self, host, port):
    graphite_metrics = []
    for metric in self.fetch_metrics():
        graphite_path = f"{self.graphite_prefix}.{escape_metric_name(metric.subnet_prefix.strNormal())}.{metric.key}"
Contributor:
If the frontend is supposed to use this path to generate a graph, then I imagine the path format should be defined somewhere else, probably in python/nav/metrics/templates like #2405 does.
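
A hypothetical sketch of that idea (the function name, its placement and the "nav.dhcp" prefix are all made up here): define the path format once, e.g. alongside the other templates in python/nav/metrics/templates, and have both the collector and the frontend call it:

from nav.metrics.names import escape_metric_name


def metric_path_for_dhcp_subnet(subnet_prefix: str, key: str) -> str:
    """Return the Graphite path used for a DHCP metric of a given subnet."""
    return f"nav.dhcp.{escape_metric_name(subnet_prefix)}.{key}"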

netprefix,
kea_name,
)
for val, t in timeseries:
Contributor:
If t stands for time then you might as well call it time or timestamp or something more explicit, and just call val value to be even more explicit.
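
In other words (sketch):

for value, timestamp in timeseries:
    ...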


def _send_query(self, session: requests.Session, command: str, **kwargs) -> dict:
    """
    Send `command` to the Kea Control Agent. An exception is raised iff
Contributor:
Suggested change
Send `command` to the Kea Control Agent. An exception is raised iff
Send `command` to the Kea Control Agent. An exception is raised if

f"a query (responded with: {responses!r})"
)
if not (len(responses) == 1 and "result" in responses[0]):
# "We've only sent the command to *one* service. Thus responses should contain *one* response."
Contributor:
quotes?

Comment on lines +91 to +100
def test_all_responses_is_empty_but_valid_should_yield_no_metrics(
    valid_dhcp4, responsequeue
):
    """
    If the Kea DHCP server we query does not have any subnets configured, the
    correct thing to do is to return an empty iterable (as opposed to failing).

    Likewise, if it returns no statistics for its configured subnets, the
    correct thing to do is to return an empty iterable.
    """
Contributor:
Can this be split into two tests? One for "no subnets configured" and one for "no statistics for configured subnets"?
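
One possible shape for that split (a sketch only; the bodies and the exact fixture behaviour are not taken from the PR):

def test_when_no_subnets_are_configured_it_should_yield_no_metrics(
    valid_dhcp4, responsequeue
):
    ...


def test_when_subnets_have_no_statistics_it_should_yield_no_metrics(
    valid_dhcp4, responsequeue
):
    ...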

@jorund1 force-pushed the kea-ctrl-agent-metrics branch 2 times, most recently from 4c8f756 to 78607bf on July 30, 2024

Why:
  The main metrics one wishes to obtain from a DHCP server of a particular
  type (Kea DHCP, ISC DHCP, udhcpd, etc.) are the same across the
  board, and thus the methods that process these metrics (e.g. sending
  them to a graphite server, creating a canonical graphite path for a
  specific type of metric, etc.) are better off being defined once in a
  superclass.
KeaDhcpMetricSource is an implementation of DhcpMetricSource which
collects the four metrics defined in the DhcpMetricKeys enum (number
of total, used, free and touched addresses) for each VLAN of a Kea
DHCP server.

Why:
  I'll need to check more carefully how to obtain the amount of free
  addresses in a subnet; it probably must be calculated since Kea
  doesn't seem to supply it.
Need to see how to deal with shared networks that might also be
defined. Should be exactly the same as for subnets, since a
shared network really is just a uniquely named list of subnets.

Why:
    This is build-up for an upcoming commit that implements caching
    of self.kea_dhcp_config in KeaDhcpMetricSource; every time we fetch
    a new kea_dhcp_config with fetch_dhcp_config(), we would like to
    store it as well - hence the name change of the function.

    The upcoming commit will then make use of the Kea Control Agent's
    `config-hash-get` command (included in Kea versions >= 2.4.0) to
    check if we need to update the cached config or not whenever
    set_and_fetch_dhcp_config() is called.
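
A minimal sketch of the caching idea described above (not the PR's actual implementation; anything beyond the names mentioned in this conversation is an assumption):

def fetch_and_set_dhcp_config(self) -> KeaDhcpConfig:
    # If we already have a config, ask the Kea Control Agent for the current
    # config hash and keep the cached config if the hash is unchanged.
    if self.kea_dhcp_config is not None:
        current_hash = self.fetch_dhcp_config_hash()  # sends "config-hash-get"
        if current_hash == self.kea_dhcp_config.config_hash:
            return self.kea_dhcp_config
    # Otherwise (or if the hash differs), fetch and cache a fresh config.
    self.kea_dhcp_config = self.fetch_dhcp_config()  # sends "config-get"
    return self.kea_dhcp_config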
What:
    In addition to updating function names etc., I've also updated the
    mocking of the requests.post requests in the test script so that
    we can give different responses based on the Kea Control Agent
    command the kea_dhcp_data script sends to the Kea Control Agent
    server with its requests.post calls.

For now, I'll just call this new function unwrap(), since all of
the uses of send_query() expect a list of one response and the first
thing that is done is always to unwrap the singleton response
list. unwrap() could be made to have generic typing, but for
simplicity it uses KeaResponse instead of a generic TypeVar('T') for
now.
Also fixes a typo on line 342 where an extra comma had sneaked
itself into the code.
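
A rough sketch of what such an unwrap() helper could look like (the exact signature and the exception type in the PR may differ):

def unwrap(responses: list[KeaResponse]) -> KeaResponse:
    """Return the single response of a singleton response list."""
    if len(responses) != 1:
        # We only sent the command to one service, so anything other than
        # exactly one response is unexpected.
        raise ValueError(f"expected exactly one response, got {responses!r}")
    return responses[0]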
What:
    Before, we only included subnets defined in the subnet[4,6]
    section of the Kea DHCP config obtained through the "config-get"
    query in the KeaDhcpConfig.subnets list. Now we also include the
    subnets defined in the shared-networks section of the Kea DHCP
    config, and thus we include all subnets that could possibly be
    configured for a Kea DHCP server, which means that we can now
    fetch metrics from all defined subnets.
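
A sketch of how subnets can be collected from both places in a Kea DHCP4 config (the helper name is illustrative, not the PR's code; the same idea applies to "subnet6"):

def all_subnets(dhcp4_config: dict) -> list[dict]:
    # Top-level subnets...
    subnets = list(dhcp4_config.get("subnet4", []))
    # ...plus subnets nested inside shared networks, which are essentially
    # uniquely named lists of subnets.
    for shared_network in dhcp4_config.get("shared-networks", []):
        subnets.extend(shared_network.get("subnet4", []))
    return subnets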
Why:
    Datetime instances are the most precise, and can easily be
    converted to unix timestamps if need be. Datetime instances also
    make it easy to work with timezone differences, which we sadly
    seem to have to care about since the Kea Control Agent doesn't
    provide timezone data along with its timestamps.
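
For example (a sketch; the exact timestamp format returned by the Kea Control Agent is an assumption here), a naive Kea timestamp can be interpreted as local time and converted to a unix timestamp like this:

from datetime import datetime

raw = "2024-07-30 10:11:19.498555"  # assumed format, no timezone info
when = datetime.fromisoformat(raw).astimezone()  # attach the local timezone
unix_timestamp = when.timestamp()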
Why:
    If we don't obtain the config, we do not have enough information
    to start fetching subnet metrics; in this case there is no way to move forward
    and there's no hope of obtaining some useful data by continuing
    the call.

What:
    Prior to this commit, the key for a specific metric (i.e. the name
    of that metric used by NAV) had the same naming convention as
    dhcpd-pools(1), e.g. "cur" was the name used for the "amount of
    addresses currently assigned to dhcp clients on this subnet" and
    "max" was the name used for the "total amount of addresses
    controlled by this subnet". DhcpMetricKey.CUR and
    DhcpMetricKey.MAX, however, are not very descriptive, so I changed
    the key names to be DhcpMetricKey.TOTAL and
    DhcpMetricKey.ASSIGNED. DhcpMetricKey.TOUCH was removed
    altogether because it seems to me like this is not a common metric
    to be reported by DHCP servers (dhcpd-pools(1) uses "touch" to
    mean the number of assigned addresses that have timed out but
    are not yet marked as re-assignable by the DHCP server).
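
A minimal sketch of what the renamed enum could look like (only the member names come from this commit; the values are assumptions):

from enum import Enum


class DhcpMetricKey(Enum):
    TOTAL = "total"        # total number of addresses managed by a subnet
    ASSIGNED = "assigned"  # addresses currently assigned to DHCP clients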
What:
    Modify the docstrings so that they all follow the same pattern;
    i.e. they all begin by describing what they return.

Why:
    In the future, one might want to include sensitive information,
    such as passwords or tokens, in requests, and a response from Kea
    might contain secrets, especially with regard to "config-get"
    responses, where a config might contain passwords.

What:
    Before this change, our mocking functionality only allowed for
    setting the string of the response object that is returned by
    requests.post() and requests.Session().post(), by using

    responsequeue.add("<command-name>", "<returned-string>")
    responsequeue.autofill("dhcp<4 or 6>", config_to_return, statistics_to_return)

    This commit adds an extra parameter to both of these
    functions, the `attrs` parameter:

    responsequeue.add("<command-name>", "<returned-string>", attrs={})
    responsequeue.autofill("dhcp<4 or 6>", config_to_return, statistics_to_return, attrs={})

    If `attrs` = {"myattr": "myval"}, then the requests.Response()
    object returned by any call to requests.post() or
    requests.Session().post() will have the attribute "myattr":

    attrs = {"myattr": "myval"}
    responsequeue.add("<command-name>", "<returned-string>", attrs)
    response = requests.post(...)
    assert response.myattr == "myval"
(this is what the graphite exporting function expects)