Increasing search time when using page and limit #1309

@christianskou07

Description

If one wants to fetch all attributes of a certain type, the recommended approach seems to be to use the page and limit parameters, iterating through pages until the number of returned attributes is less than limit.

As an example, I have more than 5 million attributes of type md5 in my instance and want to fetch all of them.

From my experiments, the per-page search time grows as the page number increases, regardless of the value of limit. I measured it as follows:

import time

def fetch_attributes(self):
    response_count = 1
    limit = 20000
    page = 1
    sum_attr = 0
    sum_time = 0.0
    while response_count > 0:
        t0 = time.time()
        attributes = self.client.search(controller="attributes", return_format="json", type_attribute=["md5"], page=page, limit=limit)
        elapsed = time.time() - t0
        response_count = len(attributes["Attribute"])
        page += 1
        sum_attr += response_count
        sum_time += elapsed
        print(f"fetched {response_count} attributes in {elapsed}, sum_attr = {sum_attr}, sum_time = {sum_time}")

Example output (see attachment for full output):

fetched 20000 attributes in 4.9379072189331055, sum_attr = 20000, sum_time = 4.9379072189331055
fetched 20000 attributes in 4.651666879653931, sum_attr = 40000, sum_time = 9.589574098587036
fetched 20000 attributes in 4.8340137004852295, sum_attr = 60000, sum_time = 14.423587799072266
fetched 20000 attributes in 3.9235310554504395, sum_attr = 80000, sum_time = 18.347118854522705
fetched 20000 attributes in 4.641859292984009, sum_attr = 100000, sum_time = 22.988978147506714
...
fetched 20000 attributes in 12.544357299804688, sum_attr = 1380000, sum_time = 558.8374326229095
fetched 20000 attributes in 11.658548831939697, sum_attr = 1400000, sum_time = 570.4959814548492
fetched 20000 attributes in 12.361718893051147, sum_attr = 1420000, sum_time = 582.8577003479004
fetched 20000 attributes in 13.921313285827637, sum_attr = 1440000, sum_time = 596.779013633728

Please do not pay attention to the actual values, but rather to the clear trend: queries take longer and longer as the pages are iterated.
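For what it's worth, this trend is exactly what LIMIT/OFFSET pagination produces in general: the database has to walk past all the skipped rows on every page, so page n costs roughly O(n * limit). I am assuming here that MISP's page/limit maps to SQL OFFSET internally; if so, that alone would explain the slowdown without any resource exhaustion. A minimal SQLite sketch of the effect, comparing offset paging with keyset paging on an indexed id column (the table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE attributes (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO attributes (value) VALUES (?)",
                 ((f"md5-{i}",) for i in range(200_000)))

limit = 10_000

# Offset paging: the engine scans and discards `offset` rows on every page,
# so later pages do more and more work.
def offset_page(page):
    return conn.execute(
        "SELECT id, value FROM attributes ORDER BY id LIMIT ? OFFSET ?",
        (limit, (page - 1) * limit)).fetchall()

# Keyset paging: seek straight to the last id seen via the primary-key index,
# so cost per page stays flat.
def keyset_page(last_id):
    return conn.execute(
        "SELECT id, value FROM attributes WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, limit)).fetchall()

# Both schemes enumerate exactly the same rows.
offset_rows = []
page = 1
while True:
    rows = offset_page(page)
    if not rows:
        break
    offset_rows.extend(rows)
    page += 1

keyset_rows = []
last_id = 0
while True:
    rows = keyset_page(last_id)
    if not rows:
        break
    keyset_rows.extend(rows)
    last_id = rows[-1][0]

assert offset_rows == keyset_rows
```

Timing the two loops on a larger table shows the offset variant degrading per page while the keyset variant stays roughly constant, which matches the trend in my output above.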

Below I have included some of the configuration parameters I found relevant. From htop it does not look like CPU or memory resources are exhausted, although I am not an expert at interpreting it.

PHP memory_limit is set to 8192 MB.

/etc/my.cnf.d/server.cnf:

...
[mysqld]
datadir=/data/mysql-data
innodb_buffer_pool_size=4G
innodb_io_capacity=1000
innodb_log_file_size=600MB
innodb_read_io_threads=16
...

MISP version 2.4.198
PyMISP version 2.5.1
MariaDB version 11.4.3

Full print output:
output.txt

Please let me know if you need any more information, or if this issue belongs in the MISP project instead.
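In case it is useful context: a workaround I have seen suggested for deep pagination in general is to split the query into timestamp windows (search accepts a timestamp filter, if I read the PyMISP docs correctly), so that each window only needs a few shallow pages. A rough sketch, untested against a real instance; `search_fn` is a hypothetical stand-in for `self.client.search` so the paging logic can be exercised without a server:

```python
def fetch_windowed(search_fn, windows, limit=20000):
    """Yield attributes window by window, keeping every page shallow.

    search_fn(page=..., limit=..., window=...) is expected to behave like
    client.search(controller="attributes", ..., timestamp=window,
                  page=page, limit=limit) and return {"Attribute": [...]}.
    """
    for window in windows:
        page = 1
        while True:
            attrs = search_fn(page=page, limit=limit, window=window)["Attribute"]
            if not attrs:
                break
            yield from attrs
            if len(attrs) < limit:
                break  # short page means this window is exhausted
            page += 1

# Stub demonstrating the paging logic: two windows with 25 and 7 items.
data = {("2024-01-01", "2024-06-30"): [f"a{i}" for i in range(25)],
        ("2024-07-01", "2024-12-31"): [f"b{i}" for i in range(7)]}

def fake_search(page, limit, window):
    items = data[window]
    start = (page - 1) * limit
    return {"Attribute": items[start:start + limit]}

result = list(fetch_windowed(fake_search, list(data), limit=10))
assert len(result) == 32
```

Whether this actually avoids the slowdown depends on how MISP executes the timestamp filter, so treat it as an idea rather than a fix.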
