Skip to content

Conversation

@eric-eclecticiq
Copy link
Contributor

@eric-eclecticiq eric-eclecticiq commented Dec 11, 2025

Motivation

Getting objects from a collection with enterprise-attack data was very slow. It took more than 12s on my computer while there are only ~20k objects. Also, adding a limit=1 did not improve the performance.

Optimizations

Query to find last/first

Note: last is the default is no option is provided on the API.

Initially the query was:

SELECT ...
FROM
    opentaxii_stixobject
WHERE
    opentaxii_stixobject.collection_id = '9025dea8-7be9-4693-86d3-95d31b32bcbe'
    AND opentaxii_stixobject.pk IN (
        SELECT
            opentaxii_stixobject.pk
        FROM
            opentaxii_stixobject
            JOIN (
                SELECT
                    opentaxii_stixobject.id AS id,
                    max(opentaxii_stixobject.version) AS max_version
                FROM
                    opentaxii_stixobject
                WHERE
                    opentaxii_stixobject.collection_id = '9025dea8-7be9-4693-86d3-95d31b32bcbe'
                GROUP BY
                    opentaxii_stixobject.id
            ) AS anon_1 ON opentaxii_stixobject.id = anon_1.id
            AND opentaxii_stixobject.version = anon_1.max_version
    )
ORDER BY
    opentaxii_stixobject.date_added,
    opentaxii_stixobject.id;

with plan:

SEARCH opentaxii_stixobject USING INDEX ix_opentaxii_stixobject_collection_id (collection_id=?)
LIST SUBQUERY 2
CO-ROUTINE anon_1
SEARCH opentaxii_stixobject USING COVERING INDEX sqlite_autoindex_opentaxii_stixobject_2 (collection_id=?)
SCAN anon_1
SEARCH opentaxii_stixobject USING INDEX ix_opentaxii_stixobject_version (version=?)
CREATE BLOOM FILTER
USE TEMP B-TREE FOR ORDER BY

I updated the query to only look for the versions of the current row, not all:

SELECT ...
FROM
    opentaxii_stixobject
WHERE
    opentaxii_stixobject.collection_id = ?
    AND (
        SELECT
            max(oso_max.version) AS max_version
        FROM
            opentaxii_stixobject AS oso_max
        WHERE
            oso_max.collection_id = ?
            AND oso_max.id = opentaxii_stixobject.id
        GROUP BY
            oso_max.id
    ) = opentaxii_stixobject.version
ORDER BY
    opentaxii_stixobject.date_added,
    opentaxii_stixobject.id

with new plan:

SEARCH opentaxii_stixobject USING INDEX ix_opentaxii_stixobject_collection_id (collection_id=?)
CORRELATED SCALAR SUBQUERY 1
SEARCH oso USING COVERING INDEX sqlite_autoindex_opentaxii_stixobject_2 (collection_id=? AND id=?)
USE TEMP B-TREE FOR ORDER BY

The query is also much more simpler.

The index is not compatible with collection id filter

Change ix_opentaxii_stixobject_date_added_id to ix_opentaxii_stixobject_col_date_added_id by adding the collection_id as first parameter.

New plan:

SEARCH opentaxii_stixobject USING INDEX ix_opentaxii_stixobject_col_date_added_id (collection_id=?)
CORRELATED SCALAR SUBQUERY 1
SEARCH oso USING COVERING INDEX sqlite_autoindex_opentaxii_stixobject_2 (collection_id=? AND id=?)

Do not count more than necessary

The API puts in the response, a more boolean to inform if the pagination could return more results. Initially, it was counting all elements independently of the limit. Now, it is counting at most limit + 1 objects.

Results

  1. Fetch all objects: 6s
  2. Fetch 10 objects: 130ms

@eric-eclecticiq eric-eclecticiq requested a review from saaj December 15, 2025 09:20
Copy link
Contributor

@saaj saaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

),
)
version_filters.append(
max_versions_subq == taxii2models.STIXObject.version
Copy link
Contributor

@saaj saaj Dec 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer a JOIN to a correlated subquery. Some databases are good at rewriting such queries automatically, but I wouldn't bet on all the supported ones doing it well. And JOINs also easier to read (at SQL level). But not sure how good it lends itself to given version filtering code. So up to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you mean JOIN. The problem I saw is that TAXII2 is allowing to return both the first and last elements at the same time or any specific version. But in the first case, that would add 2 disjoint joins that would return no result.

@eric-eclecticiq eric-eclecticiq merged commit 04c1e6c into master Dec 15, 2025
9 checks passed
@eric-eclecticiq eric-eclecticiq deleted the get-objects-performance-optimization branch December 15, 2025 16:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants