Reindex subset of vertices #4726

Open: wants to merge 1 commit into base: master

ntisseyre
Contributor

@ntisseyre ntisseyre commented Nov 15, 2024

Summary

This PR optimizes reindexing in JanusGraph by allowing a subset of vertices to be reindexed instead of scanning the entire storage.
This improves performance substantially when the vertices that need reindexing are already known.

NOTE

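This feature is currently supported only by the CQL storage backend. Other storage backends still need to implement it; until they do, the new getKeys overload in KeyColumnValueStore.java throws NotImplementedException.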

KeyColumnValueStore.java

KeyIterator getKeys(final List<StaticBuffer> keys, final SliceQuery query, final StoreTransaction txh) throws BackendException {
    throw new NotImplementedException();
}
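
The snippet above suggests an opt-in pattern: the subset variant of getKeys fails by default, and a backend that supports it overrides the method. A minimal, self-contained sketch of that pattern follows. Store, LegacyStore, and SubsetStore are hypothetical stand-ins rather than JanusGraph types, and UnsupportedOperationException stands in for NotImplementedException so that the example needs only the JDK.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of an opt-in subset scan; every type here is a hypothetical stand-in. */
final class OptInSubsetScanSketch {

    /** Stand-in for a key-column-value store interface. */
    interface Store {
        /** Full scan over all keys; every backend provides it. */
        List<String> getKeys();

        /** Subset scan; fails unless a backend overrides it. */
        default List<String> getKeys(List<String> keys) {
            throw new UnsupportedOperationException("subset scan not implemented");
        }
    }

    /** Backend that only supports the full scan and inherits the failing default. */
    static final class LegacyStore implements Store {
        @Override
        public List<String> getKeys() {
            return List.of("a", "b", "c");
        }
    }

    /** Backend that opts in by overriding the subset variant. */
    static final class SubsetStore implements Store {
        private final List<String> all = List.of("a", "b", "c");

        @Override
        public List<String> getKeys() {
            return all;
        }

        @Override
        public List<String> getKeys(List<String> keys) {
            // Return only the requested keys that actually exist in storage.
            List<String> found = new ArrayList<>();
            for (String key : keys) {
                if (all.contains(key)) {
                    found.add(key);
                }
            }
            return found;
        }
    }
}
```

With this shape, existing backends compile unchanged and fail only if a caller actually requests a subset scan.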

Motivation

Previously, reindexing required scanning all vertices in storage, which could be highly resource-intensive and time-consuming, particularly in large datasets.
This update enables users to focus on a targeted subset of vertices, reducing the time and computational load for reindexing. This is especially beneficial in environments where only specific vertices are relevant to a given index or data update.

Changes

  • Added the ability to specify a subset of vertices to include in the reindexing process.
  • Optimized the indexing engine to skip unnecessary vertices, focusing only on those specified in the subset.

API in JanusGraphManagement

/**
 * Updates the provided index according to the given {@link SchemaAction} for
 * the given subset of vertices.
 *
 * @param index the index to update
 * @param updateAction the action to apply to the index
 * @param vertexOnly list of vertex IDs; only these vertices are considered for the index update
 * @return a future that completes when the index action is done
 */
ScanJobFuture updateIndex(Index index, SchemaAction updateAction, List<Object> vertexOnly);
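
To illustrate why passing vertex IDs helps, here is a toy model that contrasts the full scan with the subset path. It is only a sketch under simplifying assumptions: SubsetReindexSketch, its in-memory maps, and the methods reindexAll, reindex, and lookup are hypothetical and do not reflect how JanusGraph stores vertices or runs scan jobs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of full versus subset reindexing; all names here are hypothetical. */
final class SubsetReindexSketch {
    /** Stand-in for storage: vertex id -> value of the indexed property. */
    private final Map<Long, String> vertices = new LinkedHashMap<>();
    /** Stand-in for the index: property value -> ids of matching vertices. */
    private final Map<String, List<Long>> index = new HashMap<>();

    void addVertex(long id, String name) {
        vertices.put(id, name);
    }

    /** Previous behavior: visit every vertex in storage. Returns vertices visited. */
    int reindexAll() {
        vertices.forEach(this::indexVertex);
        return vertices.size();
    }

    /** Subset behavior: visit only the listed ids. Returns vertices visited. */
    int reindex(List<Long> vertexOnly) {
        int visited = 0;
        for (Long id : vertexOnly) {
            String name = vertices.get(id);
            if (name != null) { // ignore ids that are not in storage
                indexVertex(id, name);
                visited++;
            }
        }
        return visited;
    }

    List<Long> lookup(String name) {
        return index.getOrDefault(name, List.of());
    }

    private void indexVertex(Long id, String name) {
        List<Long> ids = index.computeIfAbsent(name, k -> new ArrayList<>());
        if (!ids.contains(id)) {
            ids.add(id);
        }
    }

    public static void main(String[] args) {
        SubsetReindexSketch graph = new SubsetReindexSketch();
        for (long id = 1; id <= 1000; id++) {
            graph.addVertex(id, "name-" + id);
        }
        System.out.println(graph.reindex(List.of(7L, 42L))); // prints 2
        System.out.println(graph.reindexAll());              // prints 1000
    }
}
```

In the model, the subset path touches two vertices while the full scan touches all one thousand, which is the saving the PR targets when the affected vertices are known in advance.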

Benefits

  • Improved Performance: Narrowing the set of vertices makes reindexing faster, because only the specified vertices are read.
  • Resource Optimization: Reduces CPU and memory usage during reindexing by avoiding a full scan.
  • Enhanced Flexibility: This feature allows users to update specific sections of the graph more easily without impacting the entire dataset.

Backward Compatibility

This feature is backward compatible and does not impact existing functionality. Calls that do not specify a subset keep the previous behavior of scanning the entire storage.

@porunov porunov added this to the 1.2.0 milestone Nov 21, 2024
Member

@porunov porunov left a comment

Thank you @ntisseyre !
Looks great! I have just two small comments.

Comment on lines +833 to +836
public static final ConfigOption<Integer> KEYS_SIZE = new ConfigOption<>(STORAGE_NS,"keys-size",
"The maximum amount of keys/partitions to retrieve from distributed storage system by JanusGraph in a single request.",
ConfigOption.Type.MASKABLE, 100);

It seems this config was not added to the configuration reference, which is why CI is failing.
Could you please execute mvn --quiet clean install -DskipTests=true -pl janusgraph-doc -am and amend your commit? This will automatically re-generate configuration reference documentation.

import java.util.List;
import java.util.function.Function;

public class CQLSubsetIterator<TItem> implements Iterator<TItem> {

(nitpick)
Codacy suggests:

Generics names should be a one letter long and upper case.

I usually use all upper case, but not always one letter.
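
For readers without the diff at hand, the keys-size option and the CQLSubsetIterator declaration together suggest an iterator that fetches the requested keys in bounded batches. The excerpt does not show how the class is implemented, so the following is only one plausible shape, written with single-letter type parameters as Codacy suggests. BatchedSubsetIterator, its constructor arguments, and the fetchBatch function are assumptions made for illustration.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

/**
 * Hypothetical sketch: iterates over results for a list of keys, requesting
 * at most batchSize keys per call to the fetch function.
 */
final class BatchedSubsetIterator<K, T> implements Iterator<T> {
    private final List<K> keys;
    private final int batchSize;
    private final Function<List<K>, Iterator<T>> fetchBatch;
    private int nextStart = 0;
    private Iterator<T> current = Collections.emptyIterator();

    BatchedSubsetIterator(List<K> keys, int batchSize, Function<List<K>, Iterator<T>> fetchBatch) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.keys = keys;
        this.batchSize = batchSize;
        this.fetchBatch = fetchBatch;
    }

    @Override
    public boolean hasNext() {
        // Fetch further batches lazily until one yields a result or the keys run out.
        while (!current.hasNext() && nextStart < keys.size()) {
            int end = Math.min(nextStart + batchSize, keys.size());
            current = fetchBatch.apply(keys.subList(nextStart, end));
            nextStart = end;
        }
        return current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}
```

Fetching lazily means a batch is requested only when the consumer has drained the previous one, so the number of keys in flight stays bounded by the batch size.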
