Remove n^2 loops #94
Comments
I was considering an approach where we first convert the data into a …

Example: `node_suid_to_node_name()`

Reference:
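A minimal sketch of that kind of approach, assuming the idea is to turn the node table into a plain dict so each SUID-to-name lookup is O(1) instead of a scan; the table contents and column names below are illustrative, not py4cytoscape internals:

```python
import pandas as pd

# Illustrative node table, similar in shape to what the Cytoscape node table provides.
node_table = pd.DataFrame({'SUID': [101, 102, 103],
                           'name': ['YDL194W', 'YDR277C', 'YBR043C']})

# Build the SUID -> name mapping once ...
suid_to_name = dict(zip(node_table['SUID'], node_table['name']))

# ... then each translation is a constant-time dictionary lookup instead of a scan per SUID.
suids = [103, 101]
names = [suid_to_name[suid] for suid in suids]
print(names)  # ['YBR043C', 'YDL194W']
```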
Hi, Harsh ... This looks plausible ... good work. To test this, you'd need to do three things:
The quick and easy test would be …
Feel free to ask questions as they arise. You're on a good track.
@bdemchak I'll get started on it right away.
@bdemchak Hi Barry, I was having some minor issues with test cases. I have figured out the problem. Sorry for not updating you about this sooner. I will finish up and ensure everything works as expected, one last time. Thank you for your patience!
@GeneCodeSavvy Thanks, Harsh ... I'm the only one that ever runs those tests, so it's easy to believe that there would be issues when run by someone else. Just curious ... what did you run into? Maybe we can smooth some rough spots over?
@bdemchak Apologies for the confusion. The issue wasn't with the tests themselves. The optimized …

In hindsight, this was straightforward, but it took me longer than expected to resolve.
I created a very large DataFrame with 1 million nodes, each having a unique SUID and a randomly generated name. As for the issue I was encountering, I added two checks …
Thanks, Harsh ... this is promising ... I like the performance and you have good ideas. Can you package this as a pull request into the 1.9.0 branch (not master)? It should have two parts:
You'll want to insert your large dataframe tests into the applicable unit tests (e.g., …).

The test changes aren't so interesting and don't contribute to the actual py4cytoscape library ... but they serve two critical functions. First, to keep us honest that we've actually tested all of the cases. Second, to give us confidence that future library modifications don't break existing py4cytoscape functionality. You've already seen the benefit of this when you discovered that your original code was missing a check for a mixed list of SUIDs and names.

Since this is your first time through this process, I'd expect that we'll likely iterate on the pull request before it's suitable for integration. If we can get to that point, we'll include it in the 1.10.0 release, which can happen right after the pull request is in good shape. OK?

(Well done!)
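A rough illustration of what such a large-dataframe test might look like, assuming the unittest style used by the project; the class and method names here are hypothetical, not the actual py4cytoscape test suite, and a running Cytoscape instance is required:

```python
import unittest
import pandas as pd
import py4cytoscape as p4c


class LargeDataFrameTests(unittest.TestCase):
    """Hypothetical scale test for create_network_from_data_frames()."""

    def test_create_network_from_large_data_frames(self):
        # Build a chain-shaped network large enough to expose quadratic behavior.
        n = 100000
        nodes = pd.DataFrame({'id': [f'node {i}' for i in range(n)]})
        edges = pd.DataFrame({'source': [f'node {i}' for i in range(n - 1)],
                              'target': [f'node {i + 1}' for i in range(n - 1)],
                              'interaction': ['interacts with'] * (n - 1)})
        suid = p4c.create_network_from_data_frames(nodes, edges, title='scale test')
        try:
            self.assertEqual(p4c.get_node_count(network=suid), n)
        finally:
            p4c.delete_network(network=suid)


if __name__ == '__main__':
    unittest.main()
```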
@bdemchak I now fully understand the critical role that the test changes play. I appreciate the guidance and am ready to iterate on this pull request to ensure it's in great shape for integration.
Hi Barry, here is the modified function I came up with before giving up and submitting the PR. It is generally about twice as fast as the current version, but not fast enough.

```python
def create_network_from_data_frames(nodes=None, edges=None, title='From dataframe',
                                    collection='My Dataframe Network Collection', base_url=DEFAULT_BASE_URL, *,
                                    node_id_list='id', source_id_list='source', target_id_list='target',
                                    interaction_type_list='interaction', batch_size=10000):
    # Validate input
    if nodes is None and edges is None:
        raise CyError('Must provide either nodes or edges')

    # Create nodes from edges if not provided
    if nodes is None:
        unique_nodes = pd.concat([edges[source_id_list], edges[target_id_list]]).unique()
        nodes = pd.DataFrame({node_id_list: unique_nodes})

    # Prepare nodes JSON using vectorized operations
    nodes = nodes.drop(['SUID'], axis=1, errors='ignore')
    json_nodes = [{'data': {node_id_list: node}} for node in nodes[node_id_list].values]

    # Prepare edges JSON using vectorized operations
    json_edges = []
    if edges is not None:
        edges = edges.drop(['SUID'], axis=1, errors='ignore')
        if interaction_type_list not in edges.columns:
            edges[interaction_type_list] = 'interacts with'
        edges['name'] = edges.apply(
            lambda row: f"{row[source_id_list]} ({row[interaction_type_list]}) {row[target_id_list]}", axis=1)
        edges_sub = edges[[source_id_list, target_id_list, interaction_type_list, 'name']].values
        json_edges = [{'data': {'name': edge[3], 'source': edge[0], 'target': edge[1], 'interaction': edge[2]}}
                      for edge in edges_sub]

    # Batch processing
    network_suid = None
    for i in range(0, len(json_nodes), batch_size):
        batch_nodes = json_nodes[i:i + batch_size]
        batch_edges = [edge for edge in json_edges
                       if edge['data']['source'] in {node['data'][node_id_list] for node in batch_nodes}
                       or edge['data']['target'] in {node['data'][node_id_list] for node in batch_nodes}]
        json_network = {
            'data': [{'name': title}],
            'elements': {
                'nodes': batch_nodes,
                'edges': batch_edges
            }
        }
        if network_suid is None:
            # Create the network for the first batch
            network_suid = commands.cyrest_post('networks', parameters={'title': title, 'collection': collection},
                                                body=json_network, base_url=base_url)['networkSUID']
        else:
            # Add to the existing network for subsequent batches
            commands.cyrest_post(f'networks/{network_suid}/nodes', body=batch_nodes, base_url=base_url)
            commands.cyrest_post(f'networks/{network_suid}/edges', body=batch_edges, base_url=base_url)

    # Load node attributes
    node_attrs = set(nodes.columns) - {node_id_list}
    if node_attrs:
        attr_data = nodes.dropna(subset=node_attrs)
        tables.load_table_data(attr_data, data_key_column=node_id_list, table_key_column='id',
                               network=network_suid, base_url=base_url)

    # Load edge attributes
    if edges is not None:
        edge_attrs = set(edges.columns) - {source_id_list, target_id_list, interaction_type_list, 'name'}
        if edge_attrs:
            attr_data = edges[['name'] + list(edge_attrs)].dropna()
            tables.load_table_data(attr_data, data_key_column='name', table='edge', table_key_column='name',
                                   network=network_suid, base_url=base_url)

    # Apply default style and layout
    commands.commands_post('vizmap apply styles="default"', base_url=base_url)
    layouts.layout_network(network=network_suid, base_url=base_url)
    return network_suid
```

For some reason, I get the following error when I try to create a network with more than 10 thousand nodes, with the current function as well as the optimized version: …
Here is a function that can be used to create dummy dataframes for node tables and edge tables:

```python
import pandas as pd
import numpy as np
from random import choices
from string import ascii_uppercase, digits


def generate_large_network_data(num_nodes=100000, num_edges=100000):
    """
    Generate data for a large network with the specified number of nodes and edges.

    Args:
        num_nodes (int): Number of nodes to generate. Default is 100,000.
        num_edges (int): Number of edges to generate. Default is 100,000.

    Returns:
        tuple: A tuple containing two pandas DataFrames:
            - nodes_data: DataFrame with node data (id, name)
            - edges_data: DataFrame with edge data (source, target, interaction)
    """
    # Generate node data
    node_ids = [f"node {i}" for i in np.arange(1, num_nodes + 1).tolist()]
    node_names = [''.join(choices(ascii_uppercase + digits, k=8)) for _ in range(num_nodes)]
    nodes_data = pd.DataFrame({
        'id': node_ids,
        'name': node_names,
        # 'attribute': node_attributes
    })

    # Generate edge data
    source_nodes = np.random.choice(node_ids, num_edges).tolist()
    target_nodes = np.random.choice(node_ids, num_edges).tolist()
    interactions = np.random.choice(['activates', 'inhibits', 'binds'], num_edges).tolist()
    edges_data = pd.DataFrame({
        'source': source_nodes,
        'target': target_nodes,
        'interaction': interactions,
        # 'weight': weights
    })

    return nodes_data, edges_data


# Example usage:
nodes, edges = generate_large_network_data()
```
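A usage sketch tying the two snippets together, assuming a running Cytoscape instance reachable at DEFAULT_BASE_URL and both definitions above in scope; the title and collection values are illustrative:

```python
# Generate the dummy tables, then build the network with the modified function above.
nodes, edges = generate_large_network_data(num_nodes=100000, num_edges=100000)
suid = create_network_from_data_frames(nodes, edges,
                                       title='Large test network',
                                       collection='Scale tests')
print(f'Created network SUID: {suid}')
```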
@GeneCodeSavvy Hi, Harsh ... good news and bad news. Good news: 1.10.0 was sent to PyPI today ... thanks for your help! Bad news: I missed a supplementary test that showed an issue with the new code. Would you mind looking here?
@bdemchak Hi Barry, Thanks for the update! I've looked into the issue and made a small fix. I've explained the root cause in issue #142 and have already submitted a new PR (#143) to resolve it. Since version 1.10.0 has already been published to PyPI, I'm curious to know what the next steps would be in a situation like this. As a contributor, it would be a great learning experience for me to understand how to handle such scenarios moving forward.
@GeneCodeSavvy We could retract the release, but we'd still have to replace it with 1.11.0 ... there's no slip-replacement of 1.10.0. Not a problem, though, as long as we don't take a long time doing it. I did notice your replacement PR ... the new branch is 1.11.0 because 1.10.0 was sealed when the release was made. When I get a chance, I'll try to change your PR so that it applies to 1.11.0 instead of 1.10.0. If I fail, I'll ask you to re-issue it ... but please let me see if I can do it from here. (This happens often enough that it would be good to handle it gracefully on my end.)
@bdemchak Okay |
There are a number of functions that contain checks for a node or edge list being contained within a network. For very large networks and very large lists, this becomes an n^2 operation. I recently verified that for a million node network, the check in load_table_data() took longer than 2 days (before I gave up).
The general pattern is to create a comprehension and then test whether True or False is in the comprehension.
For example:
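The original snippet isn't reproduced above, but a minimal sketch of the pattern being described, along with a set-based alternative, might look like the following; `all_names` and `node_list` are placeholder names, not the library's actual variables:

```python
# Illustrative data: names present in the network's node table, and names a caller passed in.
all_names = ['YDL194W', 'YDR277C', 'YBR043C']
node_list = ['YDL194W', 'BOGUS']

# The pattern described above: the comprehension rescans all_names for every entry
# in node_list, so validating N names against an N-node network is O(N^2).
test = [name in all_names for name in node_list]
if False in test:
    print('some names are not in the network')

# A linear alternative: build a set once (O(N)); each membership test is then O(1) on average.
all_names_set = set(all_names)
missing = [name for name in node_list if name not in all_names_set]
if missing:
    print(f'names not in the network: {missing}')
```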
These can be found in: