Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove n^2 loops #94

Closed
bdemchak opened this issue Aug 3, 2022 · 14 comments
Closed

Remove n^2 loops #94

bdemchak opened this issue Aug 3, 2022 · 14 comments

Comments

@bdemchak
Copy link
Collaborator

bdemchak commented Aug 3, 2022

There are a number of functions that contain checks for a node or edge list being contained within a network. For very large networks and very large lists, this becomes an n^2 operation. I recently verified that for a million node network, the check in load_table_data() took longer than 2 days (before I gave up).

The general pattern is to create a comprehension and then test whether True or False is in the comprehension.

For example:

    test_present = [x in all_names   for x in node_suids]
    if not False in test_present:
        return node_suids

These can be found in:

    py4cytoscape_utils:node_suid_to_node_name()
    py4cytoscape_utils:edge_suid_to_edge_name()
    py4cytoscape_utils:_item_to_suid()
@GeneCodeSavvy
Copy link
Collaborator

I was considering an approach where we first convert the data into a dictionary for SUID to name mappings, and store all names in a set. The dictionary would provide O(1) average-time lookups for SUIDs, and the set would also offer O(1) average-time lookups for names. After setting up these structures, we could process the input list by checking whether each item is a SUID or a name, thereby minimizing the need for repeated linear searches. This approach would allow us to process the input list in O(M) time, where M is the number of SUIDs. Should I create a PR to the branch 1.10?

Example: node_suid_to_node_name()
Screenshot 2024-09-07 at 13 00 23

Reference:
https://wiki.python.org/moin/TimeComplexity

@bdemchak
Copy link
Collaborator Author

@GeneCodeSavvy

Hi, Harsh ...

This looks plausible ... good work.

To test this, you'd need to do three things:

  1. Create a very large test case that makes the original code look slow, and then show that your new code has substantial improvement.

  2. Execute the test_py4cytoscape_utils:test_node_suid_to_node_name() unit test (or all of the test_py4cytoscape_utils tests) to verify sanity and reasonable edge cases. To execute tests, start by looking over deep testing.

The quick and easy test would be ./runalltests.bat test_py4cytoscape_utils.py. Remember to have started Cytoscape before running the test.

  1. Execute the entire test suite (./runalltests.bat). This will take a LONG time, and a few errors might pop out, especially around the apps tests. To get a feel for "normal" errors, you might run the suite before making your changes, and then after.

Feel free to ask questions as they arise. You're on a good track.

@GeneCodeSavvy
Copy link
Collaborator

@bdemchak I'll get started on it right away.

@GeneCodeSavvy
Copy link
Collaborator

@bdemchak Hi Barry, I was having some minor issues with test cases. I have figured out the problem. Sorry for not updating you about this sooner. I will finish up and ensure everything works as expected, one last time. Thank you for your patience!

@bdemchak
Copy link
Collaborator Author

@GeneCodeSavvy Thanks, Harsh ... I'm the only one that ever runs those tests, so it's easy to believe that there would be issues when run by someone else. Just curious ... what did you run into? Maybe we can smooth some rough spots over?

@GeneCodeSavvy
Copy link
Collaborator

@bdemchak Apologies for the confusion. The issue wasn't with the tests themselves. The optimized py4cytoscape_utils:node_suid_to_node_name() was failing the following test:

suids_with_name = suids.copy()
suids_with_name.append('YBR043C')
self.assertRaises(CyError, node_name_to_node_suid, suids_with_name)

In hindsight, this was straightforward, but it took me longer than expected to resolve.

@GeneCodeSavvy
Copy link
Collaborator

GeneCodeSavvy commented Sep 12, 2024

image

I created a very large DataFrame with 1 million nodes, each having a unique SUID and randomly generated name.
I also generated test lists of varying sizes (1,000, 10,000, 100,000) by sampling random SUIDs from the DataFrame.
The optimized version completed in under 0.24 seconds for the largest test size, while the original version took over 1,145 seconds (nearly 20 minutes) for the same size.

Screenshot 2024-09-13 at 05 08 47

Screenshot 2024-09-13 at 05 02 29

Screenshot 2024-09-13 at 05 03 47

As for the issue, I was encountering - I add a two checks are_all_suids = all(isinstance(item, int) for item in node_suids) and are_all_names = all(isinstance(item, str) for item in node_suids). Got this idea from the normalize_list() function.

Screenshot 2024-09-13 at 05 04 36

@bdemchak
Copy link
Collaborator Author

@GeneCodeSavvy

Thanks, Harsh ... this is promising ... I like the performance and you have good ideas.

Can you package this as a pull request into the 1.9.0 branch (not master)?

It should have two parts:

  • the py4cytoscape_utils changes
  • the test_py4cytoscape_utils changes

You'll want to insert your large dataframe tests into the applicable unit tests (e.g., test_node_suid_to_node_name(), test_node_name_to_node_suid(), test_edge_suid_to_edge_name(), test_edge_name_to_edge_suid())

The test changes aren't so interesting and don't contribute to the actual py4cytoscape library ... but they serve two critical functions. First, to keep us honest that we've actually tested all of the cases. Second, to give us confidence that future library modifications don't break existing py4cytoscape functionality. You've already seen the benefit of this when you discovered that your original code was missing a check for a mixed list of SUIDs and names.

Since this is your first time through this process, I'd expect that we'll likely iterate on the pull request before it's suitable for integration.

If we can get to that point, we'll include it in the 1.10.0 release, which can happen right after the pull request is in good shape.

OK?

(Well done!)

@GeneCodeSavvy
Copy link
Collaborator

@bmdemchak
Thanks, Barry! I’m glad to hear that the performance improvements look promising. I’ll start preparing the pull request into the 1.9.0 branch.

I now fully understand the critical role that the test changes play. I appreciate the guidance and am ready to iterate on this pull request to ensure it's in great shape for integration.

@GeneCodeSavvy
Copy link
Collaborator

GeneCodeSavvy commented Sep 17, 2024

@bdemchak

Hi Bary,
I added a PR #139 with only the py4cytoscape_utils changes, as couldn't figure out after a way to create a large network with a million nodes efficiently. I experimented with batch processing, also replacing for loops with vector operations.
As shown earlier, I worked on the optimization with the dataframes directly (node table dataframe), not a network. Therefore, for now it seems - only working with an already existing large network is feasible.

Modified function I came up with before giving up and submitting the PR. It is generally twice as fast compared to the current version, but not enough.

def create_network_from_data_frames(nodes=None, edges=None, title='From dataframe',
                                    collection='My Dataframe Network Collection', base_url=DEFAULT_BASE_URL, *,
                                    node_id_list='id', source_id_list='source', target_id_list='target',
                                    interaction_type_list='interaction', batch_size=10000):

    # Validate input
    if nodes is None and edges is None:
        raise CyError('Must provide either nodes or edges')

    # Create nodes from edges if not provided
    if nodes is None:
        unique_nodes = pd.concat([edges[source_id_list], edges[target_id_list]]).unique()
        nodes = pd.DataFrame({node_id_list: unique_nodes})

    # Prepare nodes JSON using vectorized operations
    nodes = nodes.drop(['SUID'], axis=1, errors='ignore')
    json_nodes = [{'data': {node_id_list: node}} for node in nodes[node_id_list].values]

    # Prepare edges JSON using vectorized operations
    json_edges = []
    if edges is not None:
        edges = edges.drop(['SUID'], axis=1, errors='ignore')
        if interaction_type_list not in edges.columns:
            edges[interaction_type_list] = 'interacts with'
        
        edges['name'] = edges.apply(lambda row: f"{row[source_id_list]} ({row[interaction_type_list]}) {row[target_id_list]}", axis=1)
        
        edges_sub = edges[[source_id_list, target_id_list, interaction_type_list, 'name']].values
        json_edges = [{'data': {'name': edge[3], 'source': edge[0], 'target': edge[1], 'interaction': edge[2]}} for edge in edges_sub]

    # Batch processing
    network_suid = None
    for i in range(0, len(json_nodes), batch_size):
        batch_nodes = json_nodes[i:i+batch_size]
        batch_edges = [edge for edge in json_edges if edge['data']['source'] in {node['data'][node_id_list] for node in batch_nodes} or edge['data']['target'] in {node['data'][node_id_list] for node in batch_nodes}]
        
        json_network = {
            'data': [{'name': title}],
            'elements': {
                'nodes': batch_nodes,
                'edges': batch_edges
            }
        }

        if network_suid is None:
            # Create the network for the first batch
            network_suid = commands.cyrest_post('networks', parameters={'title': title, 'collection': collection},
                                                body=json_network, base_url=base_url)['networkSUID']
        else:
            # Add to the existing network for subsequent batches
            commands.cyrest_post(f'networks/{network_suid}/nodes', body=batch_nodes, base_url=base_url)
            commands.cyrest_post(f'networks/{network_suid}/edges', body=batch_edges, base_url=base_url)

    # Load node attributes
    node_attrs = set(nodes.columns) - {node_id_list}
    if node_attrs:
        attr_data = nodes.dropna(subset=node_attrs)
        tables.load_table_data(attr_data, data_key_column=node_id_list, table_key_column='id', 
                               network=network_suid, base_url=base_url)

    # Load edge attributes
    if edges is not None:
        edge_attrs = set(edges.columns) - {source_id_list, target_id_list, interaction_type_list, 'name'}
        if edge_attrs:
            attr_data = edges[['name'] + list(edge_attrs)].dropna()
            tables.load_table_data(attr_data, data_key_column='name', table='edge', table_key_column='name', 
                                   network=network_suid, base_url=base_url)

    # Apply default style and layout
    commands.commands_post('vizmap apply styles="default"', base_url=base_url)
    layouts.layout_network(network=network_suid, base_url=base_url)

    return network_suid

For some reason, I get the following error when I try to create a network with more than 10 thousand nodes - with the current function as well as the optimized version

In cyrest_post(): Could not parse the given network JSON: target node is not a member of this network
{
	"name": "CyError",
	"message": "In cyrest_post(): Could not parse the given network JSON: target node is not a member of this network",
	"stack": "---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
File ~/Desktop/100xdevs/py4cytoscape/py4cytoscape/commands.py:191, in cyrest_post(operation, parameters, body, base_url, require_json)
    190 r = _do_request('POST', url, params=parameters, json=body, headers = {'Content-Type': 'application/json'}, base_url=base_url)
--> 191 r.raise_for_status()
    192 try:

File ~/Desktop/100xdevs/py4cytoscape/venv/lib/python3.12/site-packages/requests/models.py:1024, in Response.raise_for_status(self)
   1023 if http_error_msg:
-> 1024     raise HTTPError(http_error_msg, response=self)

HTTPError: 412 Client Error: Precondition Failed for url: http://127.0.0.1:1234/v1/networks?title=Large+Network1&collection=My+Dataframe+Network+Collection

During handling of the above exception, another exception occurred:

CyError                                   Traceback (most recent call last)
Cell In[102], line 1
----> 1 network_suid = p4c.create_network_from_data_frames(
      2     nodes, 
      3     edges, 
      4     title='Large Network1',
      5     node_id_list='id',  # Specify which column contains the node IDs
      6     source_id_list='source',
      7     target_id_list='target'
      8 )

File ~/Desktop/100xdevs/py4cytoscape/py4cytoscape/py4cytoscape_logger.py:133, in cy_log.<locals>.wrapper_log(*args, **kwargs)
    131     return log_return(func, value)
    132 except Exception as e:
--> 133     log_exception(func, e)
    134 finally:
    135     log_finally()

File ~/Desktop/100xdevs/py4cytoscape/py4cytoscape/py4cytoscape_logger.py:130, in cy_log.<locals>.wrapper_log(*args, **kwargs)
    128 log_incoming(func, *args, **kwargs)
    129 try:
--> 130     value = func(*args, **kwargs) # Call function being logged
    131     return log_return(func, value)
    132 except Exception as e:

File ~/Desktop/100xdevs/py4cytoscape/py4cytoscape/networks.py:1086, in create_network_from_data_frames(nodes, edges, title, collection, base_url, node_id_list, source_id_list, target_id_list, interaction_type_list, batch_size)
   1038 @cy_log
   1039 def create_network_from_data_frames(nodes=None, edges=None, title='From dataframe',
   1040                                     collection='My Dataframe Network Collection', base_url=DEFAULT_BASE_URL, *,
   1041                                     node_id_list='id', source_id_list='source', target_id_list='target',
   1042                                     interaction_type_list='interaction'):
   1043     \"\"\"Create a network from data frames.
   1044 
   1045     Takes data frames for nodes and edges, as well as naming parameters to generate the JSON data format required by
   1046     the \"networks\" POST operation via CyREST. Returns the network.suid and applies the preferred layout set in
   1047     Cytoscape preferences.
   1048 
   1049     Notes:
   1050         ``nodes`` should contain a column of character strings named: id. This name can be overridden by the arg:
   1051         ``node_id_list``. Additional columns are loaded as node attributes. ``edges`` should contain columns of
   1052         character strings named: source, target and interaction. These names can be overridden by args:
   1053         source_id_list, target_id_list, interaction_type_list. Additional columns are loaded as edge attributes.
   1054         The ``interaction`` list can contain a single value to apply to all rows; and if excluded altogether, the
   1055         interaction type will be set to \"interacts with\". NOTE: attribute values of types (num) will be imported
   1056         as (Double); (int) as (Integer); (chr) as (String); and (logical) as (Boolean). (Lists) will be imported as
   1057         (Lists) in CyREST v3.9+.
   1058 
   1059         Note that the extra ``id`` column is created in the node table because the ``id`` column is mandatory in the
   1060         cytoscape.js format, which is what is sent to Cytoscape.
   1061 
   1062     Args:
   1063         nodes (DataFrame): see details and examples below; default NULL to derive nodes from edge sources and targets
   1064         edges (DataFrame): see details and examples below; default NULL for disconnected set of nodes
   1065         title (str): network name
   1066         collection (str): network collection name
   1067         base_url (str): Ignore unless you need to specify a custom domain,
   1068             port or version to connect to the CyREST API. Default is http://127.0.0.1:1234
   1069             and the latest version of the CyREST API supported by this version of py4cytoscape.
   1070         * :
   1071         node_id_list (str): Name of column in ``nodes`` containing node id
   1072         source_id_list (str): Name of column in ``edges`` containing source node name
   1073         target_id_list (str): Name of column in ``edges``  containing target node name
   1074         interaction_type_list (str): Name of column in ``edges``  containing interaction name
   1075 
   1076     Returns:
   1077         int: The ``SUID`` of the new network
   1078 
   1079     Raises:
   1080         ValueError: if server response has no JSON
   1081         CyError: if network name or SUID doesn't exist
   1082         requests.exceptions.RequestException: if can't connect to Cytoscape or Cytoscape returns an error
   1083 
   1084     Examples:
   1085         >>> node_data = {'id':[\"node 0\",\"node 1\",\"node 2\",\"node 3\"],
-> 1086         >>>              'group':[\"A\",\"A\",\"B\",\"B\"],
   1087         >>>              'score':[20,10,15,5]}
   1088         >>> nodes = df.DataFrame(data=node_data, columns=['id', 'group', 'score'])
   1089         >>> edge_data = {'source':[\"node 0\",\"node 0\",\"node 0\",\"node 2\"],
   1090         >>>              'target':[\"node 1\",\"node 2\",\"node 3\",\"node 3\"],
   1091         >>>              'interaction':[\"inhibits\",\"interacts\",\"activates\",\"interacts\"],
   1092         >>>              'weight':[5.1,3.0,5.2,9.9]}
   1093         >>> edges = df.DataFrame(data=edge_data, columns=['source', 'target', 'interaction', 'weight'])
   1094         >>>
   1095         >>> create_network_from_data_frames(nodes, edges, title='From node & edge dataframe')
   1096         1477
   1097     \"\"\"
   1099     def compute_edge_name(source, target, interaction):
   1100         return source + ' (' + interaction + ') ' + target

File ~/Desktop/100xdevs/py4cytoscape/py4cytoscape/py4cytoscape_logger.py:133, in cy_log.<locals>.wrapper_log(*args, **kwargs)
    131     return log_return(func, value)
    132 except Exception as e:
--> 133     log_exception(func, e)
    134 finally:
    135     log_finally()

File ~/Desktop/100xdevs/py4cytoscape/py4cytoscape/py4cytoscape_logger.py:130, in cy_log.<locals>.wrapper_log(*args, **kwargs)
    128 log_incoming(func, *args, **kwargs)
    129 try:
--> 130     value = func(*args, **kwargs) # Call function being logged
    131     return log_return(func, value)
    132 except Exception as e:

File ~/Desktop/100xdevs/py4cytoscape/py4cytoscape/commands.py:200, in cyrest_post(operation, parameters, body, base_url, require_json)
    198             return r.text
    199 except requests.exceptions.RequestException as e:
--> 200     _handle_error(e)

File ~/Desktop/100xdevs/py4cytoscape/py4cytoscape/commands.py:683, in _handle_error(e, force_cy_error)
    681     else:
    682         show_error(f'In {caller}: {e}\
{content}')
--> 683 raise e

CyError: In cyrest_post(): Could not parse the given network JSON: target node is not a member of this network"
}

Function that can be used to create dummy dataframes for node tables and edge tables

import pandas as pd
import numpy as np
from random import choices
from string import ascii_uppercase, digits

def generate_large_network_data(num_nodes=100000, num_edges=100000):
    """
    Generate data for a large network with specified number of nodes and edges.
    
    Args:
        num_nodes (int): Number of nodes to generate. Default is 1,000,000.
        num_edges (int): Number of edges to generate. Default is 1,000,000.
    
    Returns:
        tuple: A tuple containing two pandas DataFrames:
            - nodes_df: DataFrame with node data (id, name, attribute)
            - edges_df: DataFrame with edge data (source, target, interaction, weight)
    """
    
    # Generate node data
    node_ids = [f"node {i}" for i in np.arange(1, num_nodes + 1).tolist()]
    node_names = [''.join(choices(ascii_uppercase + digits, k=8)) for _ in range(num_nodes)]
    
    nodes_data = pd.DataFrame({
        'id': node_ids,
        'name': node_names,
        # 'attribute': node_attributes
    })
    
    source_nodes = np.random.choice(node_ids, num_edges).tolist()
    target_nodes = np.random.choice(node_ids, num_edges).tolist()
    interactions = np.random.choice(['activates', 'inhibits', 'binds'], num_edges).tolist()
    
    edges_data = pd.DataFrame({
        'source': source_nodes,
        'target': target_nodes,
        'interaction': interactions,
        # 'weight': weights
    })
    
    return nodes_data, edges_data

# Example usage:
nodes, edges = generate_large_network_data()

@bdemchak
Copy link
Collaborator Author

@GeneCodeSavvy Hi, Harsh ... good news and bad news.

Good news: 3.10.0 was sent to PyPI today ... thanks for your help!

Bad news: I missed a supplementary test that showed an issue with the new code. Would you mind looking here?

@GeneCodeSavvy
Copy link
Collaborator

GeneCodeSavvy commented Sep 27, 2024

@bdemchak Hi Barry,

Thanks for the update! I’ve looked into the issue and made a small fix. I’ve explained the root cause in issue #142 and have already submitted a new PR (#143) to resolve it.

Since version 3.10.0 has already been published to PyPI, I’m curious to know what the next steps would be in a situation like this. As a contributor, it would be a great learning experience for me to understand how to handle such scenarios moving forward.

@bdemchak
Copy link
Collaborator Author

@GeneCodeSavvy We could retract the release, but we'd still have to replace it with 1.11.0 ... there's no slip-replacement of 1.10.0. Not a problem, though, as long as we don't take a long time doing it.

I did notice your replacement PR ... the new branch is 1.11.0 because 1.10.0 was sealed when the release was made. When I get a chance, I'll try to change your PR so that it applies to 1.11.0 instead of 1.10.0. If I fail, I'll ask you to re-issue it ... but please let me see if I can do it from here. (This happens often enough that it would be good to handle it gracefully on my end.)

@GeneCodeSavvy
Copy link
Collaborator

@bdemchak Okay

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants