load_parquet slow for kg with 200k nodes #202

Open
davidshumway opened this issue Oct 24, 2021 · 0 comments

Very nice library!

While exploring, I noticed that load_parquet appears to hang when loading a saved parquet file. At the very least, it takes far longer to read the kg from the file than it took to create the original kg. Generating the kg from a CSV (kg.add(...)) takes about 2 minutes, but loading the file has been running for over 15 minutes. Any ideas?

The parquet file is ~9MB, and the kg has 200k nodes and 4 Literal relations per node.

The code to load the file is:

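import time

import kglab

# fresh graph to receive the contents of the saved parquet file
kg2 = kglab.KnowledgeGraph(
  name = "...",
  base_uri = "/ex/",
  namespaces = {
    'sosa': 'http://www.w3.org/ns/sosa/'
  },
)

# time the load from parquet
t0 = time.time()
kg2.load_parquet('kg.parquet')
print('Read time: {}s'.format(round((time.time() - t0), 2)))

# count edges and nodes
# note: `kg` is not defined in this snippet; presumably it is the original
# graph built with kg.add(...). Pass kg2 instead to measure the loaded graph.
measure = kglab.Measure()
measure.measure_graph(kg)
print("edges", measure.get_edge_count())
print("nodes", measure.get_node_count())
# edges 1018040
# nodes 203609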
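
To put the two timings on the same scale, the rough figures above can be turned into edges per second. This is only a sketch using the standard library: the `timed` and `edges_per_second` helpers are hypothetical names and not part of kglab, and the 2 minute and 15 minute durations are the approximate values quoted in this issue, not measurements.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent in the enclosed block under `label`."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - t0


def edges_per_second(edge_count, seconds):
    """Average throughput for a stage that handled `edge_count` edges."""
    return edge_count / seconds


# Approximate figures from this issue (assumptions, not measurements):
# the edge count printed above, "2 minutes" to build, "over 15 minutes" to load.
EDGES = 1018040
build_rate = edges_per_second(EDGES, 2 * 60)
load_rate_upper_bound = edges_per_second(EDGES, 15 * 60)

print("build: ~{:.0f} edges/s".format(build_rate))
print("load:  <{:.0f} edges/s".format(load_rate_upper_bound))

# Stand-in workload so the sketch runs without kglab installed.
results = {}
with timed("stand_in", results):
    sum(range(100000))
print(results)
```

In practice the stand-in would be replaced by the call under test, for example wrapping kg2.load_parquet('kg.parquet') in `with timed("load_parquet", results):`, so that each stage gets its own entry in `results`.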

davidshumway changed the title from "load_parquet hangs for kg with 200k nodes" to "load_parquet slow for kg with 200k nodes" on Oct 24, 2021