Pagination / offset of results for the SPARQL endpoint is faulty #150
Comments
@danizen any ideas here, or examples of successfully using the API for multi-page results?
When we were in the prototype phase, I argued that SPARQL is fine as a query language, but that the whole approach to federated SPARQL and paging is flawed. Virtuoso may also truncate results. I went to bat at work for another team having direct access to the Virtuoso SQL interface (over JDBC) because of this. So I have devoted less time to validating this than I should have, and I will look into it. I have long wanted to automatically add a count of results, but I don't think I have time to do that soon. If I can recreate your issue, I will turn it into an automated test and keep the fix DRY.
I suspect that there is some string processing that adds the LIMIT and OFFSET to the query text, rather than paging from a temporary graph or model. If so, this could be hard to fix. Fixing the results you see would be a pretty fundamental change: we would need a query cache, with a query ID based on the session ID and the query text. A cache similar to the one I propose is described in RFC 1813 and applies to NFS. Virtuoso may also support this itself, and I would have to involve @simonjupp from https://github.com/EBISPOT/lodestar to work out a proper solution. As a shorter-term alternative, you can simply drop the LIMIT 5 clause from your query and it should work. Also, by running a COUNT() aggregation query before your query, you may be able to discover whether Virtuoso has truncated the results.
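The COUNT() check suggested above could be sketched roughly as follows. This is only an illustration: `build_count_query` is a hypothetical helper, not part of the MeSH API or lodestar, and it assumes the endpoint accepts SPARQL 1.1 subqueries and that the inner query carries no LIMIT or OFFSET of its own.

```python
def build_count_query(prefixes: str, select_query: str) -> str:
    """Wrap a SELECT query in a COUNT(*) aggregation (hypothetical helper).

    `prefixes` holds the PREFIX declarations, which must stay at the top
    level of the query; `select_query` is the SELECT without LIMIT/OFFSET.
    """
    return (
        f"{prefixes}\n"
        "SELECT (COUNT(*) AS ?count) WHERE {\n"
        f"  {{ {select_query} }}\n"
        "}"
    )


# Placeholder query for illustration only.
count_query = build_count_query("", "SELECT ?s WHERE { ?s a ?type }")
print(count_query)
```

If the count reported by this query is larger than the number of rows collected across all pages, that would suggest the result set was truncated somewhere.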
Ah, that makes sense: the limit and offset parameters get appended to the query, so my earlier example probably became:
LIMIT 5
LIMIT 10
OFFSET 4
It looks like limit is set to 1000 when no limit is specified or when a limit > 1000 is specified. The easy solution would be to allow higher limits, perhaps restricting queries by some other mechanism, such as execution time, if required. But since the limit is capped at 1000, I tried two methods of determining when pagination has exhausted all results:
|
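For readers following along, one common way to detect exhaustion (not necessarily either of the two methods tried above) is to keep requesting pages until one comes back with fewer rows than the page size. A minimal sketch, assuming a hypothetical `fetch_page(offset, limit)` callable that sends the query with the API's `offset` and `limit` parameters and returns the rows under results.bindings:

```python
from typing import Callable, Dict, Iterator, List


def paginate(
    fetch_page: Callable[[int, int], List[Dict]],
    page_size: int = 1000,
) -> Iterator[Dict]:
    """Yield rows page by page until a short (or empty) page is returned.

    `fetch_page` is a hypothetical callable supplied by the caller; it is
    not part of the MeSH API. `page_size` defaults to the 1000-row cap
    discussed in this thread.
    """
    offset = 0
    while True:
        rows = fetch_page(offset, page_size)
        yield from rows
        if len(rows) < page_size:
            break  # a short page means there is nothing left to fetch
        offset += page_size


# Stand-in for the endpoint: 2500 fake rows served from a list.
fake_rows = [{"n": i} for i in range(2500)]
all_rows = list(paginate(lambda offset, limit: fake_rows[offset:offset + limit]))
print(len(all_rows))  # 2500
```

Paging like this is only reliable if the query has an ORDER BY clause, since SPARQL does not guarantee a stable row order between requests without one.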
Let me think about this a bit.
I see no problems in the code or differences from the EBISPOT upstream, so I will try to duplicate your results. One thing this won't change is whether we return more than 1000 results to a single HTTP request/response. The implementation of limit and offset does not use string concatenation; it modifies the parsed Apache Jena query.
After a review of the code, an attempt to reproduce, and a re-read of the SPARQL spec, I think this is working as designed. The LIMIT is counted from the OFFSET, and is therefore designed to allow you to page through a large result set: https://www.w3.org/TR/rdf-sparql-query/#modOffset There may, however, be some Virtuoso setting limiting the actual result set, and I will check on this tomorrow.
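To make the "LIMIT is counted from the OFFSET" point concrete, here is a small model of the two solution modifiers using a Python list in place of a result set. It is only an analogy for the semantics in the spec, not the Jena implementation; `apply_slice` and the record names are made up for illustration.

```python
def apply_slice(solutions, limit=None, offset=0):
    """Skip `offset` solutions first, then keep at most `limit` of the rest."""
    end = None if limit is None else offset + limit
    return solutions[offset:end]


records = [f"record-{i}" for i in range(1, 11)]

# OFFSET 4 with LIMIT 5 yields five rows starting at the 5th record,
# not a single row.
page = apply_slice(records, limit=5, offset=4)
print(page)
# ['record-5', 'record-6', 'record-7', 'record-8', 'record-9']
```

Under this reading, the five records returned in the original report are what the spec prescribes.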
Thanks @danizen for looking into this and narrowing down where things are going wrong. Here's a versioned link for |
Thanks for making the MeSH RDF SPARQL API. It's been convenient for quick access to MeSH.
I'd like to run a query that returns over 1000 results, and therefore need to figure out how to use pagination with the SPARQL API at https://id.nlm.nih.gov/mesh/sparql. Here's my query to return a table of descriptors:
But I'm having trouble incrementing limit and offset to retrieve all results.
In search of a more reproducible example, I've simplified it to this API call, generated by this python code:
The expected result is to receive a single record (the 5th record), because the query should return 5 records, and the offset is 4. Instead, 5 records are returned. The returned records under results.bindings start with:
So it looks like offset was respected, but something about the SPARQL LIMIT 5 or the API parameter limit=10 does not work.