WIP: support for range-based matching #16
Conversation
```webidl
optional Term|Range[]|Range? predicate,
optional Term|Range[]|Range? object,
optional Term|Range[]|Range? graph
);
```
When would one use `Range` at any position in the quad other than object? I think it should only apply to Literals, which can only appear in the object position.
I have never had to do it myself but I've seen prefix-based queries targeting subject and graph named nodes a few times.
Added for clarity: a prefix-based query is a special case of a range-based query where the range is defined as `{ gte: 'prefix', lte: 'prefix' + delimiter }`.
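As a hypothetical illustration of that reduction (the `'\uffff'` delimiter is an assumption, chosen to sort after any code unit expected in term values):

```js
// Hypothetical helper (not part of any proposal here): express a
// prefix-based query as a { gte, lte } range. '\uffff' is an assumed
// delimiter that sorts after every character expected in term values.
function prefixToRange(prefix) {
  return { gte: prefix, lte: prefix + '\uffff' };
}

const range = prefixToRange('http://example.com/ns#');
// Any IRI starting with the prefix falls within [range.gte, range.lte]
```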
I would propose a slightly different approach. You can have a look at this branch, where I quickly hacked together some interfaces. It lacks some details but should cover the core concept:

```js
{
  type: 'customWhitespaceBeginEnd',
  test: term => term.value === term.value.trim()
}
```
@jacoscaz I very much like this idea! As I see it, this PR is a first step towards pushing filters (in this case range-based filters) into the Source itself. As such, I would propose to have a look at the SPARQL algebra, as that gives us a very good example of what kinds of filters are possible, and how they can be declaratively defined. To make feature detection of filter-based matching easier, I would propose to add an optional method to the interface:

```ts
interface Source {
  matchExpression?(subject: Term, predicate: Term, object: Term, graph: Term, expression: Expression): Stream
}
```

I'm wondering if now would be a good time to also start thinking about (not necessarily defining) how ordering could be requested as well, because that may influence this PR too. Perhaps we can also add a parameter for this.

@bergos I would actually vote against such a `test` function.
@bergos thank you for your example. I now realize I had misunderstood what you had written on Gitter - apologies. Nonetheless, here's my take on changing the `match()` method as you suggest.

On one side, the viability of this would depend on having a shared reference implementation of a filtering iterator that developers of non-optimized `Source` implementations could use.

On the other side, it would still hide whether support for filtering at the persistence layer (optimized) is present within any given `Source` implementation.

All in all, I don't think in-memory filtering should be treated as an equal alternative to persistence-level filtering.
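As a rough illustration, such a shared filtering iterator could boil down to something like the following sketch, which assumes already-materialized quads and a `{ gte, lte }` range over the object term; a real implementation would operate on RDF/JS streams and compare typed literals rather than raw string values:

```js
// Hypothetical in-memory fallback (a sketch, not a spec proposal): filter
// already-materialized quads by a { gte, lte } range over the object term.
// Comparison is plain lexical ordering on term values, a simplification.
function inMemoryRangeFilter(quads, range) {
  return quads.filter((quad) => {
    const value = quad.object.value;
    if (range.gte !== undefined && value < range.gte) return false;
    if (range.lte !== undefined && value > range.lte) return false;
    return true;
  });
}
```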
I like your idea of adding a different method to `Source`. From the perspective of letting consumers know whether support for persistence-level filtering is present within a given `Source`, a dedicated method seems like a good fit.

With regards to what you propose, I like the general direction in which you're going and I definitely agree that sorting belongs in this discussion. However, I would be wary of adding too much complexity to an interface that is otherwise wonderfully simple.

All this is strictly my opinion. I am happy to push the conversation forward. Thank you both for pitching in!

EDIT: grammar fixes
Sure, that's true. I would also be fine with either option.
True, for this specific feature, just doing this would be the simplest thing. Let me give a simple example:

```js
const source = MySource();
source.matchExpression(namedNode('...'), null, variable('o'), null, {
  operator: '<=',
  args: [
    variable('o'),
    literal('2', namedNode('http://www.w3.org/2001/XMLSchema#decimal'))
  ]
});
```

Spec-wise, this would require defining a list of the allowed operators, and what kinds of args they allow.
@rubensworks I intuitively agree and also defer to your expertise. I would suggest keeping the list of operators as small as possible in order to be able to make support for all operators mandatory, at least in the beginning. If an operator is there, all implementors of either the additional interface or the additional method should support it.
Let's wait a day or so for others to have a chance to pitch in and then I'll update the PR.
I think it would be best if people express their expectations for this API. Then we can define requirements and derive proposals for an API. Some thoughts after the first feedback:
About the feedback to my proposal: as I said, it was very quickly hacked together, but with some slight changes, while keeping the core concept, it can handle all the mentioned requirements.
That would be possible by adding a method to check whether the given arguments will use dedicated filter code. I fear more that your proposal will require implementing every kind of filter, but maybe one implementation only wants to handle time series and therefore only filters ranges of datetimes.
I was just lazy. I'm also for a new interface and maybe even a different method.
It's actually the other way around. With the filter logic in the
Could be added with an additional method.
But that's already the case with the current definition of the `match` method.
Would it help to bring back the idea of low-level and high-level interfaces? I have the impression that what @rubensworks proposes looks like a lower-level interface while @jacoscaz's looks a little more high-level.

```js
source.matchExpression(null, null, variable('o'), null, {
  operator: '<=',
  args: [
    variable('o'),
    literal('2', namedNode('http://www.w3.org/2001/XMLSchema#decimal'))
  ]
});
```

seems to map roughly to

```js
store.match(null, null, {
  lte: factory.literal("2", "http://www.w3.org/2001/XMLSchema#decimal"),
}, null);
```

I think we need an extensible and discoverable interface; possibly some stores would want to implement very specific features, e.g. geospatial queries.
I agree and, considering this PR is for a low-level spec, I am inclined to disregard my own proposal in favor of something more foundational like what @rubensworks is proposing.
I agree, and I think we could reconcile this with the above by adding a dedicated method to detect support for a given operator, as per @bergos' suggestion. I would also simplify @rubensworks's proposal. Then, I would add something like a `supportsMatchExpression` method, which would return whether a given expression is supported.
@bergos @rubensworks @elf-pavlik would the following make everyone happy? A `supportsMatchExpression()` method for feature detection, plus a `matchExpression()` method for the actual matching.

The way I am looking at it, the two of these combined get us the following. What do you all think?

EDITED: I forgot - to accurately represent what is going on, I would also foresee the development of a shared `Transform` implementation.
**supportsMatchExpression**

@jacoscaz I very much like the `supportsMatchExpression` idea. The only concern I have here is whether we still need the S/P/O/G parameters here at all. This is related to my second comment.

**matchExpression**

On the one hand, I like your more compact proposal. On the other hand, I think this may be too restrictive to handle certain use cases.
So I don't really agree with the following comment:
As the example above shows, the compact proposal is not as expressive as the previous proposal, so the variable-based proposal cannot be implemented on top of the compact proposal. As I see it, the more compact representation would be a very valuable, more high-level representation that could be implemented internally using the previous (low-level) proposal. Is my concern clear to you?

**Transform**
Perhaps we can discuss this in a separate issue/PR that would be a dependency of this PR?
**supportsMatchExpression**

Happy to hear this! Another place where we're starting to have consensus!

**matchExpression**
Yes, and thank you for taking the time to exemplify your argument - I honestly appreciate it. I think I understand the extent of the limitation that my "compact proposal" brings. To sum it up, it does not allow multi-term filtering expressions within the same query. That was intended on my part in order to try and keep the interface simple. Complexity, I think, is one of the major pain points in RDF-land and I might very well be exceedingly biased against it. All this said, I'd be happy to align with the less-concise version of `matchExpression`.

**Transform**

Yes, although I think that is a discussion we should enter only if we generally agree on the idea and on its complementary nature to what we're discussing here.
Upon reflection, the proposed |
**matchExpression**
Exactly.
Indeed, I have not seen that yet either in JS.
I agree with your assumption that multi-term filter expressions are not going to be supported often by sources. However, that does not mean that none of them will support them. IMO, the purpose of the RDFJS low-level APIs should be to enable implementations with a high level of (potential) expressivity, without making too many assumptions. In the world of SPARQL query optimization, pushing down filters (in their most general form) to the storage layer is a very typical form of optimization. So I think we should remain as open as possible there. I'm not sure if Apache Jena supports pushing down of filters in ARQ (I would assume they do). If so, perhaps we can have a look at their API to get some inspiration on whether or not we are on the right track.
I'm still afraid of what could be perceived as an overly-complex API (wrongfully, but that's often the case) but I appreciate your point and I'm happy to go along with your proposal. Two open questions for me:
Let's let this rest a bit to hear what other people think about this.
I would say yes.
Instead of throwing an error:

```js
if (source.supportsMatchExpression(variable('s'), null, variable('o'), null, {
  operator: '<=',
  args: [
    variable('o'),
    variable('s')
  ]
})) {
  return source.matchExpression(variable('s'), null, variable('o'), null, {
    operator: '<=',
    args: [
      variable('o'),
      variable('s')
    ]
  });
} else if (source.supportsMatchExpression(variable('s'), null, null, null, {
  operator: '<=',
  args: [
    literal('0'), // Just a dummy literal
    variable('s')
  ]
})) {
  const objects = source.match(null, null, variable('o'), null);
  return objects.map(object => source.matchExpression(variable('s'), null, object, null, {
    operator: '<=',
    args: [
      object,
      variable('s')
    ]
  }));
} else {
  const all = source.match(null, null, null, null);
  // Use full in-memory filtering
}
```
I guess this brings in another question. Let's say we have a composite query such as what follows:
This could fail in 3 ways:
Do we consider that expressions can have other expressions as arguments? In that case I think support for all of those expressions may need to be checked separately first.
Yes, although I believe it would still be less complex than working out exactly what is supported by the `Source`.

I'll go back to my last example:
For this query, I've hypothesized three failure modes:
If I were constrained to using the
Do you mean figuring it out as a developer trying queries manually, or would you want to write code which, based on those reasons, would try different queries?
@elf-pavlik both, although a developer would presumably look at the documentation for the given `Source` implementation.
I don't have experience with writing query engines. The simple example provided by @rubensworks, trying specific queries in order, looks quite approachable. I'd like to see a comparable snippet where code tries to create a query based on the reason from a failing support check. On the other hand, if the reason could stay optional, I don't see a big difference between returning `false` and throwing an error.
@elf-pavlik I was trying to write such a snippet and, in the process of doing so, figured I am probably wrong. I was under the impression that @rubensworks' query might have been too easy to progressively simplify as it uses a common operator (`<=`).

However, assuming simplification can always be done from the outside to the inside, peeling back each layer of the query, perhaps even the most complex of queries could simply be dissected into supported sub-queries joined by in-memory implementations dealing with the unsupported outer expressions. Incidentally, I think this lays out a really elegant architecture for a recursive function that walks the expression tree and stops the recursion whenever it finds a supported expression.

@rubensworks - seems like you already had this proposal in your head!
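The recursive dissection described above could be sketched as follows; `supports` stands in for the proposed `supportsMatchExpression()` check, and the plan structure is purely illustrative:

```js
// Hypothetical sketch: walk an expression tree from the outside in. Subtrees
// accepted by `supports` are marked for push-down to the source; unsupported
// outer operators are marked for in-memory evaluation, and we recurse into
// their sub-expression arguments looking for further push-down candidates.
function planExpression(expression, supports) {
  if (supports(expression)) {
    return { pushDown: expression };
  }
  const subPlans = (expression.args || [])
    .filter((arg) => arg && arg.operator) // recurse into sub-expressions only
    .map((arg) => planExpression(arg, supports));
  return { inMemory: expression.operator, subPlans };
}
```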
I still favor having a `TermFilter`. You are mainly talking about the use case of a query engine with only known types of operations. I thought it might be good to try the concept; that's why I wrote a quick proof of concept, which you can find here.
First of all, thank you for your example.
True, although I think this would happen anyway in your proposal, too.
This is very true and part of the reason why I think we need different methods (i.e. we should not pollute the `match` method).
I am not sure I follow. As a developer who is thinking about how to write a query, I need a-priori knowledge of which operators are supported by the database I am working with. As a developer of the database itself, I need to choose which operators to optimize for from a list of known operators. I guess what I am trying to say is that custom filters are bound to be handled in-memory, as databases cannot optimize for filters they are not aware of. If we are to support optimization, we need to produce a list of available operators / expressions for which implementors might choose to optimize. All this said, I am strictly focusing on the low-level `Source` interface.
Yep, that's what I had in mind, you just explained it much better than I did 🙂.
👍
I agree with this.
@bergos I don't see them as limitations but more as bad ergonomics.
Considering the low-level, close-to-the-metal nature of the `Source` interface, there should be an explicit difference and a clear separation of concerns between what can be handled at the persistence level and what can be handled in-memory.
Do you mean comparing two terms of the same quad? It's even much easier than in your proposal:

```js
store.matchFilter(rdf.variable('s'), null, filter.gt(rdf.variable('s')))
```
No, you can ask beforehand with the additional check method.
That was never the intention of the `TermFilter` proposal.
There is already the `match` method for that. I really would like to have more advanced filtering in the spec.
I agree with @jacoscaz. I don't see any limitations with your proposal. I consider both proposals as having a similar level of expressivity.
How so? AFAICS they are both as expressive. Unless I'm missing something?
I definitely agree that this (proposal 1) is easier to write than something like this (proposal 2):

```js
source.matchExpression(variable('s'), null, variable('o'), null, {
  operator: '<=',
  args: [
    variable('o'),
    variable('s')
  ]
});
```

However, I think we're comparing different levels of abstraction here. As I see it, the main (remaining) difference between proposals 1 and 2 is how they handle expressions that are not supported by the source. Am I correct in my analysis of these differences?

An important advantage of using proposal 2 as a basis, which I just thought of, is the fact that it can easily cross programming-language borders, due to it being purely declarative and not using any JS functions. This would be extremely important when running within Web browsers and you want to delegate certain filtering to a WASM source, or when running in Node.js and you need to communicate with C/C++ over NAN.
Don't worry, we'll get there 🙂
I personally do not find what you suggest any easier to grasp than having the expression represented separately from the terms. This is probably a simple matter of taste and/or habit, though. In terms of expressive power, I cannot see any difference between the two.
That is not a direction I would like to take. I appreciate what RDF/JS is trying to achieve and I am trying to contribute to the same goal. I might argue for or against specific solutions, even passionately at times, but that's because I honestly believe they represent superior compromises. I hope I have not come across as exceedingly or needlessly confrontational; if so, that was not my intention and I apologize. I fundamentally agree with the following points @rubensworks has just made:
Sorry, but I'm a bit tired of repeating things again and again, spending time to show alternatives while nobody takes the time to have a look. It seems to me that some people are only interested in making the interface as close as possible to their existing code and ignore other cases. First, one important fact: everyone has to implement the `match` method, no matter whether it runs accelerated or not. That's already part of the current spec. Adding support for `TermFilter` would be very easy, and I showed that already in my PoC.
The `TermFilter` proposal contains all the properties of the other proposal; the only difference is an additional method. That means you have your declarative structure. And again you are focusing on query engines. There are many use cases where a simple match is enough; still, the `TermFilter` proposal would contain everything required for your query-engine use case.
You have all the properties for code that runs accelerated. Anyway, you have to translate the Quads back into JS to have full-featured RDF/JS Quads. The RDF/JS spec has been written for JavaScript and it was clearly stated that the spec focuses on JS only. I agree that we may align it a little with WASM use cases if possible, but it doesn't make any sense to give up a consistent API for a filter when the Quads themselves don't have a plain-object structure. If there is any benefit to a plain object, it doesn't matter much compared to the Quads. I haven't done much WASM work yet, but I guess it's a common pattern to convert between plain objects and full-featured classes in both directions.
In the interest of clarity and transparency, I did look at your example and I am not trying to make the interface as close as possible to my code as I don't even have a well-established codebase to refer to.
Agreed and I think we are all in agreement about this.
I agree that this is easy, although it still doesn't seem practical or ergonomic to me. I have already explained why and I don't think it would help if I were to repeat myself. I understand we do not agree on this and I am happy to come to your side if this means pushing this matter forward. The expressive power of the two solutions is still the same AFAIU. In the spirit of building consensus towards a solution we can all agree with, a few questions:
I think it would be helpful if we picked 6 different real-world examples with increasing levels of complexity and then wrote equivalent snippets using both of the discussed approaches. Based on that, we may have better ground to evaluate the pros and cons of those alternatives. Any of you three could propose the first (simplest) example, and then the next person would add another, slightly more complex one, and so on.
@elf-pavlik that's a worthy exercise, I think. Here's a very simple example:

```js
source.matchExpression(null, namedNode('lastSeen'), variable('o'), null, {
  type: '>=',
  args: [
    variable('o'),
    literal('2020-01-01', 'xsd:dateTime'),
  ],
});
```

```js
source.matchFilter(
  null,
  namedNode('lastSeen'),
  filters.gte(literal('2020-01-01', 'xsd:dateTime')),
  null,
);
```

At this level of complexity, I think the two alternatives are almost 100% equivalent in terms of ergonomics and practicality. Actually, at this level of complexity I think @bergos' form is better as it is more consistent with the rest of the spec, although I would still maintain that the declarative `matchExpression` form scales better with complexity.
I'm sorry you feel that way.
Easy from whose perspective?
Almost, there's still the difference in how the expressions are represented.
Indeed, and that's the point we disagree on. I'm of the opinion that this test method can easily be moved to a higher level. Another major problem with this test method I just realized is that it requires evaluating a JS function, which cannot cross programming-language borders.
The difference with the
The list of operators seems like a must to me indeed.
Yes, I agree that this would be nice to have.
That may indeed be a helpful exercise :-) Let me continue upon @jacoscaz's snippet:

```js
source.matchExpression(null, namedNode('lastSeen'), variable('o'), null, {
  type: '&&',
  args: [
    {
      type: '>=',
      args: [
        variable('o'),
        literal('2020-01-01', 'xsd:dateTime'),
      ],
    },
    {
      type: '<=',
      args: [
        variable('o'),
        literal('2021-01-01', 'xsd:dateTime'),
      ],
    }
  ],
});
```

```js
source.matchFilter(
  null,
  namedNode('lastSeen'),
  filters.and(
    filters.gte(literal('2020-01-01', 'xsd:dateTime')),
    filters.lte(literal('2021-01-01', 'xsd:dateTime'))
  ),
  null,
);
```

This is still doable in both approaches. Note that I don't consider the first option to be more complex than the second.

```js
source.matchExpression(null, namedNode('lastSeen'), variable('o'), null,
  filters.and(
    filters.gte(variable('o'), literal('2020-01-01', 'xsd:dateTime')),
    filters.lte(variable('o'), literal('2021-01-01', 'xsd:dateTime'))
  )
);
```

Perhaps we can try an expression with two variables as the next use case?
From my readme:
Maybe.
That doesn't matter much, because helper functions could be used so they look the same in the end. It's more important what happens after the method call. That should be compared with cases where only the `match` method is available.
Let me quote myself:
and
and
That's a comparison of apples and oranges. With the `TermFilter` proposal you would only forward the known filters to the WASM code, which means there is no need to know the logic of the custom filters there.
In one of the earlier comments I see one example; I've changed the operator to make comparing with the IRI of the subject more realistic:

```js
source.matchFilter(
  variable('s'),
  null,
  filter.eq(variable('o')),
  null
)
```

```js
source.matchExpression(variable('s'), null, variable('o'), null, {
  operator: '==',
  args: [
    variable('o'),
    variable('s')
  ]
});
```

Looks like something that finds statements where a node has some relation with itself. Could someone provide a more interesting example with two variables?

@jacoscaz in this example you provided some time ago I don't see how it would translate:

```js
source.matchExpression(variable('s'), null, variable('o'), null, {
  operator: 'AND',
  args: [
    { operator: '>=', args: [ variable('o'), variable('s') ] },
    { operator: 'REGEXP', args: [ variable('o'), literal('regex here') ] },
  ]
})
```
WRT the IRI issue, I am under the impression that RDF also supports "reverse" predicates, something like:
If this is the case, then subjects can be IRIs, blank nodes, literals, ... I am 100% sure I have used at least one store supporting reverse predicates, but that might have been an exception to the rule.
I had missed this as I was trying to quickly locate the bits of code I wanted to think about. Looks like we all agree on this point - good!
I disagree. This matters to me, although I respect the fact that it does not matter to you. I will build upon @elf-pavlik's example:

```js
source.matchExpression(variable('s'), null, variable('o'), variable('g'), {
  operator: '&&',
  args: [
    { operator: '==', args: [ variable('o'), variable('s') ] },
    { operator: '==', args: [ variable('g'), namedNode('http://some/graph') ] },
  ]
});
```

```js
source.matchFilter(
  variable('s'),
  null,
  filter.eq(variable('s')),
  filter.eq(namedNode('http://some/graph'))
)
```

With the first, more declarative style, the entirety of the expression is contained within the fifth parameter. This makes it a lot easier for me to read (and this is subjective) but also to parse into an expression tree. With the second style, I find it harder to understand what the combined effect of all filters is. Parsing them also seems a bit harder. Generally speaking, this is an example of what I refer to when I use the term ergonomics. However, the two could be combined as follows:

```js
source.matchExpression(variable('s'), null, variable('o'), variable('g'), filter.and(
  filter.eq(variable('o'), variable('s')),
  filter.eq(variable('g'), namedNode('http://some/graph')),
));
```

I find this rather appealing, and this form would also lend itself better to building the code that would concatenate all the filters.
I think the main point of contention is whether the two proposals differ in expressive power.

EDITED: fixed a leftover/redundant/bad use of `filter.eq`.
https://www.w3.org/TR/rdf11-concepts/#section-triples clearly states:
One could use tricks like:

```turtle
ex:jacosName
  a rdfs:Literal ;
  rdf:value "jacopo"@it .

ex:jacosName ex:firstNameOf <http://jacoscaz> .
```

and sort of get a literal in subject position, but I really don't see it used there directly.
@jacoscaz could you please use language tags in your snippets for syntax highlighting? I took the liberty of editing some of your recent comments: s/```/```js/g
I'm really happy to stand corrected! This wonderfully simplifies a number of things on my end. And yes, I will be more careful with using language tags. Thank you for fixing my posts.
I'm for closing this PR, because in most comments an approach is discussed that goes against the core idea of the RDF/JS specs of defining JavaScript-idiomatic APIs. If people would like to discuss the basic concepts of the RDF/JS spec, this PR is not the best place. Other PRs to add more advanced filtering are very welcome, but they should cover use cases spanning the full scope of the `Source` interface.
Good point, I've created #17 for this.
I would actually keep this PR open, and update it with the parts that we all agree on already.
@bergos and all, before closing this PR, I would kindly ask you to consider what follows. In my previous comment I have tried to reconcile the two approaches by proposing the following:

```js
source.matchExpression(variable('s'), null, variable('o'), variable('g'), filter.and(
  filter.eq(variable('o'), variable('s')),
  filter.eq(variable('g'), namedNode('http://some/graph')),
));
```

In my mind, this gives us:
I would also add:
Could this make for a good, if perhaps incomplete, compromise?
@bergos @rubensworks if you have time to do so I'd appreciate your feedback on the above. Then, if everyone agrees, I'll probably close this PR and open a new one starting from what we find ourselves in agreement about and the remaining open questions. |
@jacoscaz I agree with the above, except for the following:
I'm still of the opinion that this should become part of a different spec, or at least should become optional.
Same as above. (Apologies for the delayed answer)
As I stated in the issue about removing it:
Hi all. Coming back to this after a few busy weeks. The more I read and think about it, the more I agree with @bergos' proposal of having different interfaces. I'm going to quote a few paragraphs from the other discussion:
I do not agree with this limitation to the definition of a query engine. Selecting which index is better suited for a given query definitely falls within the scope of a query engine.
I agree with the fact that custom filters should be supported but, due to their very nature, they cannot be optimized for at the persistence level.
This necessarily requires a descriptive approach to representing filters in such a way as to be easily translated across languages. Not doing so would greatly limit the number of viable persistence solutions. Based on all this, it looks to me like custom filters and persistence-level optimization cannot be reconciled into a single interface. However, it also looks to me like they do fit within a hierarchy:
Both the base filtering source and the advanced filtering source can be designed to support idiomatic filters and the serialization of these filters into declarative data structures. Filter implementations would also have to be split across two levels:
What do you think, @bergos and @rubensworks? Also @elf-pavlik? Could this be a viable approach?
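As a purely illustrative sketch of such a hierarchy (class names and method shapes are assumptions, not spec):

```js
// Hypothetical two-level hierarchy. The base level handles a small,
// mandatory set of declarative operators; the advanced level adds
// detection of richer (possibly custom) expressions.
class BaseFilteringSource {
  matchExpression(subject, predicate, object, graph, expression) {
    // would return a stream of matching quads; stubbed here
    return [];
  }
}

class AdvancedFilteringSource extends BaseFilteringSource {
  supportsMatchExpression(subject, predicate, object, graph, expression) {
    return false; // implementations override per supported expression
  }
}
```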
@jacoscaz, the distinction between a base and advanced filter interface sounds like a good solution to me!
Anybody else? I should be able to work on a new PR in the next few days but I wouldn't want to spend too much time on it without a decent chance at reaching some level of consensus. |
@jacoscaz Ok if we close this? And work on a new version in https://github.com/rdfjs/query-spec/ ?
@rubensworks yes, absolutely!
This is a very early draft aimed at adding support for range-based matching to the `Source` interface.

The rationale for this change is that the `Source` interface is often implemented by persistence libraries (such as `quadstore`) that are often used in contexts where being forced to perform in-memory filtering of quads can lead to unacceptable query latencies and CPU utilization, particularly when compared to the alternative of delegating such filtering to the persistence layer of choice (such as `leveldb` in the case of `quadstore`).

I am unfamiliar with the spec process, `ReSpec` and `WebIDL`, and I am uncertain as to how to represent the fact that `RangeSource` extends `Source`, overriding `match()` to add support for `Range` arguments. This PR is meant as a starting point to push the conversation forward on these matters. All feedback is welcome.

Thank you @bergos for pitching in on Gitter!