Bug/WP-354: Workspace search - filter results visible to user #893
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##             main     #893      +/-   ##
==========================================
- Coverage   63.44%   63.41%   -0.04%
==========================================
  Files         432      432
  Lines       12383    12389       +6
  Branches     2576     2579       +3
==========================================
  Hits         7856     7856
- Misses       4317     4323       +6
  Partials      210      210
```
I haven't had the chance to test this locally yet, but it looks like it will work. I have a suggestion for an alternate solution and would like to know your thoughts.
The solution I'm thinking of would closely follow what is currently done for data file searches in operations.py.
If there are project fields that are not being indexed, we should add those to the IndexedProject document (though that shouldn't be the case, since all fields in a project are used when indexing).
We can then add a pattern analyzer for the id field in the IndexedProject document and add a search filter that would look something like:
`search = search.filter('prefix', **{'id._pattern': f'{settings.PORTAL_PROJECTS_SYSTEM_PREFIX}.*'})`
This should filter the projects correctly and retrieve only the projects for the current user and portal. It would also avoid having to fetch the list of projects again and filter it when performing a search. Then again, I don't think the number of projects gets very large (I could be wrong), so the current approach should still be fine. Let me know what you think.
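To make the proposed filter concrete, here is a minimal sketch that builds the Elasticsearch query body as a plain dict, so it can be inspected without a cluster. The `id._pattern` subfield and the prefix value are assumptions taken from this comment, not verified settings.

```python
# Sketch (not the actual implementation): construct the query body the
# proposed prefix filter would produce. "id._pattern" is the hypothetical
# subfield created by the pattern analyzer; "cep.project" stands in for
# settings.PORTAL_PROJECTS_SYSTEM_PREFIX.

def build_project_search_body(query_string, system_prefix):
    """Combine a free-text query with a prefix filter on the project id."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"query_string": {"query": query_string}},
                ],
                "filter": [
                    # Keep only projects whose id starts with the portal's prefix.
                    {"prefix": {"id._pattern": f"{system_prefix}."}},
                ],
            }
        }
    }

body = build_project_search_body("test", "cep.project")
```

The same body is what `elasticsearch_dsl`'s `search.filter('prefix', ...)` call would serialize to under the hood.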
LGTM
Thanks for the references. The Tapis system restriction will enforce that results are scoped to the system, and for files there is a check that the user has access to a path. Regarding the fields missing from the index, I do not know how it ended up in that state; from reading the search results, it was an old workspace, so something may have been migrated or indexed using an older mechanism. Still, your idea about the pattern analyzer is useful. I will try it out.

```python
if hits:
    client = request.user.tapis_oauth.client
    listing = list_projects(client)
    filtered_list = filter(lambda prj: prj['id'] in hits, listing)
```
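The diff above intersects the search hits with the user's own Tapis listing. A self-contained sketch of that cross-check, with made-up project ids, looks like this:

```python
# Minimal sketch of the cross-check in the diff above: keep only projects
# that both (a) matched the search index and (b) appear in the user's own
# Tapis listing. All data here is hypothetical.

hits = {"cep.project.PRJ-1", "cep.project.PRJ-3"}   # ids returned by the search index
listing = [                                          # projects the user can actually access
    {"id": "cep.project.PRJ-1", "title": "Alpha"},
    {"id": "cep.project.PRJ-2", "title": "Beta"},
    {"id": "cep.project.PRJ-3", "title": "Gamma"},
]

filtered_list = [prj for prj in listing if prj["id"] in hits]
# Only PRJ-1 and PRJ-3 survive: they matched the search AND the user has access.
```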
[not tested or reviewed] but would filtering the results break the usage of the offset/limit params? A user would get filtered results that were fewer than the requested limit, which implies there are no more search results (i.e. no reason to bump the offset) when there could in fact be more. So #893 (review) could be a good approach.
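The pagination concern can be illustrated with a toy example: post-filtering a page of search results can return fewer than `limit` items even though more accessible matches exist beyond that page, which a client may misread as the end of the results.

```python
# Toy illustration of the offset/limit concern raised above. Data is made up.

search_results = [f"prj-{i}" for i in range(20)]     # 20 total matches in the index
accessible = {f"prj-{i}" for i in range(0, 20, 2)}   # the user can access only even ids

limit, offset = 5, 0
page = search_results[offset:offset + limit]         # the index returns a full page of 5
visible = [p for p in page if p in accessible]       # post-filtering drops some of them

# visible has only 3 items even though the client asked for 5, so a client
# that stops paging when it receives fewer than `limit` results would miss
# the accessible matches sitting on later pages.
```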
@nathanfranklin
Yes, that is a good point. It only shows results within the offset. list_projects by default retrieves results in pages of 50, so it can return fewer results than the search does. The search index has results for all users.
Using settings.PORTAL_PROJECTS_SYSTEM_PREFIX (mentioned in the comment above) will restrict the system and NOT the user, so it will still return results for all users. Any search therefore has to check whether a workspace is available to the current user. Access controls are handled by Tapis, for example whether a workspace is owned or shared.
So, to provide accurate results, we have two options:
- Use the Tapis results: retrieve all of a user's workspaces without a filter.
  Pros: this will work.
  Cons: it defeats the purpose of offset and search, and performance is bad.
- Replicate the access control logic for search in the portal: use the search results and filter them locally for the current user. I haven't checked how sharing is organized.
  Pros: better performance.
  Cons: it is unclear whether sharing can be checked this way, and it duplicates access control logic.
Also, please let me know if I'm wrong that prefix-based search only filters by system and not by user. It is tough to test this locally with only one user in my local dev environment.
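The point that a system-prefix filter scopes by system rather than by user can be shown directly: every project on the portal shares the same id prefix regardless of owner, so the prefix check alone passes other users' projects through. Ids and owners below are made up.

```python
# Sketch: a system-prefix filter cannot scope results to a user, because all
# projects on the portal share the prefix regardless of who owns them.

projects = [
    {"id": "cep.project.PRJ-1", "owner": "alice"},
    {"id": "cep.project.PRJ-2", "owner": "bob"},
    {"id": "cep.project.PRJ-3", "owner": "alice"},
]

prefix_matches = [p for p in projects if p["id"].startswith("cep.project.")]
# All three projects pass, including bob's, so a per-user check against
# Tapis is still required to avoid leaking other users' workspaces.
```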
Regarding the data files search mentioned above: a listing request is sent to Tapis to ensure the user has access before proceeding to search, so it does both a search and a Tapis listing.
I tested with filtering on the system prefix. The good part is that it does not crash the UI; just as suspected, the previous result was bringing in results from an older system, which had a different index, hence the crash.
And as expected, it brings in results from other users on the system. Even though I have no access, I can see other users' projects listed. I have 3 projects that match the "test" keyword, but it brings in all the other projects as well.
One possible solution: store user metadata in the index. In addition to the owner, store the list of users with access, essentially something like: owner: {}, users: [{}].
I do not see a Tapis API that lists all other users with access, so every time a user accesses a project, the index would need to be updated to add that user to the list. The old project index stored PIs and Co-PIs.
The only downside of this (as with every Elasticsearch index) is data consistency between the index and the database. If Tapis is updated directly, say a project is unshared, but not via the portal, then the index could go out of sync. I do not know if this is an important scenario, but it is something to be aware of in case a portal is strict about data access controls.
Note: I'll bring back the previous code that cross-checks Tapis against the search results, while keeping the system check.
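A sketch of the proposed index document shape, with the owner plus a list of users with access, maintained lazily as users open the project. The field names are illustrative, not an existing schema.

```python
# Hypothetical index document shape proposed above: owner: {}, users: [{}].
# Field names are illustrative, not part of the real IndexedProject schema.

indexed_project = {
    "id": "cep.project.PRJ-1",
    "title": "Example workspace",
    "owner": {"username": "alice"},
    "users": [{"username": "alice"}, {"username": "bob"}],
}

def record_access(doc, username):
    """Add a user to the access list the first time they open the project,
    since Tapis does not expose a listing of all users with access."""
    if username not in [u["username"] for u in doc["users"]]:
        doc["users"].append({"username": username})
    return doc

record_access(indexed_project, "carol")
# The doc now lists alice, bob, and carol; note the staleness caveat above:
# unsharing done directly in Tapis would never remove an entry here.
```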
Since we are only filtering the index based on the user query and system prefix, I can see the issue Chandra encountered happening, where results from other users are returned. I don't think it would be feasible to replicate and manage the system access control in the index; I would rather have Tapis remain the source of truth. So I think Chandra's current solution of getting a list of the user's projects should work fine.
With regards to offset and limit: I don't think we are doing any pagination on the project listing page, so they are not really needed. For reference, there is no offset or limit parameter sent in fetchProjectListing, and the project listing page has nothing configured for infinite scrolling. The default Tapis limit when getting systems is 100, so I guess that's the only limit in effect. This line might not even be needed at all.
excellent. so drop offset/limit from the endpoint 👍

> The default tapis limit when getting systems is 100

So we need to check whether list_projects returns a subset or all of the projects.
list_projects does a Tapis getSystems call, which by default returns 100 projects; that should be enough, I think? Also, according to the docs we can set the limit to -1 to get all the projects, which might work better for us.
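To show the limit semantics being discussed without a live Tapis server, here is a sketch with a stub standing in for the systems API. The stub's behavior (default page of 100, -1 meaning "no limit") mirrors what the comment above says the docs describe; it is not the real client.

```python
# Stub illustrating the getSystems limit semantics discussed above:
# a default limit of 100, and limit=-1 meaning "return everything".

class FakeSystemsApi:
    def __init__(self, systems):
        self._systems = systems

    def getSystems(self, limit=100, offset=0, **kwargs):
        # Per the Tapis docs referenced in the thread, -1 disables the limit.
        if limit == -1:
            return self._systems[offset:]
        return self._systems[offset:offset + limit]

api = FakeSystemsApi([{"id": f"cep.project.PRJ-{i}"} for i in range(150)])
default_page = api.getSystems()          # capped at 100 projects
everything = api.getSystems(limit=-1)    # all 150 projects
```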
> according to the docs we can set the limit to -1 to get all the projects which might work better for us

perfect. great that they have that built in with the -1. yeah, we'll definitely exceed 100 projects for a user at some point.
What heuristic was used in the past to decide whether offset/pagination is needed on a list? Should we check performance, or ask the Tapis team for guidance, so we don't overload their services with large data requests?
Regarding the UI and response payload: the information returned for a project is not huge like it is for jobs; we have name, id, owner, and a small amount of metadata.
Each job has 90+ attributes, versus fewer than 10 attributes for a project (about 250 B per project in the payload). If that is good enough criteria to justify using -1, then I can go ahead and use it. The payload size for 100 projects would be approximately 25 kB.
Used -1 as limit.
LGTM. Thanks for making the changes
Overview
Workspace search is crashing when parsing results.
Root cause:
Related
Changes
Testing
Local testing:
Testing on dev.cep:
Crash without the fix. Searched for a commonly used file name, "test", which brings in results from other users.
No crash with the fix, and results are specific to the user. See the recording below:
with_fix.mov
UI
No change
Notes
There were no tests written for this feature previously; I will start working on that separately.