Bug/WP-354: Workspace search - filter results visible to user #893

Merged
chandra-tacc merged 7 commits into main from bug/WP-354-search-crash
Nov 7, 2023

Conversation

chandra-tacc
Collaborator

@chandra-tacc chandra-tacc commented Nov 2, 2023

Overview

Workspace search is crashing when parsing results.
Root cause:

  • Results are returned from the search index and not from the tapis api response. For example, some items in the search index do not have certain fields (like name) indexed.
  • Results are retrieved for all users. For example, searching for "test" brings back various results from the search index that should not be visible to the user.

Related

Changes

  • Use the search results to find the matching items from the api, and return the api result json instead of what is indexed.
  • Filter search results to what is visible to the user.
  • Add protection in the js code to avoid a crash.
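
A minimal sketch of the first two changes, assuming the search index yields a list of project ids and the api listing yields dicts with an `id` key. The function name `filter_hits_by_listing` and the sample data are hypothetical and are not the code in this PR.

```python
def filter_hits_by_listing(hit_ids, listing):
    """Return api listing entries whose id appears in the search hits.

    The listing contains only what the user is allowed to see, so hits
    that belong to other users are dropped, and the caller receives the
    api json rather than the (possibly incomplete) indexed document.
    """
    wanted = set(hit_ids)  # set membership keeps each lookup cheap
    return [prj for prj in listing if prj["id"] in wanted]


# Hypothetical data: the index matched three projects, the user can see two of them.
hits = ["sys.project.A", "sys.project.B", "sys.project.other-user"]
listing = [
    {"id": "sys.project.A", "name": "test A"},
    {"id": "sys.project.B", "name": "test B"},
    {"id": "sys.project.C", "name": "unrelated"},
]
visible = filter_hits_by_listing(hits, listing)
```

Because the returned entries come from the listing, every result carries the fields the api provides, even when the indexed document is missing some of them.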

Testing

Local testing:

  1. with search match
  2. with no matches
  3. with no query

Testing on dev.cep:

  1. Crash without the fix. Search for a commonly used file name, "test", which brings back test results from other users.

  2. No crash with the fix and results are specific to the user. See recording below:

with_fix.mov

UI

No change

Notes

There were no tests written for this feature previously. I will start working on that separately.

codecov bot commented Nov 2, 2023

Codecov Report

Merging #893 (681aac7) into main (b51c41f) will decrease coverage by 0.04%.
The diff coverage is 11.11%.

@@            Coverage Diff             @@
##             main     #893      +/-   ##
==========================================
- Coverage   63.44%   63.41%   -0.04%     
==========================================
  Files         432      432              
  Lines       12383    12389       +6     
  Branches     2576     2579       +3     
==========================================
  Hits         7856     7856              
- Misses       4317     4323       +6     
  Partials      210      210              
| Flag | Coverage Δ |
|---|---|
| javascript | 69.76% <100.00%> (ø) |
| unittests | 56.94% <0.00%> (-0.06%) ⬇️ |

| Files | Coverage Δ |
|---|---|
| ...es/DataFilesProjectsList/DataFilesProjectsList.jsx | 94.87% <100.00%> (ø) |
| ...orkspace_operations/shared_workspace_operations.py | 17.79% <ø> (ø) |
| server/portal/apps/projects/views.py | 33.92% <0.00%> (-1.93%) ⬇️ |

@chandra-tacc chandra-tacc marked this pull request as ready for review November 2, 2023 16:19
Contributor

@shayanaijaz shayanaijaz left a comment

I haven't gotten the chance to test this locally yet but it looks like this will work. I had a suggestion for an alternate solution and would like to know your thoughts.

The solution I'm thinking of would follow closely what is done currently for data file searches in operations.py.

If there are current project fields that are not being indexed then we should add those to the IndexedProject document (though that shouldn't be the case since all fields in a project are used when indexing).

We can then add a pattern analyzer for the id field in IndexedProject document and add a search filter that would look something like
search = search.filter('prefix', **{'id._pattern': f'{settings.PORTAL_PROJECTS_SYSTEM_PREFIX}.*'})

This should filter the projects correctly and retrieve only the projects for the current user and portal. This would also avoid us having to get the list of projects again and do filtering on it when performing a search. But then again, I don't think the number of projects gets very large (I could be wrong) so this approach should still be fine. Let me know what you think

Member

@rstijerina rstijerina left a comment

LGTM

@chandra-tacc
Collaborator Author

chandra-tacc commented Nov 2, 2023

Thanks for the references. The tapis system restriction will ensure the results are scoped to the system, and for files there is a check to see whether the user has access to a path.
In the case of workspaces, I'll try to use the system and see if that resolves the issue. My understanding from reading the code is that the search index will return all indexed workspaces in that system, which could include those from other users. Hence the filtering.

Regarding the fields missing in the index, I do not know how it ended up in that state. Judging from the search results, it was an old workspace; I am not sure whether something was migrated or indexed using an older mechanism. Still, your idea of a pattern analyzer is useful. I will try it out.

if hits:
    client = request.user.tapis_oauth.client
    listing = list_projects(client)
    filtered_list = filter(lambda prj: prj['id'] in hits, listing)
Member

@nathanfranklin nathanfranklin Nov 2, 2023

[not tested or reviewed] but would filtering the results break the usage of the offset/limit params? like a user would get filtered results that were less than the requested limit which implies that there are no more search results (i.e. no reason to bump offset) when there could be more search results.

so #893 (review) could be a good approach
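
A small sketch of the concern above, assuming offset/limit are applied to the index hits before the visibility filter runs. The function `search_page` and the ids are hypothetical.

```python
def search_page(index_hits, visible_ids, offset, limit):
    """Slice the index hits by offset/limit, then drop hits the user cannot see."""
    page = index_hits[offset:offset + limit]
    return [hit for hit in page if hit in visible_ids]


# Hypothetical index order: the user's own projects are interleaved with others'.
index_hits = ["other-1", "other-2", "mine-1", "other-3", "mine-2", "mine-3"]
visible_ids = {"mine-1", "mine-2", "mine-3"}

first_page = search_page(index_hits, visible_ids, offset=0, limit=3)
second_page = search_page(index_hits, visible_ids, offset=3, limit=3)
```

Here `first_page` holds a single result even though the limit is 3, so a client that reads a short page as "no more results" would never request `second_page`, which still contains two matches.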

Collaborator Author

@nathanfranklin
Yes, that is a good point. It only shows results within the offset. list_projects by default gets results by offset - 50, so it will have fewer results than the search.
The search index has results for all users.
Using settings.PORTAL_PROJECTS_SYSTEM_PREFIX (mentioned in the comment above) will restrict by system and NOT by user, so it will still return results for all users.

So any search will have to check whether a workspace is available to the user. Access control, for example whether a workspace is owned or shared, is handled by tapis.
So, to provide accurate results, the options are:

  1. Use tapis results: retrieve all workspaces for a user without a filter.
    Pros: This will work.
    Cons: Defeats the purpose of offset and search. Bad perf.
  2. Replicate the access control logic for search in the portal: use the search results and filter them locally for the current user. I haven't checked how sharing is organized.
    Pros: Better perf.
    Cons: Unclear whether sharing can be checked, and it duplicates access control logic.

Also, please let me know if I'm wrong about system-prefix-based search; as I understand it, that only filters by system and not by user. It is tough to test this locally with only one user in my local dev env.
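
To illustrate the point about the system prefix, here is a rough in-memory sketch, not an Elasticsearch query. The document shape (`id`, `owner`) and the prefix values are assumptions made for the example.

```python
def prefix_filter(docs, system_prefix):
    """Keep indexed documents whose id starts with the given system prefix."""
    return [doc for doc in docs if doc["id"].startswith(system_prefix)]


# Hypothetical indexed documents from two systems and two owners.
indexed = [
    {"id": "portal.project.P-1", "owner": "alice"},
    {"id": "portal.project.P-2", "owner": "bob"},
    {"id": "legacy.project.P-9", "owner": "alice"},
]

scoped = prefix_filter(indexed, "portal.project")
owners = sorted({doc["owner"] for doc in scoped})
```

The prefix removes the document from the other system, but `owners` still contains both users, so a per-user check is needed on top of it.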

Collaborator Author

Regarding the data files search mentioned above: a listing request is sent to tapis to ensure the user has access before proceeding to search. So it does both a search and a tapis listing.

Collaborator Author

I tested filtering on the system prefix. The good part is that it does not crash the UI. Just as suspected, the previous query was bringing in results from an older system, which had a different index, hence the crash.

And as expected, it is bringing in results from other users in the system. Even though I have no access, I can see others' projects listed. I have 3 projects that match the test keyword, but it brings in all the other projects as well.

Screenshot 2023-11-02 at 6 55 43 PM

One possible solution: store user metadata in the index. In addition to the owner, store the list of users with access to the project. Essentially, something like this: owner: {}, users: [{}].
I do not see the tapis api listing all other users with access, so every time a user accesses a project, the index would need to be updated to add that user to the list.
The old index for projects stored PIs and CO-PIs.

The only downside of this (as for every Elasticsearch index) is data consistency between the index and the db. If tapis is updated outside the portal, for example a project is unshared, the index could go out of sync. I do not know if this is an important scenario, but it is something for us to be aware of in case a portal is strict about data access controls.
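
A rough sketch of the idea and its downside, assuming indexed documents shaped like owner: {}, users: [{}] with a hypothetical `username` key in each. The `current_access` mapping stands in for the access state held by tapis and is invented for the example.

```python
def can_see(doc, username):
    """True if the user is the owner or appears in the indexed user list."""
    if doc["owner"]["username"] == username:
        return True
    return any(user["username"] == username for user in doc["users"])


# Hypothetical index contents, written when P-1 was still shared with bob.
index = [
    {"id": "P-1", "owner": {"username": "alice"}, "users": [{"username": "bob"}]},
    {"id": "P-2", "owner": {"username": "carol"}, "users": []},
]

# Hypothetical source of truth after P-1 was unshared outside the portal.
current_access = {"P-1": {"alice"}, "P-2": {"carol"}}

from_index = [doc["id"] for doc in index if can_see(doc, "bob")]
stale = [pid for pid in from_index if "bob" not in current_access[pid]]
```

In this example the index still returns P-1 for bob, while the source of truth no longer grants access, which is the out-of-sync case described above.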

Note: I'll bring back the previous code, which cross-checks tapis with the search results, and keep the system check.

Contributor

Since we are only filtering the index based on the user query and system prefix, I can see how the issue that Chandra encountered happens, where results from other users are returned. I don't think it would be feasible to try to replicate and manage the system access control in the index, and I would rather have tapis remain the source of truth. So I think Chandra's current solution of getting a list of the user's projects should work fine.

With regards to the offset and limit stuff, I don't think we are doing any kind of pagination on the project listing page so the offset and limit are not really needed. For reference there is no offset or limit parameter sent in fetchProjectListing and on the project listing page there is nothing configured for infinite scrolling. The default tapis limit when getting systems is 100 so I guess that's the only one used. This line might not even be needed at all.

Member

@nathanfranklin nathanfranklin Nov 3, 2023

excellent. so drop offset/limit from endpoint 👍

The default tapis limit when getting systems is 100

So we need to check if list_projects returns a subset or all of the projects.

Contributor

list_projects does a tapis getSystems call, which by default returns 100 projects; that should be enough, I think? Also, according to the docs, we can set the limit to -1 to get all the projects, which might work better for us.

Member

according to the docs we can set the limit to -1 to get all the projects which might work better for us

perfect. great that they have that built-in with the -1. yeah, we'll definitely exceed 100 for a user at some point.
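
A sketch of why the default limit matters for the cross-check, using a stand-in function rather than the real client. `list_systems` is hypothetical; it only mimics the behavior described in this thread (default limit of 100, -1 meaning no limit).

```python
def list_systems(all_systems, limit=100):
    """Stand-in for a listing call: return at most `limit` items, or all if limit is -1."""
    if limit == -1:
        return list(all_systems)
    return list(all_systems[:limit])


# Hypothetical user with 150 projects; the search matched two of them.
user_projects = [{"id": f"project-{n}"} for n in range(150)]
hits = {"project-5", "project-120"}

default_matches = [
    prj["id"] for prj in list_systems(user_projects) if prj["id"] in hits
]
full_matches = [
    prj["id"] for prj in list_systems(user_projects, limit=-1) if prj["id"] in hits
]
```

With the default limit, the match beyond the first 100 projects is silently dropped from `default_matches`; with -1 both matches survive the cross-check.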

Collaborator Author

What heuristic has been used in the past to decide whether offset/pagination is needed on a list? Should we check perf, or ask the tapis team for suggestions, so that we do not overload their services with large data requests?
Regarding the UI and response payload, the information returned for a project is not huge like it is for jobs: we have name, id, owner, and a small amount of metadata.
Each job has 90+ attributes, whereas a project has < 10 attributes (250 B per project in the payload). If that is a good enough criterion to justify using -1, then I can go ahead and use it. The memory size for 100 projects would be approx. 30 kB.
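
A back-of-the-envelope sketch of the payload estimate, assuming the 250 B per project figure quoted above holds on average. The helper name and the project counts are illustrative only.

```python
BYTES_PER_PROJECT = 250  # per-project payload size quoted in this comment


def estimated_payload_bytes(n_projects, bytes_per_project=BYTES_PER_PROJECT):
    """Estimate the listing payload size as a simple product."""
    return n_projects * bytes_per_project


# Print the estimate for a few hypothetical project counts.
for n in (100, 1_000, 10_000):
    print(f"{n:>6} projects ~ {estimated_payload_bytes(n) / 1000:.0f} kB")
```

At 250 B each, 100 projects come to about 25 kB, the same order of magnitude as the approx. 30 kB mentioned above, and the estimate grows linearly with the number of projects.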

Collaborator Author

Used -1 as limit.

Contributor

@shayanaijaz shayanaijaz left a comment

LGTM. Thanks for making the changes

@chandra-tacc chandra-tacc changed the title Workspace search - result filtering Bug/WP-354: Workspace search - filter results visible to user Nov 7, 2023
@chandra-tacc chandra-tacc merged commit 7c0ba55 into main Nov 7, 2023
4 of 6 checks passed
@chandra-tacc chandra-tacc deleted the bug/WP-354-search-crash branch November 7, 2023 19:41