Bug/WP-354: Workspace search - filter results visible to user #893

chandra-tacc · 2023-11-02T13:10:18Z

Overview

Workspace search is crashing when parsing results.
Root cause:

Results are returned from the search index rather than from the tapis api response. For example, some items in the search index do not have certain fields (like name) indexed.
Results are retrieved for all users. For example, searching for "test" brings back various results from the search index that should not be visible to the user.

Changes

Use the search results to find the matching projects from the api, and return the api result json instead of what is indexed.
Filter search results to what is visible to the user.
Add protection in the js code to avoid the crash.
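
For readers skimming the thread, the first two changes can be sketched roughly as below. This is a minimal illustration and not the merged code: `filter_hits_by_listing` and the sample dictionaries are hypothetical, while in the real view the hits come from the search index and the listing comes from `list_projects(client)`.

```python
def filter_hits_by_listing(hit_ids, listing):
    """Return API records (not index documents) for projects the user can see.

    hit_ids: project ids returned by the search index; may include
             projects that belong to other users.
    listing: project dicts from the api listing for the current user.
    """
    wanted = set(hit_ids)
    # Keep the api json for each project whose id also appeared in the search hits.
    return [prj for prj in listing if prj.get("id") in wanted]


# Hypothetical sample data, for illustration only.
index_hits = ["proj-1", "proj-7", "proj-9"]  # proj-7 and proj-9 belong to other users
user_listing = [
    {"id": "proj-1", "title": "test data"},
    {"id": "proj-2", "title": "unrelated"},
]
visible = filter_hits_by_listing(index_hits, user_listing)
```

Because the returned items are taken from the listing, fields that are missing from old index documents (such as name) no longer matter to the client.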

Testing

Local testing:

with search match
with no matches
with no query

Testing on dev.cep:

Crash without the fix: search for a commonly used file name, "test", which brings back test results from other users.
No crash with the fix, and results are specific to the user. See recording below:

with_fix.mov

UI

No change

Notes

There were no tests written for this feature previously; I will start working on that separately.

codecov · 2023-11-02T13:12:22Z

Codecov Report

Merging #893 (681aac7) into main (b51c41f) will decrease coverage by 0.04%.
The diff coverage is 11.11%.

@@            Coverage Diff             @@
##             main     #893      +/-   ##
==========================================
- Coverage   63.44%   63.41%   -0.04%     
==========================================
  Files         432      432              
  Lines       12383    12389       +6     
  Branches     2576     2579       +3     
==========================================
  Hits         7856     7856              
- Misses       4317     4323       +6     
  Partials      210      210

Flag	Coverage Δ
javascript	`69.76% <100.00%> (ø)`
unittests	`56.94% <0.00%> (-0.06%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
...es/DataFilesProjectsList/DataFilesProjectsList.jsx	`94.87% <100.00%> (ø)`
...orkspace_operations/shared_workspace_operations.py	`17.79% <ø> (ø)`
server/portal/apps/projects/views.py	`33.92% <0.00%> (-1.93%)`	⬇️

shayanaijaz

I haven't gotten the chance to test this locally yet but it looks like this will work. I had a suggestion for an alternate solution and would like to know your thoughts.

The solution I'm thinking of would follow closely what is done currently for data file searches in operations.py.

If there are current project fields that are not being indexed then we should add those to the IndexedProject document (though that shouldn't be the case since all fields in a project are used when indexing).

We can then add a pattern analyzer for the id field in IndexedProject document and add a search filter that would look something like
search = search.filter('prefix', **{'id._pattern': f'{settings.PORTAL_PROJECTS_SYSTEM_PREFIX}.*'})

This should filter the projects correctly and retrieve only the projects for the current user and portal. This would also avoid us having to get the list of projects again and do filtering on it when performing a search. But then again, I don't think the number of projects gets very large (I could be wrong) so this approach should still be fine. Let me know what you think

rstijerina

LGTM

chandra-tacc · 2023-11-02T18:00:18Z

Thanks for the references. The tapis system restriction will ensure the results are scoped to the system. And for files, there is a check to see if the user has access to a path.
In the case of workspaces, I'll try to use the system and see if that resolves the issue. My understanding from reading the code is that the search index will return all indexed workspaces in that system, which could be from other users. Hence the filtering.

Regarding fields missing in the index, I do not know how it ended up in that state. From reading the search results, it was an old workspace; I'm not sure if something was migrated or indexed using an older mechanism. Still, your idea of a pattern analyzer is useful. Will try it out.

nathanfranklin · 2023-11-02T21:35:48Z

server/portal/apps/projects/views.py

+            if hits:
+                client = request.user.tapis_oauth.client
+                listing = list_projects(client)
+                filtered_list = filter(lambda prj: prj['id'] in hits, listing)


[not tested or reviewed] But would filtering the results break the usage of the offset/limit params? For example, a user could get filtered results that were fewer than the requested limit, which implies that there are no more search results (i.e. no reason to bump offset) when there could be more search results.

so #893 (review) could be a good approach
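
The offset/limit concern above can be illustrated with a small, self-contained sketch. Everything here is hypothetical (`search_page`, the ids, the page size); it only shows what happens when a page is taken from the index first and filtered by visibility afterwards.

```python
def search_page(index_ids, visible_ids, offset, limit):
    """Take one page from the index, then drop hits the user cannot see."""
    page = index_ids[offset:offset + limit]
    return [pid for pid in page if pid in visible_ids]


index_ids = [f"p{i}" for i in range(10)]  # 10 index hits across all users
visible_ids = {"p1", "p8"}                # only two are visible to this user

first_page = search_page(index_ids, visible_ids, offset=0, limit=5)
second_page = search_page(index_ids, visible_ids, offset=5, limit=5)
```

The first page comes back with fewer items than the limit, which a client would normally read as "no more results", yet the second page still holds a match.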

@nathanfranklin
Yes, that is a good point. It only shows results within the offset. list_projects by default gets results by offset - 50. And so, it will have fewer results than the search.
The search index has results for all users.
Using settings.PORTAL_PROJECTS_SYSTEM_PREFIX (mentioned in the comment above) will restrict by system and NOT by user. So, it will still return results for all users.

So, any search will have to check whether a workspace is available to a user. Access controls, for example whether a workspace is owned or shared, are handled by tapis.
So, to provide accurate results, the options are:

Use tapis results: retrieve all workspaces for a user without a filter.
Pros: This will work.
Cons: Defeats the purpose of offset and search. Bad perf.

Replicate the access control logic for search in the portal: use the search results and filter them locally for the current user. I haven't checked how sharing is organized.
Pros: Better perf.
Cons: Unclear whether sharing can be checked, and it duplicates access control logic.

Also, please let me know if I'm wrong about system-prefix-based search: it only filters by system and not by user. It is tough to test this locally with only one user in my local dev env.
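
To make the system-versus-user point concrete, here is a small simulation in plain Python. It does not use the real index or the search library; `SYSTEM_PREFIX`, the documents, and `prefix_search` are all made-up stand-ins, with `SYSTEM_PREFIX` playing the role of settings.PORTAL_PROJECTS_SYSTEM_PREFIX.

```python
SYSTEM_PREFIX = "cep.project"  # hypothetical value, stands in for the real setting

# Hypothetical index documents; the real ones carry more fields.
indexed = [
    {"id": "cep.project.PRJ-1", "owner": "alice", "title": "test run"},
    {"id": "cep.project.PRJ-2", "owner": "bob", "title": "test data"},
    {"id": "old.system.PRJ-3", "owner": "alice", "title": "test legacy"},
]


def prefix_search(docs, query, prefix):
    """Match the query on title and restrict hits to ids under the system prefix."""
    return [d for d in docs if d["id"].startswith(prefix) and query in d["title"]]


results = prefix_search(indexed, "test", SYSTEM_PREFIX)
result_ids = [d["id"] for d in results]
owners = {d["owner"] for d in results}
```

The prefix drops the document from the older system, but the remaining hits still span more than one owner, so a per-user check is still needed.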

Regarding the data files search mentioned above: a listing request is sent to tapis to ensure the user has access before proceeding to search. So, it does both a search and a tapis listing.

I tested with filtering on the system prefix. The good part is that it does not crash the UI. Just as suspected, the previous search was bringing in results from an older system, which had a different index, hence the crash.

And as expected, it is bringing in results from other users in the system. Even though I have no access, I can see others' projects listed. I have 3 projects that match the test keyword, but it brings in all the other projects too.

One possible solution: store user metadata in the index. In addition to the owner, store the list of users with access, essentially something like owner: {}, users: [{}].
I do not see the tapis api listing all other users with access, so every time a user accesses a project, the index needs to be updated with that user added to the list.
The old index for projects stored PIs and Co-PIs.

The only downside of this (as with every Elasticsearch index) is data consistency between the index and the db. If tapis is updated outside the portal, for example a project is unshared, then the index could go out of sync. I do not know if this is an important scenario, but it is something for us to be aware of in case a portal has strict data access controls.
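
A rough sketch of the idea and of the consistency risk, using plain dictionaries rather than the real index. The document shape is simplified (the comment above suggests objects for owner and users, here they are plain usernames), and `index`, `tapis_access`, and `visible_to` are hypothetical names.

```python
# Hypothetical index documents that record who can access each project.
index = {
    "PRJ-1": {"owner": "alice", "users": ["alice", "bob"]},
    "PRJ-2": {"owner": "carol", "users": ["carol"]},
}


def visible_to(index_docs, username):
    """Project ids whose indexed user list contains the given username."""
    return sorted(pid for pid, doc in index_docs.items() if username in doc["users"])


before = visible_to(index, "bob")

# Suppose bob's access to PRJ-1 is revoked in tapis outside the portal,
# so the index is never told about it.
tapis_access = {"PRJ-1": {"alice"}, "PRJ-2": {"carol"}}
stale = [pid for pid in visible_to(index, "bob") if "bob" not in tapis_access[pid]]
```

Here `stale` lists projects the index would still show to bob even though tapis no longer grants access, which is the out-of-sync scenario described above.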

Note: I'll bring back the previous code, which cross-checks tapis against the search results, and keep the system check.

Since we are only filtering the index based on the user query and system prefix, I can see the issue that Chandra encountered happening, where results from other users are returned. I don't think it would be feasible to replicate and manage the system access control in the index; I would rather have tapis remain the source of truth. So I think Chandra's current solution of getting a list of the user's projects should work fine.

With regard to offset and limit, I don't think we are doing any kind of pagination on the project listing page, so they are not really needed. For reference, no offset or limit parameter is sent in fetchProjectListing, and nothing is configured for infinite scrolling on the project listing page. The default tapis limit when getting systems is 100, so I guess that's the only one used. This line might not even be needed at all.

excellent. so drop offset/limit from endpoint 👍

> The default tapis limit when getting systems is 100

So we need to check if list_projects returns a subset or all of the projects.

list_projects does a tapis getSystems call which by default returns 100 projects which should be enough I think? Also according to the docs we can set the limit to -1 to get all the projects which might work better for us

> according to the docs we can set the limit to -1 to get all the projects which might work better for us

perfect. great that they have that built-in with the -1. yeah, we'll definitely exceed 100 for a user at some point.
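
A small sketch of why the default cap matters once a user passes 100 projects. `FakeSystems` is a hypothetical stand-in written for this example, not the tapis client; it only mimics the two behaviours mentioned in this thread (a default limit of 100 and -1 meaning "no limit"), and the real getSystems signature may differ.

```python
class FakeSystems:
    """Hypothetical stand-in for the tapis systems api, for illustration only."""

    def __init__(self, systems, default_limit=100):
        self._systems = systems
        self._default_limit = default_limit

    def getSystems(self, limit=None):
        # Assumed behaviour: no limit given means the default cap, -1 means everything.
        if limit is None:
            limit = self._default_limit
        return self._systems if limit == -1 else self._systems[:limit]


def list_all_projects(systems_api):
    """List every project system by asking for limit=-1."""
    return systems_api.getSystems(limit=-1)


api = FakeSystems([{"id": f"prj-{i}"} for i in range(150)])
default_count = len(api.getSystems())
all_count = len(list_all_projects(api))
```

With the default limit, any search hit that falls past the first 100 listed projects would be silently dropped by the cross-check.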

What heuristic has been used in the past to decide whether offset/pagination is needed on a list? Should we check perf, or ask the tapis team for suggestions, so we do not overload their services with large data requests?
Regarding the UI and response payload, the information returned for a project is not huge like it is for jobs: we have name, id, owner, and a small amount of metadata.
Each job has 90+ attributes, whereas a project has < 10 attributes (250 B per project in the payload). If that is a good enough criterion to justify using -1, then I can go ahead and use it. The memory size for 100 projects would be approx. 30 kB.
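
A quick back-of-the-envelope check of the estimate, using only the 250 B per project figure from this comment. The 1000-project case is a hypothetical added for scale, and `payload_kb` is a made-up helper.

```python
BYTES_PER_PROJECT = 250  # per-project payload estimate quoted above


def payload_kb(project_count, bytes_per_project=BYTES_PER_PROJECT):
    """Approximate raw payload size in kB (1 kB = 1000 B)."""
    return project_count * bytes_per_project / 1000


size_100 = payload_kb(100)
size_1000 = payload_kb(1000)  # hypothetical heavier user
```

This gives about 25 kB of project data for 100 projects, the same order of magnitude as the approx. 30 kB figure, and about 250 kB for 1000 projects.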

Used -1 as limit.

shayanaijaz

LGTM. Thanks for making the changes

Workspace search - result filtering

7ae1f4d

chandra-tacc added 3 commits November 2, 2023 08:22

Merge branch 'main' into bug/WP-354-search-crash

a87745e

add comments

706847d

Add protection in client side processing

f4ee9f5

chandra-tacc marked this pull request as ready for review November 2, 2023 16:19

chandra-tacc requested review from rstijerina and edmondsgarrett November 2, 2023 16:33

shayanaijaz reviewed Nov 2, 2023

rstijerina approved these changes Nov 2, 2023

nathanfranklin reviewed Nov 2, 2023

chandra-tacc added 3 commits November 2, 2023 18:41

Test with system prefix filter on id field

c2830e9

Add back search filtering with tapis listing

a8232d6

Use -1 as limit for listing projects

681aac7

shayanaijaz approved these changes Nov 6, 2023

chandra-tacc changed the title ~~Workspace search - result filtering~~ Bug/WP-354: Workspace search - filter results visible to user Nov 7, 2023

chandra-tacc merged commit 7c0ba55 into main Nov 7, 2023
4 of 6 checks passed

chandra-tacc deleted the bug/WP-354-search-crash branch November 7, 2023 19:41
