Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document viewer #486

Open
wants to merge 38 commits into
base: main
Choose a base branch
from
Open

Document viewer #486

wants to merge 38 commits into from

Conversation

blakerosenthal
Copy link
Contributor

  • Add API endpoints for document content
  • Add source content viewer to UI

Closes #466

Comment on lines 226 to 235
with get_session() as session:
_, metadata = database.get_document(session, user=user, id=id)
if "path" not in metadata:
raise HTTPException(
status_code=400,
detail="Document path not found",
)
with aiofiles.open(metadata["path"], "rb") as file:
while content := await file.read(1024):
yield content
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our documents have a .read() method that should be used here:

@abc.abstractmethod
def read(self) -> bytes: ...

It is not async for now, but there is no need to require documents to be on the local file system. We just need the conversion to the Document object

def to_core(self) -> ragna.core.Document:
return ragna.core.LocalDocument(
id=self.id,
name=self.name,
# TEMP: setting an empty metadata dict for now.
# Will be resolved as part of the "managed ragna" work:
# https://github.com/Quansight/ragna/issues/256
metadata={},
)

Maybe this is also a good time to resolve this TODO first?

ragna/deploy/_ui/api_wrapper.py Outdated Show resolved Hide resolved
@blakerosenthal
Copy link
Contributor Author

@pmeier I have two outstanding questions before being ready for a full review:

  1. Are we only supporting PDFs right now, or do we want to include all supported types? It doesn't seem too difficult to just send any blob to the browser and let the browser decide whether to display it or offer it as a download.
  2. I'm currently just calling the Document.read() method and sending the entire blob to the browser wrapped in a fastapi.Response. I'd be interested in your thoughts on how this should be chunked and streamed properly.

Here's a video of how it looks so far.

Screencast.from.2024-08-26.11-33-04.mp4

@pmeier
Copy link
Member

pmeier commented Aug 26, 2024

  1. Are we only supporting PDFs right now, or do we want to include all supported types? It doesn't seem too difficult to just send any blob to the browser and let the browser decide whether to display it or offer it as a download.

Does the browser handle this automatically? Meaning, we just pass it a blob and the browser figures out whether it can show it (.pdf, .md, .txt) or offer it as download (.pptx, .docx)? If yes, that would be awesome and we should really have this feature for everything.

  1. I'm currently just calling the Document.read() method and sending the entire blob to the browser wrapped in a fastapi.Response. I'd be interested in your thoughts on how this should be chunked and streamed properly.

FastAPI has a FileResponse that handles this properly. Downside is that this requires a file on disk to work. Short-term I'd be ok to just read the data and write it to a temporary file that we remove in a background task after the response has gone out. Long-term we need to build our own response type that does the same, but starts from bytes or an iterable thereof.

Here's a video of how it looks so far.

Looks good. #466 also states that

Ideally the new pane would scroll to and highlight the exact source content in the file.

Is that possible / planned?

@blakerosenthal
Copy link
Contributor Author

blakerosenthal commented Aug 26, 2024

Does the browser handle this automatically? Meaning, we just pass it a blob and the browser figures out whether it can show it (.pdf, .md, .txt) or offer it as download (.pptx, .docx)? If yes, that would be awesome and we should really have this feature for everything.

I thiiink so as long as the right mimetype is specified in the response header. I will confirm and let you know what I find out.

FastAPI has a FileResponse that handles this properly. Downside is that this requires a file on disk to work. Short-term I'd be ok to just read the data and write it to a temporary file that we remove in a background task after the response has gone out. Long-term we need to build our own response type that does the same, but starts from bytes or an iterable thereof.

Okay, I'd be interested in seeing the level of effort on building our own. It seems like FileResponse is mostly just a wrapper around StreamingResponse that sets the right headers.

Ideally the new pane would scroll to and highlight the exact source content in the file.

Is that possible / planned?

I need to think a bit more about how this could work. Might be best suited for a followup PR.

@blakerosenthal blakerosenthal marked this pull request as ready for review August 29, 2024 15:35
@blakerosenthal
Copy link
Contributor Author

@pmeier This is ready for another look! This is was the source viewer looks like now after putting the accordion widget back and adding a new button. And I added some MIME types to our supported documents so the browser knows what to do with them when it receives the blob. On my browser PDFs open in a new tab, text is displayed as HTML, and Word and Powerpoints are downloaded. For sources with page numbers there's also now an anchor that scrolls the view to the right page for in-browser sources.
Screenshot from 2024-08-29 08-37-16
Screenshot from 2024-08-29 08-37-30

ragna/core/_document.py Outdated Show resolved Hide resolved
ragna/deploy/_api/schemas.py Outdated Show resolved Hide resolved
ragna/deploy/_ui/central_view.py Outdated Show resolved Hide resolved
ragna/deploy/_ui/central_view.py Outdated Show resolved Hide resolved
Copy link
Member

@pmeier pmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thanks Blake! One question though: in case the browser downloads the file, I get something like "{uuid.UUID}.docx". Why isn't this the proper file name?

@blakerosenthal
Copy link
Contributor Author

This looks great, thanks Blake! One question though: in case the browser downloads the file, I get something like "{uuid.UUID}.docx". Why isn't this the proper file name?

That's a typo, thanks for catching!

@@ -24,6 +24,15 @@ class DocumentUploadParameters(BaseModel):
data: dict


_MIME_TYPES = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, let's use mimetypes from the standard library.

):
self.id = id or uuid.uuid4()
self.name = name
self.metadata = metadata
self.handler = handler or self.get_handler(name)
self.mime_type = mime_type or self.parse_mime_type(name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to store this in the DB as well. Otherwise any MIME type set by the user will be overridden as soon as we pull it from the DB, because the default would be used.

Base automatically changed from corpus-dev to main December 20, 2024 11:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants