Upload Test Data Set #37
The SerDe stuff, to get DB docs to string and vice-versa, could be done with:
Trickier part is getting the dumped string into a file that we can serve to the user to download to their system. I don't think we can just write a file to the extension then create a link to that. Another option is to programmatically upload it to a website and point the user there to download it. This seems dumb, however, because the file may be very big and it needs a secure solution. Will look around for some way to create a file for download from an extension. Maybe it's not so difficult. Accepting the file to import should be doable with a file input and some container code passing it to …
Actually pretty easy. We can create a new URL for an in-memory blob (from the …)
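For illustration, a minimal sketch of this blob-URL download approach, assuming the dump is already available as a string in `dumpedString` (a placeholder name, as is the file name):

```js
// Sketch: turn an in-memory dump string into a downloadable file via an object URL.
// `dumpedString` stands in for whatever the dump step produces.
const blob = new Blob([dumpedString], { type: 'text/plain' })
const url = URL.createObjectURL(blob)

// Trigger the download from a temporary anchor element
const link = document.createElement('a')
link.href = url
link.download = 'db-dump.txt'
link.click()

// Revoke a bit later so the download isn't cut off before the browser takes over
setTimeout(() => URL.revokeObjectURL(url), 1000)
```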
From a UI POV, we could add this as a separate module in the settings with the title "Backup and Restore Database".
A bit more tedious, but maybe we can upload it to the user's Google Drive and give that URL to the user. It may be undesirable for large databases though. We could have it as an option maybe.
Great idea and feature upgrade!
Good bit of initial progress on looking into this further today. Made up a little prototype to confirm my understanding of the pouch libs + the File/FileReader APIs. Never really messed with files much in front-end JS. Mostly working as expected; able to dump pouch to a downloadable text file, and also upload a correctly formatted text file and read the contents to restore from a dump. Can't get the …

Plans for main functionality:

Dump: …

Restore: …

Other notes/ideas: …
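For illustration, a minimal sketch of how the upload/read side of such a prototype could work with a file input and FileReader; the element ID and the `restoreFromDump` callback are hypothetical stand-ins, not taken from the prototype:

```js
// Read an uploaded dump file's contents as text via a file input
const input = document.querySelector('#dump-file-input')

input.addEventListener('change', () => {
    const [file] = input.files
    if (!file) return

    const reader = new FileReader()
    reader.onload = event => restoreFromDump(event.target.result) // dump contents as a string
    reader.onerror = err => console.error('Could not read dump file', err)
    reader.readAsText(file)
})
```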
Any way to avoid this? If the DB is about 1GB big, this will lead to the extension crashing.
Apart from writing (via a node-like stream) to an actual file on the filesystem (as opposed to an in-memory file), or having some sort of streamed download, no. Writing to the actual filesystem isn't possible in standard frontend JS, from what I understand. Possibly there's a webextension-specific way to interact with a file. Even if it means we have an extra file in the extension just to write dumps to and allow a user to download, it should work, although messy. Will look into these options more today, as yes, it will quickly get out of hand with bigger data.

RE local storage: there is a limit of ~5MB IIRC. Even with unlimited storage, the file will still need to be created in memory at some point for the user to download.

EDIT: this looks promising.
Another alternative could be dumping to many files. We only have memory to work with in browser JS, so instead of progressively writing to a single file, we could create an in-memory file for each "chunk" read from the DB (as opposed to collecting all the chunks and making a single big in-memory file), then use the Chrome downloads API to auto-download a single in-memory file each time the stream signals it has data. Played with this in the prototype, and TBH, while it solves the memory issue, it doesn't provide a very nice UX. Just with 88 bookmark docs imported (+ associated page and visit docs), using ~390KB-5MB (depending on how many page docs have been filled out during imports), it needs 18 separate files. Using a larger set of data (18k+ browser history stubs + all associated data), you get this kind of fun. Could maybe batch it up further: say, after collecting ~15MB of data (arbitrary choice) from the stream, make an in-memory file, call the Downloads API, then forget about it. Still leaves the user with their DB dump spread across multiple files though, and we'd need to find a way to create an in-order stream from the dump files for the restore process and feed that back into pouch, else the memory problem still exists for restore... will play with these ideas a bit.
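A rough sketch of that batching idea, assuming `dumpStream` is the readable stream the DB dump is piped into; the batch size and file names are illustrative, not the prototype's actual values:

```js
// Buffer the dump stream's chunks until roughly 15MB has accumulated, then hand
// an in-memory file to the Chrome downloads API and forget about it.
const BATCH_BYTES = 15 * 1024 * 1024 // arbitrary ~15MB per dump file
let chunks = []
let byteCount = 0
let fileIndex = 0

function flushToFile() {
    if (!byteCount) return
    const blob = new Blob(chunks, { type: 'text/plain' })
    chrome.downloads.download({
        url: URL.createObjectURL(blob),
        filename: `db-dump-${fileIndex++}.txt`,
    })
    chunks = []
    byteCount = 0
}

dumpStream.on('data', chunk => {
    chunks.push(chunk)
    byteCount += chunk.length
    if (byteCount >= BATCH_BYTES) flushToFile()
})

dumpStream.on('end', flushToFile) // flush whatever is left as the final file
```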
@BigBlueHat you may have some ideas on how to solve this? Thanks for the input!
Bit of an update on the restoring process I was having trouble with: after spending more time looking into it, it seems to be a bug with …

Bit disappointing, as there don't seem to be any other dump APIs for pouch, and just writing the DB docs to a file isn't recommended for certain reasons relating to how pouch stores data. My proof-of-concept works fine when manually excluding attachments from our docs, but then we lose the attachments (favicon, screenshots, freezedry). Having a bit of a read through and play with the upstream …

Put the proof-of-concept code up on …

Regarding the actual backup process, I think streaming the DB dump into multiple dump files is the way to go. Without a server, I'm pretty sure this is the only way to do it without loading every single thing into memory. We're very limited with filesystem-related tasks in frontend JS. My restore logic seems to work fine with reading and restoring from multiple files, apart from the attachment issue, which is great.

A little disappointing, but unavoidable, thing with this is the time it takes to actually dump: …
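For context, a hedged sketch of a streaming dump/restore flow, assuming the library in question is pouchdb-replication-stream (the issue doesn't name it explicitly); the DB name and chunk handling are placeholders:

```js
import PouchDB from 'pouchdb'
import replicationStream from 'pouchdb-replication-stream'
import MemoryStream from 'memorystream'

// Register the streaming dump/load plugin and its stream adapter
PouchDB.plugin(replicationStream.plugin)
PouchDB.adapter('writableStream', replicationStream.adapters.writableStream)

const db = new PouchDB('webmemex') // placeholder DB name

// Dump: pipe the replication data through an in-memory stream; the 'data'
// handler is where chunks would be batched up into downloadable files.
async function dumpDatabase(onChunk) {
    const stream = new MemoryStream()
    stream.on('data', chunk => onChunk(chunk.toString()))
    await db.dump(stream)
}

// Restore: feed one dump file's NDJSON contents back into pouch
async function restoreFromDump(dumpFileContents) {
    const stream = new MemoryStream()
    const loading = db.load(stream)
    stream.end(dumpFileContents)
    await loading
}
```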
Ah bummer! May be worth a look. (I like the encryption features as well, as it could be a good addition to our plans of replicating the DB to our servers for hosting purposes.)

Regarding the file sizes: So what are the junk sizes for a 1GB DB? In the download process, we could suggest the user picks a new folder as a target for the backup, and then when uploading again, he picks the folder with all the blocks and we do the rest. This way it is still just picking one element instead of all the blocks?
Yes, this one did look interesting. I'll spend a bit more time looking at it, as we do seem to have quite a lot of pouch-related pain. It's a non-trivial change though, and may complicate things with upstream. Depends on the ratio of work needed to benefits gained, for right now I suppose.
If "junk" means "dump", I'm pretty sure it should be roughly 1GB. The stored size shouldn't be much different to the dump without compression, as it's a full replication dump of the DB.
Yeah that's what I've got the code doing now. Dumps to a …
With junks I meant the actual batch sizes, not the dump as a whole, sorry for the confusion.

Regarding RxDB: What are the challenges with such a switch?
With the batch sizes, or dump file sizes, they can be whatever size we want: basically however much we want to be storing in memory at any given time during the backup/restore process. Maybe 50MB or something (so 21 or so files for 1GB)?
The RxDB docs and issues don't seem to mention how they handle pouch attachments, or anything regarding storage of binary data. They may just let pouch handle it at a lower level if you add …

More concerned about what's available for dumping RxDB, as this was why it was brought up here. They have a single …
Yeah those, or making our own dump functionality for Pouch. I'd be more inclined to look at how difficult it would be to fix this at upstream …
Ok, how about we first get everything running without attachments, then look at how to fix it upstream? @Treora might be interested in collaborating on that fix, as it might touch his plans as well.

Regarding a future decision to use RxDB: What clear advantages do you currently see with it?
That's not easy to do either. The dump doesn't afford omitting/projecting out data like that, only filtering docs. That means the only way right now would be parsing the dump as it comes in (in newline-delimited JSON) on the stream, manually removing the attachments, and then serialising it again to NDJSON for file creation. There can be a lot of data and it is already quite slow. This would complicate the process more and result in a lossy backup feature (but maybe better than none?). We can either do it like that for now, or postpone this feature for a bit until we see if we can fix it upstream.

RE the RxDB discussion: I think it's better if you make another issue for that, and everyone can discuss it more there.
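A sketch of that workaround, assuming each NDJSON line of the dump is either a header/sequence marker or a batch of docs (the exact line shapes are an assumption based on the replication dump format):

```js
// Parse each newline-delimited JSON line of the dump, strip `_attachments` from
// any docs it carries, and re-serialize it before writing it out to a dump file.
function stripAttachments(ndjsonLine) {
    if (!ndjsonLine.trim()) {
        return ndjsonLine
    }

    const parsed = JSON.parse(ndjsonLine)

    // Only the doc-batch lines contain attachments; other lines pass through untouched
    if (Array.isArray(parsed.docs)) {
        for (const doc of parsed.docs) {
            delete doc._attachments
        }
    }

    return JSON.stringify(parsed)
}
```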
As discussed, for now we'll reduce the feature to solely import, so we can test better with large data sets. Changing the title accordingly to: "Upload Test Data Set".
The idea with this now, as just an upload for a test data set, is to reuse the upload button logic to be able to upload a number of generated dump files, and then have a separate node script to generate the dump file/s using a data generator package, like …

The script should output newline-delimited JSON dump files, in the PouchDB replication format given here.
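A hedged Node sketch of what the output stage of such a generator script might look like; the header fields, the doc shape, and the `generatePageDoc` helper are illustrative assumptions, not the actual script:

```js
// Write generated docs out as a newline-delimited JSON dump file.
const fs = require('fs')

// Trivial stand-in for whatever the data generator package would produce
const generatePageDoc = () => ({
    _id: `page/${Math.random().toString(36).slice(2)}`,
    content: { title: 'Generated page', fullText: 'Lorem ipsum...' },
})

const out = fs.createWriteStream('generated-dump-0.txt')
const writeLine = obj => out.write(JSON.stringify(obj) + '\n')

// First line: dump metadata (shape assumed from replication-stream output)
writeLine({ version: '1.0.0', db_type: 'http', start_time: new Date().toJSON() })

// Then batches of generated docs, one NDJSON line per batch
const DOCS_PER_LINE = 100
for (let batch = 0; batch < 10; batch++) {
    const docs = Array.from({ length: DOCS_PER_LINE }, () => generatePageDoc())
    writeLine({ docs })
}

out.end()
```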
After playing with a little prototype script, it's become obvious that if we want to produce pouch-replication-formatted dump files, we'll need the Pouch revisions (`_rev`).

Script is here for now: https://github.com/poltak/worldbrain-data-generator
Why don't we use the kaggle data set, as then we have real text data there? The only thing to fake would be the visit times? Also, if we use such a standard test set, it's easier for others to use it later. Probably we'd first have to upload it, so everything is put in the format of page objects and visits, and then dump it, so it looks like a real backup?
I decided to just generate the data, as it simplifies things by not needing to input, parse, and convert data. The real text data shouldn't really be an issue. However, I think maybe I'll just use it and make the script a converter, as then the script will have a defined input format. Then we can put whatever test data into that format, or even generate it, as long as it conforms to that format.

Here's the set for reference: https://www.kaggle.com/patjob/articlescrape
A real backup, IMO, should include PouchDB-specific metadata (revisions, versioning data detailing how documents have changed over time), as well as the docs data. However, this depends on how you consider a "backup" in the big scope of the project; maybe just the docs data provides the same outcome for us. So tech-wise, this will just be an importer for docs data (obtained from the conversion script, or wherever), but from the user's/our perspective, it imports data for the extension.
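As a sketch of that importer idea, assuming one plain doc per line in the uploaded files and `db` as the extension's PouchDB instance (both assumptions, not the actual code):

```js
// Read an uploaded NDJSON file of plain docs and bulk-insert them into pouch.
async function importTestDataFile(fileContents) {
    const docs = fileContents
        .split('\n')
        .filter(line => line.trim().length > 0)
        .map(line => JSON.parse(line))

    // Plain docs, so let pouch assign fresh revisions rather than replicating
    return db.bulkDocs(docs)
}
```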
Alright, the converter script is in a usable state. It can convert that kaggle dataset (and any other CSV data with …). The converter script can be downloaded from npm via …

@oliversauter Regarding the UI for uploading a test dataset: it's essentially just a file input. The user clicks, selects the data files, and it parses them and adds them to the DB. I think we should maybe just have a "Developer mode" checkbox on the imports page, which when checked shows that input. Seems common in some other extensions I've seen. What do you think?

Regarding the script: this may be of use to us in the future if we have a lot of data stored in CSVs that we want to bring into the extension. There are some additional options to do some experimental stuff, like setting the …
Good idea! Let's do it like that.
Alright, a simple UI for it is now on imports under a dev mode checkbox in …
As a user, I want to export the data from the extension, save it as a JSON (?) file, and import it again.
This would make it possible to do testing with larger data sets.
Somewhat related to the work done in #19, at least the import part. Since this feature will have to come anyhow, we should take both use cases into account: manual upload & transfer from the old extension.