Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Research and improve editor.getData() efficiency #5812

Open
scofalik opened this issue Nov 26, 2019 · 24 comments
Open

Research and improve editor.getData() efficiency #5812

scofalik opened this issue Nov 26, 2019 · 24 comments
Labels
package:engine status:discussion support:1 An issue reported by a commercially licensed client. type:task This issue reports a chore (non-production change) and other types of "todos".

Comments

@scofalik
Copy link
Contributor

scofalik commented Nov 26, 2019

At the moment, whenever getData() is called, the whole data pipeline is used to build the result from a scratch. This is very inefficient in case of big documents. Generating output data takes a visible amount of time (>0.5s) starting from documents of 15+ A4 pages. It means there's annoying lagging.

This causes problems for:

  • Autosave plugin,
  • Watchdog feature,
  • two-way data binding in frameworks,
  • basically any integration/plugin/code that needs to save the data in time intervals.

For really big documents it may also provide UX loss simply when clicking the "Save" button in a form. Consider browser stopping being responsive, or needing to take an additional 5 seconds before sending HTTP request. And I am using a relatively fast MacBook, so in case of a typical user, it could be even worse.

As I pointed out, the biggest reason for that inefficiency is building the data view from scratch on .getData(). If it would be kept up-to-date in the memory, just like the editing view, it wouldn't take that much time to perform getData(). So, it would basically work like the editing pipeline but with different converters. Maybe there's even way to quickly prototype it just by using the EditingController in place of the DataController 🤔?

Keep in mind, that would be quite a breaking change. I think this is a huge issue that will demand a lot of testing.

Other than that, AFAIR, we have might think about improving performance in view downcast writer, downcast dispatcher and possibly other parts of the data pipelines. However, there's a ceiling for how much can be achieved this way. So, unfortunately, making the data pipeline "live" is something we will have to do one day.

There's one thing to note, though. When we will introduce the "live" data pipeline, every action done in the editor will take twice as much time. As we add features, callbacks, post fixers, etc. to the editor and the number of things in the event loop rises, we may hit an efficiency issue from another side (for example, typing becomes less responsive).

There's a possibility that we may end up with two data pipelines: "live" and "on-demand" (that's the one we have now). We could either give the possibility to explicitly use one of them through configuration or through preparing appropriate editor build. But we could also try switching them as the document size grows.

If we hit performance issues with the "live" data view, we could also maybe think about batching changes and updating the data view less frequently than the editing view.

@scofalik scofalik added status:discussion type:task This issue reports a chore (non-production change) and other types of "todos". package:engine labels Nov 26, 2019
@scofalik scofalik added this to the next milestone Nov 26, 2019
@ma2ciek
Copy link
Contributor

ma2ciek commented Nov 26, 2019

There are some performance issues already reported in integration repositories that would benefit from this improvement:

@Mgsy Mgsy added the support:2 An issue reported by a commercially licensed client. label Dec 30, 2019
@Reinmar
Copy link
Member

Reinmar commented May 7, 2020

I've been thinking about data pipeline's performance on various occasions recently. Even if we'd greatly improve the performance of loading or retrieving the data to/from the editor, then there will always be content that long to become a problem. I know that some of our customers ask about 1000 pages long documents. Of course, we can go from e.g. 5s to 2.5s (as we did recently), but that'll still long.

What I was often recommending is loading just one chapter of such documents at once. This is probably the safest solution long-term.

But there's also a second, complementary option – don't use the data pipeline at all. Use the editor model directly. I'd guess that the 2.5s would become less than 100ms at that point.

The risk, though, is being tied to our data model's format. Also, breaking changes in the model structure would force data migration (there would be ways to automate this – e.g. use the data pipeline to migrate content via using the HTML format as the stable link between incompatible editor versions).


@fredck, I know that you worked on data snapshots performance for GitHub Writer. Could you share some insights here? Did you go with saving the data model's structure directly (JSON/XML) or something else?

@fredck
Copy link
Contributor

fredck commented May 7, 2020

@fredck, I know that you worked on data snapshots performance for GitHub Writer. Could you share some insights here? Did you go with saving the data model's structure directly (JSON/XML) or something else?

Yes, I've spent some good time on this issue and came out with a complete solution here:
https://github.com/ckeditor/github-writer/blob/master/src/app/plugins/livemodeldata.js

The code should be self-explanatory.

I save a snapshot of the model in a XML format for every single change made to the model. The time to retrieve the updated data is under 1ms.

You can observe the changes being saved (and the data format) in the "Application" tab of the dev tools, under "Session Storage" while using GitHub Writer.

@fredck
Copy link
Contributor

fredck commented May 7, 2020

Additionally, I've tried to implement a "Live View Data". I've even prototyped it and it could work well. Then I figured out that this is impossible to have because of the way features are implemented.

The problem is that features don't care about "attribute" and "remove" events in the data downcasting and therefore the "live view" would not catch such changes.

To make this feature a reality, all features should be reviewed and their data downcast should be as complete as the editing downcast, handling all kinds of changes.

I can have the prototype somewhere for you guys to check it, if you want. Just let me know.

@Reinmar
Copy link
Member

Reinmar commented May 8, 2020

Additionally, I've tried to implement a "Live View Data". I've even prototyped it and it could work well. Then I figured out that this is impossible to have because of the way features are implemented.

I guess that this is exactly what @scofalik proposed to do and exactly the issue that he predicted. Unfortunately, it seems like a lot of work to review existing features in this context. Also, unless we start using this for real, the effort will go nowhere as with time features will stop supporting it.

As for the live model data, I'm curious why you chose XML over JSON. JSON seems to be the standard for people today. Was readability/compactness a reason?

@fredck
Copy link
Contributor

fredck commented May 8, 2020

To live it documented here, this is the prototype code I came with for introducing editor.data.view and the "Live HTML Data Processor:
https://gist.github.com/fredck/a80c0b86ed53d3da3776c55ebb3e0e19

This worked very well while loading and inserting data in the editor (because features implement data downcasting for it). It stopped working when changing attributes and removing elements (because features don't implement data downcast for it).

I implemented this as plugins over the current editor code. I imagine that bringing it into the engine directly could make the above prototype even simpler and clearer.


Unfortunately, it seems like a lot of work to review existing features in this context.

Yes, definitely. But I believe it's doable.

Especially taking in consideration that the majority of the features share the same downcasting strategy for the editing and data pipelines. Which leads me to comment on the following:

When we will introduce the "live" data pipeline, every action done in the editor will take twice as much time.

Whenever possible (when editing and data strategy is the same), we could figure out a way to change both pipelines at the same time, instead of having them handled separately.

Ofc, we should start with the hard features first (tables, lists), to see if it would ever be possible.

Also, unless we start using this for real, the effort will go nowhere as with time features will stop supporting it.

Well, I assume this would ideally replace getData() altogether, so we would stick with it forever.


Summing up, this means a lot of work to be done. But I have a feeling that the outcome could be gorgeous and the benefits surreal ;)

@fredck
Copy link
Contributor

fredck commented May 9, 2020

As for the live model data, I'm curious why you chose XML over JSON. JSON seems to be the standard for people today. Was readability/compactness a reason?

JSON definitely sounds more modern, especially in the web/js world. But XML is also quite universal, although that's not the reason for the choice.

I had a few goals when designing this feature:

  1. Keep the live data in memory as a ready to use string (too complicated)... or an "almost string" - an array of strings that could be retrieved super fast with array.join().
  2. Updates to the above live data should be as simple as possible to avoid overhead during changes processing.
  3. The opposite direction, setting the editor with data produced by the live model data, should be as fast as possible.

I started with JSON in mind, but it ultimately failed for points 1 and 2. For point 3, both JSON (JSON.parse()) and XML (DOMParser()) do a very good job.

Let me try to explain why XML won over JSON for 1 and 2.


The in memory version of the data is an array tree (arrays inside arrays). In this way we can easily map the position of model nodes to the same position in the array tree. For example:

// Model:
<root>
    <blockQuote>
        <paragraph>
            text
        </paragraph>
    </blockquote>
</root>

// Live data array:
[ /* root /*
    [ /* blockQuote /*
        [ /* paragraph /*
            [ /* text /* ]
        ]
    ]
]

As we can see, the positions are a 1:1 match.

Then, we want to have an "almost string" representation of the data. With XML, this job is easy:

// Live data array:
[ '<root>',
    [ '<element name="blockQuote">',
        [ '<element name="paragraph">',
            [ '<text>', 'text', '</text>' ],
            '</element>'
        ],
        '</element>'
    ],
    '</root>'
]

Every node is represented by an array: [ '<opener>', ...children, '</closer>' ].

The above allows for simply calling liveDataArray.flat( Infinity ).join( '' ) and, voila, we have the data, super fast.


Now, trying to do the above with JSON is much more complicated:

// An empty element:
{ "type":"element", "name":"xyz" }
// or even this, if we want to help JSON:
{ "type":"element", "name":"xyz", "children":[] }

// Same element, 1 child:
{ "type":"element", "name":"xyz", "children":[ { "type":... } ] }

// Same element, 2 children:
{ "type":"element", "name":"xyz", "children":[ { "type":... },{ "type":... } ] }

The last case is the problematic one... do you see that comma between the children nodes? That messes things up.

Let me try to put the above in a "almost string" array, similar to the XML one:

// An empty element:
[ '{ "type":"element", "name":"xyz", "children":[', '] }' ]

// Same element, 1 child:
[ '{ "type":"element", "name":"xyz", "children":[', [ '{ "type":... }' ], ' ] }' ]

// Same element, 2 children:
[ '{ "type":"element", "name":"xyz", "children":[', [ '{ "type":... }', ',', '{ "type":... }', ' ] }' ]

Here goes the problem... the last option doesn't follow the [ '{ opener', ...children, 'closer}' ] format. It makes model->array position mapping more complicated and requires fixing things up when adding and removing sibling nodes.


So, instead of trying to go with a format that showed some complications, I've decided to go with the one that makes our life easier and our code faster :)

@fredck
Copy link
Contributor

fredck commented May 11, 2020

Wanted to share performance results for a comparison I've just made about how fast it is to have a string out of an in-memory representation of the model data. I've tested in-memory formats for the following outputs:

  • HTML: out of a DOM (document fragment) that reflects the model.
  • JSON: out of an object tree containing the definition of all model nodes.
  • XML: out of a "almost-string array" (my implemented strategy)

I've created the following code. It's enough to paste it in the console and run it. It'll use the page DOM as the "model", just for comparison purposes:
https://gist.github.com/fredck/f89ecaad2ba99d40c6b274efbf6369f0

I've run it in a monster page (gazeta.pl) and these are results output:

Firefox:

outerHTML time: 8ms
outerHTML length: 416242

JSON time: 10ms
JSON length: 613059

XML Array time: 11ms
XML Array length: 736540

Chrome:

outerHTML time: 11.237060546875ms
outerHTML length: 429060

JSON time: 14.77294921875ms
JSON length: 628073

XML Array time: 17.867919921875ms
XML Array length: 757506

Running multiple times in Firefox had more stable results. In Chrome, they changed a lot: sometimes outerHTML was super slow, other times XML Array was the fastest... and some times super slow as well.

Conclusion

Anyway, the above could tell us that all strategies are good and should perform well in the CKEditor case. The "XML Array" is surprisingly the slowest by a small bit.

This means that going with JSON is a better option as the code is less obscure, the output size is smaller and it performs similarly/better.

My implementation of the Live Model Data could easily be modified to output JSON. I'll probably do so.

@fredck
Copy link
Contributor

fredck commented May 11, 2020

Changing topic, I've got some other ideas about possible future enhancements to data in CKEditor:

  • Use native DOM for the model data, as well as for the views (editing, data) data:
    • It'll bring the whole DOM features available for free:
      • All the necessary tools to create elements, attributes, etc.
      • Access to cool stuff, like querySelector() (!). E.g. I want to retrieve all linked texts from the model: querySelector( 'text[linkHref]' ).
      • New differ on top of native MutationObserver... total freedom on the way of manipulating the model.
      • A dump of any of the data trees would be a innerHTML call away. Super fast and easy.
    • Reduce the amount of code to maintain.

This idea is inspired by what how smartly GitHub uses the web platform for its application. Indeed, if the features are there, why having a (poorer) custom implementation.

Ofc, maybe today its easier to imagine this working than it was we designed CKEditor 5 a few years ago. Maybe?

@jodator
Copy link
Contributor

jodator commented May 12, 2020

  • Use native DOM for the model data, as well as for the views (editing, data) data:

Wouldn't it make practically impossible to use model on Node? (this would be my first concern about it).

@Reinmar
Copy link
Member

Reinmar commented May 12, 2020

@jodator There's JSDOM. That should work on Node fine assuming that JSDOM supports all the features that we'd use. I know, for instance, that there was no mutation observer for a long time.

@fredck That's an interesting idea and it'd be interesting to research it one day. There'd be challenges, but there are some candies out there, for sure. BTW, the DOM can also be used for the view.

Thanks for sharing your thoughts on the performance aspect too. I'm sure we'll have to get back to that at some point. We found 2x speed-up so far and I guess there's like a 5x possible if we go deeper into some algorithms. However, for large documents we simply need a different approach – additional pipeline. It could either work around the view completely, or just use a live view structure, like the editing view.

@fredck
Copy link
Contributor

fredck commented May 13, 2020

FYI: the implementation of the Live Model Data has been changed to JSON:
https://github.com/ckeditor/github-writer/blob/master/src/app/plugins/livemodeldata.js

@urbanspr1nter
Copy link
Contributor

Hi -- apologies for bumping this up - but is there any work being done to try to improve getData performance? For our application, we use this heavily to be able to synchronize draft documents. It happens quite frequently (multiple times per second), but the size of our documents tend not to be large.

@scofalik
Copy link
Contributor Author

scofalik commented Jul 20, 2022

@urbanspr1nter At this moment I can only say that we are changing our opinion on storing the model as the document data source and we are more optimistic towards this idea. It would solve some of our problems, including efficiency problems when loading and getting data from the editor. Most probably this will be the direction that we will take. If or when it happens is too difficult to say, as we have to take a lot into consideration.

In last three years we have seen that the model "schema" is much more stable than we anticipated and we don't change nearly anything there. However, there's always risk that if you saved the model, it can become outdated. For example, recently we changed the name of image element to imageBlock and introduced imageInline. Fortunately, this is a simple change and backward compatibility could be easily introduced for such changes. Other risk is that you will remove some plugin from your config and an item in the model will not be recognized anymore.

However, these two problems do not exist if you store and pass model only for a duration of an editing session (it is not stored temporarily, or HTML is stored as well at some points and is used when the editor version or config changes). If your use case can implement safe model data usage, then I'd recommend giving it a try. Storing and loading model is much faster than the whole conversion process. Especially storing, it is fast and easy:

JSON.stringify( Array.from( editor.model.document.getRoot().getChildren() ) );

Voila!

For loading, you will need to recreate model from the JSON string and insert it into the root:

class ModelInit extends Plugin {
    init() {
        this.editor.data.on( 'init', ( evt ) => {
            evt.stop();

            const modelJson = JSON.parse( modelString );
            const modelStruct = makeModel( modelJson );

            this.editor.model.enqueueChange( { isUndoable: false }, writer => {
                const modelRoot = this.editor.model.document.getRoot();
                writer.insert( modelStruct, modelRoot, 0 );
            } );

            this.editor.data.fire( 'ready' );
        }, { priority: 'highest' } );
    }
}

function makeModel( childrenJson ) {
    const result = [];

    for ( const childJson of childrenJson ) {
        if ( childJson.data ) {
            const text = new Text( childJson.data, childJson.attributes || null );

            result.push( text );
        } else {
            const children = childJson.children && childJson.children.length > 0 ? makeModel( childJson.children ) : [];
            const element = new Element( childJson.name, childJson.attributes || null, children );

            result.push( element );
        }
    }

    return result;
}

Above shows how you could initialize the editor with a stringified model. If you just want to set the model data after you received it from external source, use this modified DataController#set() code:

const modelJson = JSON.parse( modelString );
const modelStruct = makeModel( modelJson );

this.editor.model.enqueueChange( { isUndoable: false }, writer => {
    const modelRoot = this.editor.model.document.getRoot();
    writer.remove( this.editor.model.createRangeIn( modelRoot ) );
    writer.insert( modelStruct, modelRoot, 0 );
} );

Finally, consider those two notes:

  1. This is not an officially supported code, just a proof of concept, which hasn't been researched, reviewed or tested. Although it looks like it makes sense, the testing and decision if you want to use it is on you.
  2. The editor was never intended to be able to return the data as frequently as in your case. What you try to achieve actually looks like a use case for real-time editing. Did you consider using this feature, it could make your UX much better.

@scofalik
Copy link
Contributor Author

scofalik commented Jul 20, 2022

You can also take a look at this solution: #620 (comment) However it was proposed 3 years ago and I am not sure if it still up-to-date. Also, AFAICS, you will not be able to get HTML from the editor this way (as DataController#stringify is overwritten).

@urbanspr1nter
Copy link
Contributor

@scofalik - Very interesting! Thanks for the insight. This is an interesting option .. but I'm also wondering if the view data from the editing pipeline can also be an option? Forgive me if this seems like a bad assumption, but I assume that getData really just the representation of the document using the data view pipeline. But what if let's say for our case, the editing and data views are not that much different. Would there be some performance benefit in just directly stringifying the editing view as opposed to getData?

@scofalik
Copy link
Contributor Author

So, you basically suggest to do editor.setData( editor.ui.getEditableElement().innerHTML )?

.innerHTML will be much faster than editor.getData() and I assume that also faster than stringifying the DOM.

However, I do not think this is a safe solution. Of course, it depends on what features you use. If you only use styling features, headings and lists then maybe that would work. Widgets (images, tables) will add UI elements that may be incorrectly converted. We also put some things (like hidden spaces) to fight browser quirks, and then you would have them in your content.

I would not recommend that. We may put really "weird" stuff into editable to provide UI or accessibility or to fight browser quirks.

Also, remember that editor.setData() will be slower than injecting model directly, as I suggested in my earlier post.

@urbanspr1nter
Copy link
Contributor

@scofalik - That's a good point. I did not consider the injected content. We actually keep things a bit simple and try to keep our editing view and data view more or less, the same. We don't make use of UI elements, however we do use the table and image plugins to a certain degree.

We also inject hidden spaces (ZWS) ourselves to manage the browser input behaviors. However, when we finally need to get the data, we generally do a global replace to remove them anyway. I assume that our operation in doing so is already probably doing that to whatever the library is also injecting.

That is very interesting though - I had never considered in leveraging the model as a way to serialize/deserialize the editor data. This would normally work for our use case, but we also fan out editor data to subscribers within our app so that they can make use of data too.

Also! Thank you for being so quick to respond and actually responding to such an old thread. It's really appreciated and I learned a lot just from reading your response. 😁 I'm going to revisit our implementation in how we serialize the editor data, and perhaps can come up with something creative there. I totally understand that getData was probably never really designed for near real-time retrieval.

@ThiefMaster
Copy link

I think some way to expose the HTML data in real time is a crucial feature. And since the issues with not having this only show up with large documents (guess what developers are unlikely to use / test with when integration CKEditor5? ;)) they tend to show up later, possibly even only in production. Basically at a time where finding a workaround is a mess (and waiting for upstream to improve things won't do any good in the short term).

Is there any reason why the editor doesn't have an option to always keep a HTML version internally? Since it probably already tracks all changes to its modal in order to show the editable view, it sounds pretty straightforward - for someone familiar with the CKEditor internals - to also keep an up to date HTML version that's in sync with the model, and could be used to populate e.g. a hidden field for more classic forms or the state of a react form library (e.g. react-final-form).

@hoqkhanh
Copy link

So, you basically suggest to do editor.setData( editor.ui.getEditableElement().innerHTML )?

.innerHTML will be much faster than editor.getData() and I assume that also faster than stringifying the DOM.

However, I do not think this is a safe solution. Of course, it depends on what features you use. If you only use styling features, headings and lists then maybe that would work. Widgets (images, tables) will add UI elements that may be incorrectly converted. We also put some things (like hidden spaces) to fight browser quirks, and then you would have them in your content.

I would not recommend that. We may put really "weird" stuff into editable to provide UI or accessibility or to fight browser quirks.

Also, remember that editor.setData() will be slower than injecting model directly, as I suggested in my earlier post.

@scofalik your solution solved my issue . But the cursor always is first of ckeditor field. Do you have any idea to keep the cursor at current position ?

I tried :
editor.focus();
editor.model.change((writer) => {
writer.setSelection(writer.createPositionAt(editor.model.document.getRoot(), "end"));
});
It seem not work

@scofalik
Copy link
Contributor Author

editor.model.change((writer) => {
	writer.setSelection(writer.createPositionAt(editor.model.document.getRoot(), "end"));
});

Should work correctly. I even checked it in a test sample. It correctly moves the selection. So, it must be something around this code. But this is an offtopic to this issue.

@lslowikowska lslowikowska added support:1 An issue reported by a commercially licensed client. and removed support:2 An issue reported by a commercially licensed client. labels Jul 24, 2023
@nfl404
Copy link

nfl404 commented Mar 23, 2024

那么,您基本上建议做什么editor.setData( editor.ui.getEditableElement().innerHTML )
.innerHTML``editor.getData()会比对DOM 进行字符串化要快得多,而且我认为也比字符串化要快。
但是,我认为这不是一个安全的解决方案。当然,这取决于您使用什么功能。如果您只使用样式功能、标题和列表,那么也许这会起作用。小部件(图像、表格)将添加可能被错误转换的 UI 元素。我们还添加了一些东西(例如隐藏空间)来应对浏览器的怪癖,然后您会将它们添加到您的内容中。
我不建议这样做。我们可能会将真正“奇怪”的东西放入可编辑中,以提供 UI 或可访问性或对抗浏览器怪癖。
另外,请记住,这editor.setData()比直接注入模型要慢,正如我在之前的文章中建议的那样。

@scofalik你的解决方案解决了我的问题。但光标始终位于 ckeditor 字段的第一个位置。您有什么想法将光标保持在当前位置吗?

我尝试过: editor.focus(); editor.model.change((writer) => { writer.setSelection(writer.createPositionAt(editor.model.document.getRoot(), "end")); }); 看来不行

It is a better solution to replace editor.getData() with editor.ui.getEditableElement().innerHTML.

@scofalik
Copy link
Contributor Author

scofalik commented Apr 3, 2024

I would be cautious and advise against using editor.ui.getEditableElement().innerHTML as it does not contain the same data as editor.getData(). Some elements are displayed differently in the editing field than in the document data. There is a risk that you may lose some data after saving innerHTML an then loading it in the editor.

@scofalik
Copy link
Contributor Author

scofalik commented Dec 19, 2024

In recent releases, we worked on performance improvements in engine, mostly from the point of view of editor init time. More on that in recent changelog entry. In summary -- we lowered editor init time by around 65-80%, depending on the data sample we had.

The good news is that these changes also have a positive impact on editor.getData().

Worth mentioning -- these changes are not brekaing and do not change the editor architecture, or how the core engine parts work. More precisely, the data is still evaluated "from scratch" when editor.getData() is used.

We still have one big improvement ahead -- scheduled for February release.

As we measure on our test samples, together with the improvement mentioned above, we see 70-90% lower editor.getData() times. In other words, these improvements lower time execution of editor.getData() 10 times. This depends on several factor, like amount of formatting, used features, etc. (loading and saving time is proportional to the number of tags in the data, rather than amount of text).

We hope that these changes scheduled for February will solve many problems that you might experience so far.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
package:engine status:discussion support:1 An issue reported by a commercially licensed client. type:task This issue reports a chore (non-production change) and other types of "todos".
Projects
None yet
Development

No branches or pull requests