Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Processing xml input files #139

Open
yashkens opened this issue Dec 22, 2020 · 1 comment
Open

Processing xml input files #139

yashkens opened this issue Dec 22, 2020 · 1 comment

Comments

@yashkens
Copy link

Hi,
I am using ufal.udpipe for Python for processing files encoded in xml format. There is an example part of my input file:

<p>Звери</p>

<p>***</p>
<p>Посадили утку в курятник. Она ходит себе, оглядывается. Потом спрашивает:</p>
<p>-- Господа, а где здесь пруд?</p>
<p>А сверху, с насеста:</p>
<p>-- Мадам, здесь не прут, здесь топчут!</p>

Is it possible to store xml tags in SpaceAfter and SpaceBefore fields? The manual mentions:

Note that in theory not only spaces, but also other original content can be saved in this way (for example XML tags if the input was encoded in a XML file).

Unfortunately, I couldn't find any directions on how to do that. Please, can anyone give any suggestions?

Thanks in advance!

@foxik
Copy link
Member

foxik commented Dec 22, 2020

Currently there is no tokenizer capable of doing it automatically (i.e., you would need to parse the XML, pass only the raw text through UDPipe tokenizer and then manually add the tags to SpacesBefore/SpacesAfter). The functionality is planned for UDPipe 3 where we overhaul the tokenization completely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants