-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove File/Image links in strip_code() #194
base: main
Are you sure you want to change the base?
Conversation
The problem with appveyor is not related to changes in the code. |
Thanks, I'll look at the appveyor issue. Unfortunately, I'm not sure about this. Now we need to maintain this list for every language, which introduces an additional maintenance burden. (This is one of my problems with the original issue.) I would prefer to have a way to retrieve this dynamically for the current wiki. This means mwparserfromhell needs a way to learn the context of the text its parsing and needs to grow functions to retrieve configuration from a live wiki (or some way for the user to provide it without requiring an annoying amount of boilerplate: maybe predefined config for known wikis and a way to generate/cache it for unknown wikis?). This kind of thing has been largely out-of-scope for mwparserfromhell so far, but I'm willing to accept it as a feature if we can come up with a clean implementation. Of course, that's a much more involved task than what you've done here. If we stick with this solution, there are some edge cases not covered. For example, [[Imagem:Foo]] on enwiki is a normal link, but this will strip it out, which isn't the correct behavior. You might argue that is an unlikely case, but it makes me a bit uncomfortable. Unintentionally, I think, it will also strip out just "[[ImagemFoo]]", which should be easy to fix. It might be a good idea to make this a configurable feature of strip_code, using kwargs. Since image captions are arguably part of the article text, I would be sad if we didn't have a way to preserve them. For inspiration, see how Template.__strip__ handles this. However, this means we need to parse out things like thumbnail options and potentially alt text, which I've avoided until now. We also need some tests in TestWikilink.test_strip. More stylistic things:
|
If we end up using a context object based on the API results, hardcoding a list of namespace names per language is not necessary, because the English prefixes always work and the translations can be deduced from these "canonical" names. See the relevant API query. I'm totally against doing any API queries directly from mwparserfromhell. Besides the unnecessary code bloat, that would also make it difficult to integrate mwparserfromhell into projects which would like to handle all queries by themselves. I think a nice solution is the introduction of intermediate context classes where users would supply the necessary information. Common wikis like enwiki can have an instance of the context provided by mwparserfomhell. Also, since we're talking about parsing MediaWiki titles, you might be interested in my helper Title class. That's probably the most simple case of context-sensitive parsing of MediaWiki things, but still there are some problems (e.g. lahwaacz/wiki-scripts#38). |
I agree with @lahwaacz, that it would be more comfortable to take localized names from API instead of hardcoding them. We could do something like this: # returns a dict of 'en'->'hi' translations
specials = mwparserfromhell.getLocalizedSpecials(lang='hi')
strip_specials = [specials[i] for i in ['File', 'Image', 'Media']]
wikicode = mwparserfromhell.parse(text)
s = wikicode.strip_code(strip_specials=strip_specials)
|
I would recommend hardcoding the English versions (Image, File) in mwparserfromhell, and then having a setter method so frameworks/scripts can add in localized versions from the API. I agree that mwph should not be calling the API, or at least, if we decide that mwph should be wiki-config-specific like Parsoid, then it should probably be done separately, with a little more thought about the architecture. mwparserfromhell.parser.add_file_localization([...])
mwparserfromhell.parser.add_ns_localization({'file': [...]}) |
If I may... (Disclaimer: I am currently switching with a huge wikiparsing-related project to Python from Ruby. In Ruby, I developed wikitext parser myself from scratch, so I am kinda aware of some problems and possible solutions :) My library was built on top of MediaWiki API client, so for me the problem was simpler, all the context that is necessary for proper parsing was fetched with the page. But, in the case of parsing-only library like mwparserfromhell, I'd do this:
mwparserfromhell.parse(wiki, namespaces={'File': ['Зображення']})
If you'd like, I can try to prepare the PR (though I am not sure about my Python-fu still, I started the transition not long ago). |
Closes issue #136