New example: support a more wide range of unicode identifiers #13
Comments
This is definitely an interesting example. I would likely use something like Line 68 in 72d84ec
It might require to patch some builtins like To properly demonstrate this, I would likely need to "fix" the console so that it works "properly" with AST transformations: currently, one requires to use = = = |
There are more efficient ways, of course, but this looks fine for an example.
Yes, I think so. For an industry-grade solution I would expect that the inspect module should be also altered, e.g. inspect.signature(). Maybe something else from the stdlib. But for an example - dir() is enough.
If it's a good idea, in your view - I'll try to implement this. |
Please feel free to go ahead. I'm thinking of a potentially "simpler" approach where the source is transformed at the tokenization stage, so that no AST transformation would be required. I think I could make this work ... but there might be some advantages to doing AST transformations that I cannot see due to my lack of knowledge. My mind has been coming back to this idea while doing some other work; I definitely find this an interesting example. |
I'm not sure how robust it could be. But the tokenize() does preserve "disallowed" unicode symbols, as I noted before. |
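A quick way to check the claim above is to list the NAME tokens of a small source string. This is only a sketch: the helper `name_tokens` is a hypothetical name introduced here, not part of ideas.

```python
import io
import tokenize


def name_tokens(source):
    """Return the string of every NAME token found in ``source``."""
    readline = io.BytesIO(source.encode("utf-8")).readline
    return [
        tok.string
        for tok in tokenize.tokenize(readline)
        if tok.type == tokenize.NAME
    ]


# The tokenizer reports the identifier text as written in the source;
# NFKC normalization only happens later, when the code is parsed.
names = name_tokens("ℕ = 1\nN = 2\n")
print(names)
```

Because the unnormalized spelling is still visible at this stage, a token-level transformer can tell ℕ and N apart before the parser merges them.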
Probably, you were right: after some playing with the token-based approach, I don't think it will break things. And it's simple, indeed:
import io
import tokenize
import unicodedata
import uuid

from ideas import import_hook

_NAMES_MAP = {}

def fix_names(source, **kwargs):
    result = []
    g = tokenize.tokenize(io.BytesIO(source.encode()).readline)
    for toknum, tokval, _, _, _ in g:
        if toknum == tokenize.NAME:
            if unicodedata.normalize('NFKC', tokval) != tokval:
                if tokval not in _NAMES_MAP:
                    _NAMES_MAP[tokval] = f'_{uuid.uuid4().hex!s}'
                tokval = _NAMES_MAP[tokval]
        result.append((toknum, tokval))
    return tokenize.untokenize(result).decode()

def source_init():
    return """
old_dir = dir
def dir(obj):
    result = old_dir(obj)
    for k, v in _NAMES_MAP.items():
        result = [_.replace(v, k) for _ in result]
    return sorted(result)
"""

import_hook.create_hook(source_init=source_init, transform_source=fix_names)
Session example:
(BTW, maybe locals() should be added to the console_dict per default?) |
Very nice! Your suggestion of adding locals() by default makes sense; this is essentially what I do with another project (friendly). In addition to modifying dir(), one would probably need to modify vars(), and perhaps locals() and globals(). Tracebacks might be tricky to decipher unless they are decoded as well. I'm thinking that this example should be included as "extended_unicode". I'll try to do this tomorrow and perhaps write a blog post about it, giving you full credit for the idea and implementation. I like it: it is very much in the spirit of what I had in mind when I created this project. |
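Decoding the names returned by dir(), vars() and friends could share one helper. The following is a minimal sketch, assuming the map has the same shape as _NAMES_MAP in the code above (original name to replacement name); `decode_names` and `decoded_vars` are hypothetical names, not functions from ideas.

```python
def decode_names(names, names_map):
    """Map replacement names back to the identifiers the user typed.

    ``names_map`` is assumed to map original -> replacement name.
    """
    reverse = {mangled: original for original, mangled in names_map.items()}
    return sorted(reverse.get(name, name) for name in names)


def decoded_vars(namespace, names_map):
    """Return a copy of ``namespace`` whose keys are decoded (vars()-like)."""
    reverse = {mangled: original for original, mangled in names_map.items()}
    return {reverse.get(key, key): value for key, value in namespace.items()}


# Illustrative data only: a made-up replacement name for ℕ.
example_map = {"ℕ": "_8dab3ef5"}
decoded = decode_names(["N", "_8dab3ef5"], example_map)
decoded_ns = decoded_vars({"N": 2, "_8dab3ef5": 1}, example_map)
```

An exact dictionary lookup is used here instead of str.replace(), so a name that merely contains a replacement name as a substring is left alone.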
Maybe. Or we can just mention other pitfalls of this approach: this is an example, right? But
I've added this transformer as the unicode_identifiers() function, because it allows any Unicode string as an identifier (again, this is probably not a good idea for professional code: perhaps Julia-like normalization is more suitable for math). But I'm not good at naming, anyway.
Thank you. I was planning to finish a PR for this, but if you have time to do this yourself (better naming, |
I've uploaded a new version to PyPI. I made a few relatively minor changes to your code.
Here's a sample session with the new code.
>>> from ideas.examples import unnormalized_unicode
>>> from ideas.console import start
>>> start()
Configuration values for the console:
console_dict: {'__NAMES_MAP': {}, 'ndir': <function ndir at 0x018CAD20>}
transform_source: <function fix_names at 0x0167C540>
--------------------------------------------------
Ideas Console version 0.0.20. [Python version: 3.7.8]
~>> ℕ = 1
~>> N = 2
~>> ℕ
1
~>> dir()
['N', '_8dab3ef5fc2949e992deda99acbfb037', '__NAMES_MAP', '__builtins__', 'ndir']
~>> ndir()
['N', '__NAMES_MAP', '__builtins__', 'ndir', 'ℕ']
~>> class A:
...     ℕ = 1
...     N = 2
...
~>> def interesting(names):
...     return [n for n in names if not n.startswith("__")]
...
~>> interesting(dir(A))
['N', '_8dab3ef5fc2949e992deda99acbfb037']
~>> interesting(ndir(A))
['N', 'ℕ']
As I mentioned, I still have to write documentation for it (and probably a blog post), but that will have to wait for a bit, as I want to think some more and see whether this could be improved further. There is still the idea of passing locals() to the console, which I need to think about... |
Just an additional thought I had ... for easier comparison, I think that the new names should start with the normalized name followed by an underscore and the uuid (which could be probably truncated a bit). So, in the example above, |
New version uploaded with the last mentioned change implemented. Here's a snippet showing the result:
~>> ℕormal = 3
~>> ndir()
['__NAMES_MAP', '__builtins__', 'ndir', 'ℕormal']
~>> dir()
['Normal_98a22058a2aa4b31a900c8b215ea09c5', '__NAMES_MAP', '__builtins__', 'ndir']
~>> ℕormal
3 |
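The naming scheme shown in this session (normalized name, underscore, uuid) can be sketched as follows. This is an illustration under assumptions, not the code shipped in ideas: `mangled_name` and its `uuid_length` parameter are hypothetical, and the truncation length is arbitrary.

```python
import unicodedata
import uuid


def mangled_name(name, names_map, uuid_length=8):
    """Return a stable replacement for ``name``: NFKC form + '_' + uuid part.

    The result is cached in ``names_map`` so that the same identifier
    always maps to the same replacement within a session.
    """
    if name not in names_map:
        normalized = unicodedata.normalize("NFKC", name)
        suffix = uuid.uuid4().hex[:uuid_length]  # truncated, as suggested
        names_map[name] = f"{normalized}_{suffix}"
    return names_map[name]


names_map = {}
first = mangled_name("ℕormal", names_map)
second = mangled_name("ℕormal", names_map)
```

Starting the replacement with the normalized name keeps the output of the original dir() readable, at the cost of a slightly higher collision risk when the uuid is truncated.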
Playing with yet another new version (not uploaded to PyPI). The ideas version number shown here has not yet been changed to reflect the latest version.
>>> from ideas.examples import unnormalized_unicode
>>> from ideas.console import start
>>> start()
Configuration values for the console:
console_dict: {'__NAMES_MAP': {}, 'ndir': <function ndir at 0x018EACD8>}
source_init: <function source_init at 0x018EAD20>
transform_source: <function transform_names at 0x00ADC540>
--------------------------------------------------
Ideas Console version 0.0.21. [Python version: 3.7.8]
~>> dir()
['__builtins__']
~>> ℕ = 1
~>> dir()
['__builtins__', 'ℕ']
~>> true_dir()
['N_fa969ab27247436c8c5350151e606b57', '__NAMES_MAP', '__builtins__', 'dir', 'true_dir'] |
I would prefer to fix the "patched" dir(), if possible. It doesn't have to match the original dir() behaviour exactly, but the interface should be the same, i.e. it should support the no-args version of dir().
Yes, this looks better. Maybe you can add some common prefix/suffix to simplify filtering/selecting such variables. BTW, I'm planning to use the ideas (Ideas Console or just import_hook) for the Diofant's command-line interface (diofant/diofant#853). Will you, eventually, factor-out the import_hook/console stuff to some separate library/ies? |
I believe I got it to work now, as similarly as possible to the original dir().
One possibility I thought of was to automatically exclude variables that start with a double underscore, as these are often methods that are of no interest. For example:
~>> class A:
...     ℕ = 1
...
~>> dir(A)
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'ℕ']
I suspect that all the so-called magic methods, that is, those that start and end with double underscores, would be of no interest to most "casual" users or users of projects with their own consoles, like Diofant. Would such filtering be useful for your project? For this example, I am thinking of keeping the original dir around, under another name.
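The filtering proposed here could look like the sketch below. `without_dunders` is a hypothetical helper name, and the rule (drop names that both start and end with a double underscore) is one possible reading of the proposal.

```python
def without_dunders(names):
    """Drop 'magic' names, i.e. those that start and end with '__'."""
    return [
        name
        for name in names
        if not (name.startswith("__") and name.endswith("__"))
    ]


class A:
    x = 1


visible = without_dunders(dir(A))
```

Requiring both the prefix and the suffix keeps name-mangled private attributes such as `_A__secret` and names like `__private` visible, which a plain startswith("__") test would hide.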
I was not planning to do any such factoring out. In my mind, ideas is very much a toy project used to explore different possibilities of changing the way Python works. For that purpose, I believe it is important to include all the examples. I did not expect it to be found useful in any real-life project; however, I can see how this could be the case for Diofant. One thing I can do that might be useful is to filter out the original message about configuration values for the console by default, and only show them with something like
>>> start()
Configuration values for the console:
console_dict: {'__NAMES_MAP': {}, 'ndir': <function ndir at 0x01ABACD8>}
source_init: <function source_init at 0x01ABAD20>
transform_source: <function transform_names at 0x0137C540>
--------------------------------------------------
Ideas Console version 0.0.21. [Python version: 3.7.8]
I can also change it so that the message shown with the name of the console, its version, etc., is easily configurable; something like
I should be able to do this today and release a new version with these changes. Finally, for AST transformations, the REPL does not "echo" back the value of names or the value of statements without an explicit print statement. I would think that fixing this would be useful for projects like Diofant. I think there might be a way of making this work for simple cases where one just wants to see the value of a variable, but I don't currently know how to have it reproduce the exact behaviour of Python's REPL. What would be needed for Diofant? |
As it turns out, Python 3.9+ includes an unparse function in the ast module. This makes it possible to do AST transformations, turn the result back into normal, valid Python source, and compile that source the usual way in the interactive interpreter, so that print() is no longer needed to see the output.
>>> from ideas.examples import fractions_ast
>>> hook = fractions_ast.add_hook()
>>> from ideas.console import start
>>> start()
Ideas Console version 0.0.21. [Python version: 3.9.5]
~>> 1/2
Fraction(1, 2)
~>> a = _
~>> a
Fraction(1, 2)
I noticed that Diofant requires Python 3.9 ... which is perfect for this. I have uploaded a new version to PyPI which includes this change. This new version also includes the other changes mentioned for the console (hiding the configuration values, configurable prompt and banner, etc.). It is not fully tested, but the quick interactive tests I did all worked as I expected. I will definitely need to update the documentation to reflect all of these changes. |
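The round trip described above (parse, transform, unparse, then compile as ordinary source) can be sketched with the standard library alone. This is not the fractions_ast example from ideas: `DivisionToFraction` and `transform_source` are hypothetical names, and the transformer only handles division of two integer literals.

```python
import ast
from fractions import Fraction


class DivisionToFraction(ast.NodeTransformer):
    """Rewrite ``<int literal> / <int literal>`` as ``Fraction(a, b)``."""

    def visit_BinOp(self, node):
        self.generic_visit(node)
        both_ints = all(
            isinstance(side, ast.Constant) and type(side.value) is int
            for side in (node.left, node.right)
        )
        if isinstance(node.op, ast.Div) and both_ints:
            call = ast.Call(
                func=ast.Name(id="Fraction", ctx=ast.Load()),
                args=[node.left, node.right],
                keywords=[],
            )
            return ast.copy_location(call, node)
        return node


def transform_source(source):
    """Parse, transform, and turn the tree back into source (Python 3.9+)."""
    tree = DivisionToFraction().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


new_source = transform_source("1/2")
value = eval(new_source, {"Fraction": Fraction})
```

Since the result is plain source again, it can be handed to the regular interactive compile step, which is what lets the console echo expression values without an explicit print().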
That will be a different dir().
I don't think so.
Not really. There is a very early version that is included as a POC and not yet exposed to end users (e.g. with a CLI option).
Maybe. But I don't use the start() interface. Instead, I subclass IdeasConsole.
I'm not sure I understand you. Everything seems to be working for the DiofantConsole:
|
UPD:
I think I got it. This happens, for example, with the AutomaticSymbols() AST transformer. Cf. the IPython session:
and the IdeasConsole:
FYI: the given ast transformer does the following transformation in this case:
I did the following workaround:
--- console.py.orig 2021-06-16 08:20:05.669683618 +0300
+++ console.py 2021-06-16 08:20:13.853350692 +0300
@@ -138,6 +138,8 @@
 if hasattr(ast, 'unparse'):
     try:
         source = ast.unparse(tree)
+        source = source.split("\n")
+        source = ";".join(source)
     except RecursionError:
         code_obj = compile(tree, filename, "exec")
 else:
|
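One plausible reading of this workaround (an assumption, since the thread does not spell out the failure) is that a transformer which adds statements makes ast.unparse() return several lines, and an interactive console compiles input in "single" mode, which accepts only one statement or one line of statements separated by semicolons. The sketch below checks that behaviour with the built-in compile(); `compiles_as_single` is a hypothetical helper.

```python
import ast


def compiles_as_single(source):
    """Return True if ``source`` compiles in interactive ('single') mode."""
    try:
        compile(source, "<console>", "single")
    except SyntaxError:
        return False
    return True


# Two simple statements, as a transformer that inserts code might produce.
tree = ast.parse("a = 1\nb = a + 1")
unparsed = ast.unparse(tree)             # one statement per line
joined = ";".join(unparsed.split("\n"))  # same statements on one line

namespace = {}
exec(compile(joined, "<console>", "single"), namespace)
```

Joining with ";" only works for simple statements; a compound statement such as a class or function definition spans several lines itself and would be broken by this join.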
Python applies NFKC normalization while parsing identifiers. This rules out some fancy Unicode identifiers like ℕ (it becomes N for Python), see e.g. this. Other languages that support Unicode identifiers usually lack this "feature" and/or use a different normalization, like Julia. E.g. the Scheme:
It's possible to "patch" this unfortunate feature with a transform_source-based transformation: parse the source into an AST, then "fix" the normalized identifiers, using lineno/col_offset/etc., turning them into something like N_1 instead of ℕ in the original source. This might look tricky, but I think it will fit nicely into your collection of examples: it combines AST parsing with some parsing of the original source string (i.e. with tokenize) to get the disallowed symbols back.
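The normalization described above is easy to observe with the standard library. The snippet is a small illustration, not part of the proposed example.

```python
import unicodedata

# ℕ (U+2115, DOUBLE-STRUCK CAPITAL N) has a compatibility mapping to N.
normalized = unicodedata.normalize("NFKC", "ℕ")

# Python normalizes identifiers when parsing, so this binds the name "N".
namespace = {}
exec("ℕ = 1", namespace)
bound_names = sorted(name for name in namespace if name != "__builtins__")
```

After running it, the namespace holds N rather than ℕ, which is why `ℕ = 1` followed by `N = 2` silently overwrites the first assignment in unmodified Python.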