Japanese kanji in CSV header automatically replaced with Chinese Pinyin #145
The application of `unidecode` to CSV headers replaces Japanese kanji with Chinese Pinyin. Why is `unidecode` being used here?
It's being used because these headers are used as postgres column names in the datastore database, which can be difficult to access for users not used to entering non-ascii text via the API. For previews it is possible to add the original names to the data dictionary, but that's an extra step right now. We should experiment with not converting these column headings, or only filtering specific characters known to cause problems. Have you tried removing this conversion to see if these column names cause any problems?
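A minimal sketch of that filtering alternative, assuming headers arrive as Python strings; `clean_header` and the exact character set stripped are illustrative, not xloader's actual code:

```python
import re

def clean_header(name: str, max_bytes: int = 63) -> str:
    """Hypothetical replacement for the unidecode step: keep the
    original (possibly non-ascii) header and strip only characters
    known to cause problems in postgres column names."""
    # Drop control characters, quotes, backslashes and percent signs,
    # but leave kanji and other letters untouched.
    name = re.sub(r'[\x00-\x1f"\'\\%]', '', name).strip()
    # Postgres truncates identifiers at 63 bytes, so truncate explicitly
    # on a UTF-8 boundary rather than letting postgres cut mid-character.
    return name.encode('utf-8')[:max_bytes].decode('utf-8', errors='ignore')

print(clean_header('緯度'))  # -> '緯度', unchanged (unidecode would romanize it)
```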
Thanks @wardi, yes I have experimented with removing `unidecode`.
what are some examples of this difficulty in access?
for the test resources we are using, there are no issues. However, there could be issues for other characters not yet encountered. Is there a test dataset that has been used previously (such as one that would have led to the decision to use `unidecode`)?
The difficulty I'm thinking of is for people without input methods set up to enter non-ascii characters reliably, and the ambiguity of some Unicode characters. It's not a blocker, just a possible impediment. If we update our installation docs to make sure that the databases are created with a UTF-8 encoding, then we should be able to use any language in the column names that postgres supports, and I think removing `unidecode` should be fine. For testing names we have to follow postgres restrictions for column naming and be aware of case-folding behavior.
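To illustrate the case-folding point, a short sketch using psycopg2 (the DSN and table names are placeholders, not xloader code): postgres folds unquoted identifiers to lowercase, while identifiers quoted via `sql.Identifier`, including UTF-8 ones, keep their exact spelling.

```python
import psycopg2
from psycopg2 import sql

conn = psycopg2.connect('dbname=datastore_test')  # placeholder DSN

with conn, conn.cursor() as cur:
    # Unquoted identifiers are case-folded: this column becomes "lat".
    cur.execute('CREATE TABLE t_unquoted (Lat numeric)')
    # Quoted identifiers are preserved exactly, including UTF-8,
    # so DDL for the datastore should always quote column names.
    cur.execute(sql.SQL('CREATE TABLE t_quoted ({} numeric)').format(
        sql.Identifier('緯度')))
```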
We can make a PR to make the `unidecode` conversion optional.

The documentation does mention and provide steps to use UTF-8 encoding, so that is taken care of.

afaik the postgres naming restrictions are already being followed, and the case-folding behavior won't change with the proposed PR.
@aimalkhan sounds good, but we'll need to check that the column name will be valid as a postgres column name (at a minimum no leading `_` and not the reserved `_id` column).

@jqnatividad you did some work on code that validated column names for datapusher+ recently, right? Maybe we can borrow that.

@aimalkhan let's also disable unidecode by default instead of enabling it, and include a note in the changes or readme about this.
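A minimal sketch of the kind of validation being discussed, assuming only the constraints named in this thread (leading `_`, the reserved `_id` column, postgres's 63-byte identifier limit, duplicate headers); `safe_column_name` is a hypothetical helper, not existing xloader code:

```python
def safe_column_name(name: str, seen: set) -> str:
    """Hypothetical helper enforcing the constraints discussed above."""
    # Leading underscores (and the special "_id" column) are
    # reserved by CKAN's datastore.
    name = name.strip().lstrip('_') or 'unnamed'
    # Postgres identifiers are limited to 63 bytes of UTF-8.
    name = name.encode('utf-8')[:63].decode('utf-8', errors='ignore')
    # Disambiguate duplicate headers with a numeric suffix.
    base, n = name, 1
    while name in seen:
        n += 1
        name = f'{base}_{n}'
    seen.add(name)
    return name

seen = set()
print([safe_column_name(h, seen) for h in ['緯度', '経度', '緯度', '_id']])
# -> ['緯度', '経度', '緯度_2', 'id']
```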
Yes @wardi, in DP+, especially since it supports spreadsheets, we found a lot of users were uploading worksheets with column header names that were not valid PostgreSQL column identifiers. That's why I created qsv's `safenames` command.

It guarantees that it will only create "safe names": it sanitizes column header names so they are valid Postgres identifiers, takes into account additional CKAN restrictions (columns that start with "_" and the special "_id" column), and ensures duplicate column names are handled as well.

Further, DP+ adds the original name of each column to the Data Dictionary as the column label. The labels are used automatically instead of the sanitized names when viewing the resource with the DataTables view - that way, you still get the original header names when viewing the dataset, but the "safe", machine-readable header names are the ones used by the Datastore API or when exporting the data.

While we're on the topic of aliases - DP+ also has an "auto-alias" option that automatically creates a human-readable resource alias.

@aimalkhan, if you're so inclined, I encourage you to try out DP+ as I'd love to get it exercised with more languages. I took care to make sure it supports UTF-8 properly, and I have some folks in Europe, China and Japan testing it and I'd love to see more.

FYI, while we were at OpenGov, we sponsored xloader, as the original datapusher was the largest source of post-CKAN installation issues. DP+ combines the best of datapusher and xloader, and I'd be interested in your feedback.
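For the label idea, here is roughly what the datastore fields could look like when the original headers are kept as Data Dictionary labels (the `info` dict is where CKAN's Data Dictionary stores them); the sanitized ids below are illustrative, and the `datastore_create` call is shown only as a comment:

```python
# Sanitized names as column ids, original headers as Data Dictionary labels.
fields = [
    {'id': 'ido', 'type': 'numeric', 'info': {'label': '緯度'}},
    {'id': 'keido', 'type': 'numeric', 'info': {'label': '経度'}},
]
# e.g. with ckanapi (the resource id is a placeholder):
# ckan.action.datastore_create(resource_id='...', fields=fields, force=True)
```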
@jqnatividad for the alphanumeric test are you using the unicode letter and number categories?
@wardi I have no specific test for those unicode categories, but since Rust strings are UTF-8, I "think" it's something we get for free. Do you have specific test cases I can try, to check support for those categories?
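If it helps generate cases, a quick way to test those categories from Python using the standard unicodedata module (the sample strings are arbitrary):

```python
import unicodedata

def is_unicode_alnum(s: str) -> bool:
    """True when every character falls in a Letter (L*) or Number (N*)
    Unicode general category."""
    return all(unicodedata.category(c)[0] in ('L', 'N') for c in s)

for sample in ['緯度', 'lat', '٣٤٥', 'a-b']:
    print(sample, is_unicode_alnum(sample))  # 'a-b' is False ('-' is Pd)
```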
When I uploaded a CSV file to a datastore via XLoader, I noticed that all the Japanese kanji in the CSV header were replaced with Chinese Pinyin.
For example, when XLoader loads such a CSV file, we get this table.
Note: the pronunciation of 緯度 is /ido/ and the pronunciation of 経度 is /ke:do/ in Japanese, so you can see these are quite far from the Chinese readings. But I want to get a table like this, so those headers should not be replaced.
I know `unidecode` does that in this function, but I don't have a good idea of how to avoid encoding the kanji in a CSV header.
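For reference, the reported behavior can be reproduced with unidecode directly; the outputs in the comments are approximate, since unidecode romanizes CJK characters via Mandarin readings:

```python
from unidecode import unidecode

print(unidecode('緯度'))  # roughly 'Wei Du ' - a Mandarin reading, not /ido/
print(unidecode('経度'))  # roughly 'Jing Du ' - not the Japanese /ke:do/
```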