Introduce more GREL expressions #175

Open
bencomp opened this issue Sep 20, 2023 · 0 comments
How could the content be improved?

This is a suggestion from #29 to incorporate more about the General Refine Expression Language (GREL) in the lesson.
Several specific suggestions have been made in response, but no consensus has been reached on which of them should be implemented.

Discussion from #29

The original discussion from the 2018 CAC included:

The Library OpenRefine uses a lot of GREL expressions, reformatting dates, reformatting names. People say that’s where they tend to spend a lot of their time.

In my first comment on the issue, I wrote:

More GREL: yes! GREL is of course the way to transform the data. I wonder if GREL should be introduced with simpler examples than .replace on strings, like incrementing numbers (value + 1) or combining strings to create URLs ("https://example.org/" + value).
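
A minimal sketch of those two simpler starting points. Each line is a separate expression for the transform dialog; GREL has no comment syntax, so the notes are given here rather than inside the block. The `toNumber()` call is an addition that assumes the column was imported as text, in which case `+` would concatenate instead of adding. The example.org URL is the placeholder from the comment above, not a real endpoint.

```
value.toNumber() + 1
"https://example.org/" + value
```

Starting with single-operator expressions like these keeps the first exposure to GREL free of function chaining.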

@ostephens responded (copied from a Slack discussion) to my question in #29 (comment):

What kind of GREL expressions should be added to the lesson?

Assuming the current dataset, the cells containing lists look good for some GREL examples. For instance, the "items_owned" column can be manipulated using GREL to give a count of the most common items that are owned (mobile phones and radios just ahead of ploughs).
The current format of those lists makes the GREL needed to get a clean list slightly complicated. Done correctly, I think a series of steps that goes through the process of 'cleaning' this column could provide a really good set of learning materials. One of the great things about OpenRefine is the ability to get real-time feedback on changes as you work with the data.
On the other hand, if a more accessible example is needed, the dataset could be updated to simplify those lists to be just semicolon-separated, which would make the process much simpler.
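
The 'cleaning' steps described above might look roughly like the sketch below. It assumes the "items_owned" cells are bracketed, single-quoted and semicolon-separated (something like `['bicycle' ; 'radio']`); that format is an assumption here and should be checked against the actual data. Each line is a separate transform to apply in turn, since GREL has no comment syntax: the first strips the brackets and quotes, the second trims whitespace around each item, and the third returns the number of items in a cell.

```
value.replace("[", "").replace("]", "").replace("'", "")
forEach(value.split(";"), v, v.trim()).join(";")
value.split(";").length()
```

After the first two steps, "Edit cells > Split multi-valued cells" with `;` as the separator, followed by a text facet, should give the per-item counts. Applying the steps one at a time lets learners watch the preview change at each stage.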

Another GREL example that would work with the current dataset would be the formatting of the "interview_date" column which is currently in dd-MMM-yyyy (vs the start and end columns which use ISO-8601). So something like:
value.toDate("dd-MMM-yy").toString("yyyy-MM-dd")
could provide a good example and an opportunity to talk more generally about date manipulation in OpenRefine (I would have guessed that date issues come up commonly in social science datasets, but I may be wrong, as this is not my area).

@ndporter suggested in #29 (comment):

One comment from teaching this recently with the list of items column - the lesson uses GREL to facet by subsets of the column but doesn't demonstrate how to change that column to something more usable (such as dummy variables for each category of item once they're cleaned). As a bonus, parsing it to columns also highlights for learners the difference between cell transforms, multi-valued cell splits, and column splits.

All of that said, adding more GREL is also tricky when learners don't have programming experience because chaining functions can rapidly become confusing to novice coders.
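
One unchained way to sketch the dummy-variable idea, for a single item, is "Edit column > Add column based on this column" on the items column with an expression like the one below. The item name "radio" is illustrative only, and `contains` does a substring match, so an item whose name appears inside another item's name would be matched too. The sketch also assumes the cell still holds the whole list as one string.

```
if(value.contains("radio"), 1, 0)
```

This needs one new column per item, so it becomes repetitive for long item lists.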

To which I responded in #29 (comment):

As to your comment, @ndporter: the idea of using OpenRefine to create dummy variables from the items column had not yet crossed my mind. I like it. After trying and going through the manual and StackOverflow for a little bit, I think it is doable, but not in this workshop. It requires exporting the ID and items columns, doing the transformation in a new project and then importing the new columns (crossing them one by one, potentially) into the project. That is madness. Perhaps there are easier ways using column splitting, but I guess the current exercise of splitting to count is good enough. I'm open to other suggestions for introducing more GREL.
