Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add naming spike #18

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Add naming spike #18

wants to merge 1 commit into from

Conversation

iHiD
Copy link
Member

@iHiD iHiD commented Apr 10, 2021

This is a messy silly spike. The code is not important, and will never get merged, but the idea is interesting.

Also - this is entirely not urgent or important, but as writing CSS 24/7 is killing me, and its important to be confident of our specs before launch, I felt like spending an hour or two reasoning about this using code


This explores the idea of rewriting a solution to standardise variable naming to allow for deeper normalisation.

Problem

It is essential that when a mentor gives feedback on a representer solution that they are referencing identifiers (e.g. variable names) that can be replaced with other variable names in other solutions.

Right now the fact that someone can reuse an identifier (e.g. variable name) in different contexts is problematic for us.

Take these three submissions:

# Submission 1
def add_2(a); a + 2; end
def add_3(a); a + 3; end
# Submission 2
def add_2(b); b + 2; end
def add_3(b); b + 3; end
# Submission 3
def add_2(a); a + 2; end
def add_3(b); b + 3; end

If we replace a in the first solution with PLACEHOLDER_1 and b in the second with PLACEHOLDER_1 we can see that they are identical. However in submission 3 because both a and b are used, this naive replacement means they get different representations.

Possible Solution

Rather than relying on the identifiers given by the student we could instead rename the variables ourselves internally. This would give us something like this:

# Normalised
def add_2(add_2_0); add_2_0 + 2; end
def add_3(add_3_0); add_3_0 + 3; end

All three submissions are normalised in the same way so their representations are identical. This branch achieves this (see test/naming_normalizers/method_locals.rb for tests):

This still gives us a problem though that we show the original code to the mentor to give feedback on. If they gave feedback on submission 3, this would be fine, but on submissions 1 or 2, the a or b would be ambiguous. The rule to guard against this is that all mapping.json files must be reversible (ie we cannot have identical keys or values).

To fix this, we could return our normalised solution as a new normalized.rb file with a mapping against this file, and show this normalised code to the mentor (possibly with a tab to see original code too in case the modified identifiers make it hard to read). Regardless of which of the three submissions are used, things would then map correctly.


  • Note 1: This spike only works with very limited situation atm (local variables inside methods). However, I think this acutally fixes the vast majority of problems.
  • Note 2: Choosing readable identifiers will be important
  • Note 3: Rewriting code is relatively easy in Ruby and in many other languages. However, there are some that it would be hard in. For those, as long as we can get line/character numbers (which seems to be pretty universal) then we can text-replace things, making this also feasible.
  • Note 4: If we do this, it'll likely be post-v3 (I might try and do it before because its only a couple of hours work on the website-end and because updating representers comes with pain for mentors who need to then recheck their feedback is correct, but I'll see how things play out)
  • Note 5: This will be an enhancement to the existing spec. Old representers will continue to work.
  • Note 6: This is basically impossible to do accurately in Ruby. The moment someone uses certain metaprogramming techniques, this will fall apart. We need to determine if those techniques mean we may give out the wrong feedback, or if they just always make their solutions unique. We can mitigate against either by guarding against certain techniques (such as send and define_method and either falling back to a more naive implementation or returning an error on those.
  • Note 7: Thanks to @mpizenberg and @SleeplessByte for the conversations that lead me to spike this.

@iHiD iHiD requested a review from a team as a code owner April 10, 2021 23:33
@mpizenberg
Copy link
Member

I find the concept of representers really interesting because there is so much space between "human interpretation" and "machine interpretation" of one code. Different local names like in this PR is one such space freedom. But the general problem of computing representation is where do we place the cursor between "exactly the same code" and "exactly the same code execution".

The more we say that a normalization is fine if it results in the same code execution, the more liberty we have to change things and thus to have more generic representations. This may be interesting in the case of changes that have combinatorial effects on the number of representations. One such thing is order. May it be order of declarations, variables, etc. For example:

# Submission 1
def add_2(a); a + 2; end
def add_3(a); a + 3; end
# Submission 2
def add_3(a); a + 3; end
def add_2(a); a + 2; end

Those are also fundamentally the same.

I believe each (valid) normalization brings us closer to "same code execution" and further from the original solution. I'm wondering if tackling this problem could be done with some kind of representation levels. Let's say we have three levels in the ruby representer for example.

  1. level 1: remove comments, normalize spaces and indentation
  2. level 2: normalize names in a strictly reversible way (there is a bijection), normalize order of definitions (in a correct way)
  3. level 3: normalization of names with local scope (not reversible anymore)

Let's say an exercise has 1000 solutions. Maybe there will have 200 different level 1 representations, 50 different level 2 and 10 different level 3 representations. So provided adequate feedback on level 3 representations could reach a lot of people, while providing feedback on level 1 would mean a lot more personalized feedback.

Levels 3 (or whatever higher representation) could also be interesting to analyze the global patterns used to solve an exercise so might be interesting for grouping solutions by categories when browsing solutions of other people.

That's just my 2 cents on the problem of representations ^^.

@joshiraez
Copy link

joshiraez commented Apr 12, 2021

I'm not sure if I follow. We need the represented for two things, right:

  • Normalization to provide general insights on the code
  • Readability so the mentor can tackle it easily

And I'm not sure how the represented is built, but I kinda thought that it returned a Data Structure consisting of tokens. Then that data structure is parsed to whatever the current objective is at that time.

If that's the case, why not store the identifiers in a scope table, and return the token in the form of a pair [PLACEHOLDER_ID, NAME] with the PLACEHOLDER_ID the placeholder given for that context, and the NAME the name the var had before?

I think it goes in the line of the idea of having a step where we can do bijection mappings, but what I'm lost is why we have to do these kind of decisions before presentation - we can use these pair tokens for anything but showing them to the mentor, and there the conversion should be fairly simple. I'm talking from the unknown here, because I don't know exactly what the represented is or what its features are, but rather what I'd do with this kind of problem in a general way

As an example the given code would become this.

Submission 1

def add_2([ID_1, 'a']); [ID_1, 'a'] + 2; end  
def add_3([ID_1, 'a']); [ID_2, 'a'] + 3; end  

Submission 2

def add_2([ID_1, 'b']); [ID_1, 'b'] + 2; end 
def add_3([ID_1, 'b']); [ID_1, 'b'] + 3; end

Submission 3

def add_2([ID_1, 'a']); [ID_1, 'a'] + 2; end
def add_3([ID_1, 'b']); [ID_1, 'b'] + 3; end 

We could even make them valid string identifiers: [ID_1, 'a'] could become ID_1___a, I'm unsure how this is actually implemented.

@ErikSchierboom
Copy link
Member

And I'm not sure how the represented is built

See https://github.com/exercism/docs/blob/main/building/tooling/representers/interface.md for a description.

If that's the case, why not store the identifiers in a scope table, and return the token in the form of a pair [PLACEHOLDER_ID, NAME] with the PLACEHOLDER_ID the placeholder given for that context, and the NAME the name the var had before?

We store the name mapping in a mapping.json file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants