platform(actions): create automated action to generate emoji.json inline with unicode.org #226

Open
wax911 opened this issue May 23, 2024 · 3 comments

Comments

@wax911
Member

wax911 commented May 23, 2024

AniTrend Issue Guidelines

Before opening a new issue, please take a moment to review our community guidelines to make the contribution process easy and effective for everyone involved.

You may find an answer in already closed issues:
https://github.com/AniTrend/android-emojify/issues?q=is%3Aissue+is%3Aclosed

Feature Information

https://home.unicode.org/ regularly publishes new Unicode versions; see https://unicode.org/Public/emoji/, which lists every released version. Ideally, we'd generate an emoji.json file from the latest Unicode standard and open an automated PR for it.

Solution Information

  1. Fetch Unicode Emoji Test Data:

    • The script retrieves data from https://unicode.org/Public/emoji/latest/emoji-test.txt.
    • Extracts the latest Unicode Emoji version from the file's header.
  2. Fetch Emojibase Data:

    • The script constructs the URL https://cdn.jsdelivr.net/npm/emojibase-data@<latest_version>/en/data.json.
    • Downloads and parses the Emojibase data to create a mapping of emoji characters to their shortcodes (aliases).
  3. Process Emoji Data:

    • Extracts relevant details from the Unicode Emoji test file, such as description, status, version, and Unicode representations.
    • Maps aliases from the Emojibase data to each emoji. If no aliases are found, they are generated from the emoji’s description.
    • Filters out emoji that are not fully-qualified.
  4. Save to File:

    • Outputs the processed emoji data as a JSON file (emoji.json).
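
A minimal sketch of the parsing steps above, assuming the emoji-test.txt layout of `<code points> ; <status> # <emoji> E<version> <name>` and a `# Version:` header line. The function names and the `aliases_by_emoji` mapping are hypothetical; the mapping stands in for whatever is built from the Emojibase data, and downloading either file is left out on purpose.

```python
import re

# Header line such as "# Version: 15.1"
VERSION_RE = re.compile(r"^#\s*Version:\s*(\d+(?:\.\d+)*)", re.MULTILINE)
# Data line: "<code points> ; <status> # <emoji> E<version> <name>"
LINE_RE = re.compile(
    r"^(?P<codes>[0-9A-F ]+?)\s*;\s*(?P<status>[a-z-]+)\s*#\s*"
    r"(?P<emoji>\S+)\s+E(?P<version>\d+\.\d+)\s+(?P<name>.+)$"
)


def extract_version(text):
    """Return the version string from the file header, or None if absent."""
    match = VERSION_RE.search(text)
    return match.group(1) if match else None


def fallback_alias(name):
    """Derive an alias from a description: 'grinning face' -> 'grinning_face'."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def parse_emoji_test(text, aliases_by_emoji=None):
    """Keep fully-qualified entries only and attach aliases to each one."""
    aliases_by_emoji = aliases_by_emoji or {}  # hypothetical: emoji -> [aliases]
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or match["status"] != "fully-qualified":
            continue  # comments, blank lines, and other statuses are skipped
        emoji, name = match["emoji"], match["name"]
        entries.append({
            "emoji": emoji,
            "description": name,
            "version": match["version"],
            "unicode": match["codes"].split(),
            # Fall back to an alias generated from the description
            "aliases": aliases_by_emoji.get(emoji) or [fallback_alias(name)],
        })
    return entries
```

Writing the result with `json.dump(entries, fh, ensure_ascii=False, indent=2)` would cover the final "Save to File" step.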

Proposed GitHub Workflow

To automate the process of generating and submitting a PR for emoji.json:

  1. Schedule Workflow:

    • Configure a GitHub Actions workflow to run weekly or on-demand.
  2. Fetch and Compare Emoji Versions:

    • Use the script to fetch the latest Unicode Emoji version.
    • Compare it with the currently processed version in the repository.
  3. Generate Updated emoji.json:

    • If a new version is available, run the script to generate an updated emoji.json.
  4. Automate PR Creation:

    • Commit the updated emoji.json.
    • Push changes to a new branch and open a PR with a description of the changes.
  5. Notify Reviewers:

    • Tag maintainers or reviewers to ensure the PR is reviewed and merged.
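
The "Fetch and Compare Emoji Versions" step could be sketched as below. This is only an illustration: `parse_version` and `needs_update` are hypothetical helpers, and where the currently processed version is stored in the repository is not decided in this issue.

```python
def parse_version(version):
    """Turn '15.1' into (15, 1) so versions compare numerically, not as text."""
    parts = [int(part) for part in version.strip().split(".")]
    # Drop trailing zeros so that '16' and '16.0' compare as equal
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def needs_update(latest, current):
    """Return True when `latest` is strictly newer than `current`.

    `current` is None when emoji.json has never been generated.
    """
    if current is None:
        return True
    return parse_version(latest) > parse_version(current)
```

Comparing tuples of integers avoids the string-comparison trap where "15.10" would sort before "15.9".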

Example GitHub Action

name: Update Emoji JSON

on:
  schedule:
    - cron: "0 0 * * 0"  # Run weekly
  workflow_dispatch:

jobs:
  update-emoji-json:
    runs-on: ubuntu-latest

    steps:
    - name: Checkout Repository
      uses: actions/checkout@v3

    - name: Setup Python
      uses: actions/setup-python@v4
      with:
        python-version: "3.9"

    - name: Install Dependencies
      run: |
        python -m pip install --upgrade pip
        pip install requests

    - name: Run Emoji Update Script
      run: python app.py

    # The action commits the change itself and does nothing when emoji.json is unchanged.
    # Committing on a manually created branch with the same name as `branch` makes it fail,
    # because the PR base and branch would then be identical.
    - name: Create Pull Request
      uses: peter-evans/create-pull-request@v4
      with:
        add-paths: emoji.json
        commit-message: "chore: update emoji.json to latest Unicode version"
        branch: update-emoji-json
        title: "Update emoji.json to latest Unicode version"
        body: "This PR updates emoji.json to the latest Unicode standard."

Additional Context

TBA

@DanyloOliinykSSG

As a continuation of #293:

Would you like to keep the current JSON structure? If so, we need a script that parses https://unicode.org/Public/emoji/latest/emoji-sequences.txt into JSON, right?

Is this the JSON structure the library expects?

{
  "emojiChar": "string",           // The emoji character
  "emoji": "string",                   // Probably we can remove this? As it duplicates emojiChar
  "description": "string",         // Description of the emoji
  "aliases": ["string", ...],        // Alternative names for the emoji
  "tags": ["string", ...],            // Associated tags for categorization
  "unicode": "string",              // Unicode representation of the emoji
  "htmlDec": "string",             // HTML decimal code
  "htmlHex": "string",             // HTML hexadecimal code
  "supports_fitzpatrick": boolean  // (Optional) Indicates Fitzpatrick scale support for skin tones
}

Here's a **draft** script for parsing the Unicode data into JSON (via ChatGPT, so it might not be accurate):

#!/usr/bin/env python3

import json
import re

def parse_emoji_sequences(file_path):
    emoji_data = []
    pattern = re.compile(r"^(?P<code_points>[0-9A-F\s.]+)\s+;\s+(?P<type_field>[A-Za-z_]+)\s+;\s+(?P<description>.+?)(\s+#\s+(?P<comments>.+))?$")

    def unicode_to_char(code_points):
        """Convert Unicode code points to a single character."""
        return "".join(chr(int(cp, 16)) for cp in code_points)

    def generate_html_codes(code_points):
        """Generate HTML decimal and hexadecimal codes."""
        html_dec = "".join(f"&#{int(cp, 16)};" for cp in code_points)
        html_hex = "".join(f"&#x{cp};" for cp in code_points)
        return html_dec, html_hex

    def expand_range(code_range):
        """Expand a range of Unicode code points (e.g., 1F411..1F412)."""
        if ".." in code_range:
            start, end = code_range.split("..")
            return [f"{i:04X}" for i in range(int(start, 16), int(end, 16) + 1)]
        return [code_range]

    def split_description(description, expanded_codes):
        """
        Split descriptions for emoji ranges.
        If the description contains "..", split it into parts.
        Otherwise, repeat the description for all expanded codes.
        """
        if ".." in description:
            split_desc = description.split("..")
            # Ensure the number of descriptions matches the number of expanded codes
            if len(split_desc) == len(expanded_codes):
                return split_desc
            else:
                # Pad with the last description if lengths don't match
                return split_desc + [split_desc[-1]] * (len(expanded_codes) - len(split_desc))
        return [description] * len(expanded_codes)  # Repeat description for each code point

    def check_fitzpatrick_support(code_points):
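        """Check whether the sequence contains a Fitzpatrick skin tone modifier."""
        # Simplified check: modifiers are in the range 1F3FB-1F3FF. This flags sequences that
        # already include a modifier; it does not tell whether a base emoji accepts one.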
        return any(0x1F3FB <= int(cp, 16) <= 0x1F3FF for cp in code_points)

    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # Skip empty lines and comments
            match = pattern.match(line)
            if match:
                fields = match.groupdict()
                code_field = fields["code_points"].strip()
                if ".." in code_field:
                    # Range (e.g. 231A..231B): one single-code-point emoji per value
                    sequences = [[code] for code in expand_range(code_field)]
                else:
                    # Sequence (e.g. 0023 FE0F 20E3): all code points form one emoji
                    sequences = [code_field.split()]
                descriptions = split_description(fields["description"], sequences)
                for i, code_points in enumerate(sequences):
                    emoji_char = unicode_to_char(code_points)
                    html_dec, html_hex = generate_html_codes(code_points)
                    supports_fitzpatrick = check_fitzpatrick_support(code_points)

                    emoji_entry = {
                        "emojiChar": emoji_char,
                        "emoji": emoji_char,
                        "description": descriptions[i],
                        "aliases": [],  # No aliases provided in the file, keep empty
                        "tags": [],     # No tags provided in the file, keep empty
                        "unicode": " ".join(code_points),
                        "htmlDec": html_dec,
                        "htmlHex": html_hex,
                        "supports_fitzpatrick": supports_fitzpatrick
                    }
                    emoji_data.append(emoji_entry)

    return emoji_data

def save_to_json(data, output_path):
    with open(output_path, 'w', encoding='utf-8') as json_file:
        json.dump(data, json_file, indent=4, ensure_ascii=False)

# Input and output file paths
input_file_path = 'emoji-sequences.txt'
output_file_path = 'parsed_emoji_sequences.json'

# Parse and save
parsed_data = parse_emoji_sequences(input_file_path)
save_to_json(parsed_data, output_file_path)

print(f"Parsed emoji sequences saved to {output_file_path}")

For now I've manually downloaded https://unicode.org/Public/emoji/latest/emoji-sequences.txt and run the script on it. We can update the script later so that it downloads the file automatically before parsing.

One issue: Unicode's emoji-sequences.txt doesn't contain aliases or tags. Any ideas where we can get those (if they're important for the lib)?

I've attached the raw emoji-sequences.txt and the resulting parsed_emoji_sequences.json as an example:
emoji-sequences.txt
parsed_emoji_sequences.json

@wax911
Member Author

wax911 commented Dec 12, 2024

Amazing, will circle back over the weekend. I don't mind changing the structure of the JSON file; we'll just release a major version, which is a small price to pay if it means we can keep up with the latest emoji standard.

@wax911
Member Author

wax911 commented Dec 15, 2024

@DanyloOliinykSSG I was planning on using a secondary data source, https://cdn.jsdelivr.net/npm/emojibase-data@latest/en/data.json, to obtain additional metadata such as tags and aliases. The only limitation with both approaches is that the tags and aliases may differ from the current ones when we release this new version.
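
A rough sketch of how tags from the secondary source could be merged into the parsed entries. It assumes each Emojibase record exposes `emoji` and `tags` keys, which should be checked against the actual data.json; `merge_tags` is a hypothetical helper, and it works on already-loaded JSON rather than fetching anything.

```python
def merge_tags(entries, emojibase_entries):
    """Return copies of `entries` with tags filled in from Emojibase records.

    Assumes each Emojibase record has "emoji" and "tags" keys (unverified).
    Entries without a matching record keep their existing tags.
    """
    tags_by_emoji = {
        item["emoji"]: item.get("tags", [])
        for item in emojibase_entries
        if "emoji" in item
    }
    merged = []
    for entry in entries:
        updated = dict(entry)  # avoid mutating the caller's data
        existing = entry.get("tags", [])
        updated["tags"] = list(tags_by_emoji.get(entry["emojiChar"], existing))
        merged.append(updated)
    return merged
```

Matching on the emoji string is the simplest key, but the two sources may disagree on variation selectors, so a lookup on normalised code points might turn out to be more reliable.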

Some insights into supports_fitzpatrick, which I can infer from the emoji-base package: it exists to shrink the total size of the emoji.json file by leaving out the skin-tone variants of emojis, since we can apply the modifiers on demand without replicating those emojis.
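
To illustrate the on-demand part, here is a simplified sketch that applies one of the five skin-tone modifiers (U+1F3FB to U+1F3FF) to a base emoji. `apply_skin_tone` is a hypothetical helper, not the library's API, and it only modifies the first code point, so multi-person ZWJ sequences would need more care.

```python
# Emoji modifiers, ordered from light (index 0) to dark (index 4)
FITZPATRICK_MODIFIERS = [
    "\U0001F3FB", "\U0001F3FC", "\U0001F3FD", "\U0001F3FE", "\U0001F3FF",
]
VARIATION_SELECTOR_16 = "\uFE0F"


def apply_skin_tone(base_emoji, tone_index):
    """Insert a skin-tone modifier right after the first code point."""
    if not base_emoji:
        raise ValueError("base_emoji must not be empty")
    if not 0 <= tone_index < len(FITZPATRICK_MODIFIERS):
        raise ValueError("tone_index must be between 0 and 4")
    head, rest = base_emoji[0], base_emoji[1:]
    # A modifier takes the place of the emoji presentation selector, if present
    if rest.startswith(VARIATION_SELECTOR_16):
        rest = rest[1:]
    return head + FITZPATRICK_MODIFIERS[tone_index] + rest
```

With this approach emoji.json only needs the base emoji plus the supports_fitzpatrick flag, and the five variants are derived when requested.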

An additional improvement would be adding a supports_gender flag, which we could use for the same purpose as supports_fitzpatrick, but this is a nice-to-have.

Projects
Status: Todo
Development

No branches or pull requests

2 participants