Non-ASCII command-line arguments are mangled on Windows #11558

Closed
HertzDevil opened this issue Dec 9, 2021 · 2 comments · Fixed by #11564
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. platform:windows Windows support based on the MSVC toolchain / Win32 API topic:stdlib:system topic:stdlib:text

HertzDevil commented Dec 9, 2021

If you try to pass any non-ASCII command-line arguments to a Crystal program built on Windows, they are transcoded to Windows-1252:

> type test.cr
p ARGV

> crystal build test.cr

> test €×‽😂
["\x80\xD7???"]
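
One plausible reading of that output, sketched in portable C: if each UTF-16 code unit of the argument is narrowed to Windows-1252 independently, `€` becomes 0x80, `×` becomes 0xD7, and `‽` plus the two surrogate halves of `😂` each become `?`, which gives exactly three question marks. The helper names below are hypothetical, only the euro sign from the 0x80-0x9F block is mapped, and the real Windows conversion also applies best-fit substitutions that this sketch ignores.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: narrow one UTF-16 code unit to a Windows-1252 byte.
   Anything not representable (including lone surrogate halves) becomes '?'. */
static unsigned char cp1252_from_utf16_unit(uint16_t unit) {
    if (unit < 0x80) return (unsigned char)unit;                  /* ASCII */
    if (unit >= 0xA0 && unit <= 0xFF) return (unsigned char)unit; /* same as Latin-1 */
    if (unit == 0x20AC) return 0x80;                              /* euro sign */
    return '?';
}

/* Hypothetical helper: narrow `len` code units into `dst` (one byte per unit).
   Returns the number of bytes written. */
size_t narrow_to_cp1252(const uint16_t *src, size_t len, unsigned char *dst) {
    for (size_t i = 0; i < len; ++i)
        dst[i] = cp1252_from_utf16_unit(src[i]);
    return len;
}
```

Feeding it the code units 0x20AC, 0x00D7, 0x203D, 0xD83D, 0xDE02 (the UTF-16 form of `€×‽😂`) yields the bytes 0x80, 0xD7, `?`, `?`, `?`.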

This happens even before the entry point is called, regardless of the current console codepage. A solution is to use the wide entry point wmain on Windows only, and convert the command-line arguments back to UTF-8 before executing any top-level code:

# src/crystal/main.cr

module Crystal
  # *argv*'s type restriction changed
  def self.main(argc : Int32, argv : UInt8** | UInt16**)
    # same body as before
  end

  # *argv*'s type restriction changed
  def self.main_user_code(argc : Int32, argv : UInt8** | UInt16**)
    # at this point the GC has been initialized so we can do this
    if argv.is_a?(UInt16**)
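      # String.from_utf16(Pointer) returns {String, Pointer(UInt16)};
      # [0] keeps the decoded UTF-8 String, whose buffer becomes the new argv[i]
      argv = Slice.new(argc) do |i|
        String.from_utf16(argv[i])[0].to_unsafe
      end.to_unsafe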
    end
    LibCrystalMain.__crystal_main(argc, argv)
  end
end

{% if flag?(:win32) %}
  fun wmain(argc : Int32, argv : UInt16**) : Int32
    Crystal.main(argc, argv)
  end
{% else %}
  fun main(argc : Int32, argv : UInt8**) : Int32
    Crystal.main(argc, argv)
  end
{% end %}

We cannot simply call GetCommandLineW / CommandLineToArgvW and ignore argc and argv in Crystal.main, because that method might be captured and passed somewhere (apparently this use case is publicly documented). But we may be able to do this in fun main, without the standard library:

lib LibC
  CP_UTF8 = 65001

  fun GetCommandLineW : LPWSTR
  fun CommandLineToArgvW(lpCmdLine : LPWSTR, pNumArgs : Int*) : LPWSTR*
  fun WideCharToMultiByte(
    codePage : DWORD, dwFlags : DWORD, lpWideCharStr : WCHAR*,
    cchWideChar : Int, lpMultiByteStr : LPSTR, cbMultiByte : Int,
    lpDefaultChar : CHAR*, lpUsedDefaultChar : BOOL*
  ) : Int
  fun LocalFree(hMem : Void*) : Void*
end

fun main(argc_ : Int32, argv_ : UInt8**) : Int32
  if utf16_argv = LibC.CommandLineToArgvW(LibC.GetCommandLineW, out argc)
    argv = LibC.malloc(sizeof(UInt8*) * argc).as(UInt8**)
    argc.times do |i|
      utf8_size = LibC.WideCharToMultiByte(LibC::CP_UTF8, 0, utf16_argv[i], -1, nil, 0, nil, nil)
      argv[i] = LibC.malloc(utf8_size).as(UInt8*)
      LibC.WideCharToMultiByte(LibC::CP_UTF8, 0, utf16_argv[i], -1, argv[i], utf8_size, nil, nil)
    end

    status = Crystal.main(argc, argv)

    argc.times do |i|
      LibC.free(argv[i])
    end
    LibC.free(argv)
    LibC.LocalFree(utf16_argv)

    status
  else
    Crystal::System.print_error "Failed to parse command-line arguments!\n"
    1
  end
end
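
The snippet above calls WideCharToMultiByte twice per argument: once with a null output buffer to learn the UTF-8 size (terminator included, because the input length is -1), and once to fill the allocated buffer. A minimal portable sketch of that size-then-fill pattern follows. `utf16z_to_utf8` is a hypothetical stand-in, not a Win32 API, and replacing unpaired surrogates with U+FFFD is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for WideCharToMultiByte(CP_UTF8, 0, src, -1, dst, cap, ...).
   Converts a NUL-terminated UTF-16 string to UTF-8. Returns the byte count
   including the terminator. With dst == NULL only the size is computed.
   Returns 0 if dst is too small. */
size_t utf16z_to_utf8(const uint16_t *src, char *dst, size_t cap) {
    size_t n = 0;
    for (;;) {
        uint32_t cp = *src++;
        if (cp >= 0xD800 && cp <= 0xDBFF && *src >= 0xDC00 && *src <= 0xDFFF) {
            /* Combine a surrogate pair into one code point. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(*src++ - 0xDC00);
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; /* unpaired surrogate (assumption: replace it) */
        }

        unsigned char buf[4];
        size_t len;
        if (cp < 0x80) {
            buf[0] = (unsigned char)cp;
            len = 1;
        } else if (cp < 0x800) {
            buf[0] = (unsigned char)(0xC0 | (cp >> 6));
            buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
            len = 2;
        } else if (cp < 0x10000) {
            buf[0] = (unsigned char)(0xE0 | (cp >> 12));
            buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
            len = 3;
        } else {
            buf[0] = (unsigned char)(0xF0 | (cp >> 18));
            buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
            len = 4;
        }

        if (dst != NULL) {
            if (n + len > cap) return 0; /* buffer too small */
            for (size_t i = 0; i < len; ++i)
                dst[n + i] = (char)buf[i];
        }
        n += len;
        if (cp == 0) return n; /* terminator has been counted and written */
    }
}
```

A caller would first pass `NULL` to get the size, allocate that many bytes, then call again with the buffer, mirroring the two `LibC.WideCharToMultiByte` calls per argument above.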
@HertzDevil HertzDevil added kind:bug A bug in the code. Does not apply to documentation, specs, etc. platform:windows Windows support based on the MSVC toolchain / Win32 API topic:stdlib:system topic:stdlib:text labels Dec 9, 2021
HertzDevil commented Dec 9, 2021

Also consider the possibility of writing wmain in C while keeping Crystal's fun main:

> cl /c /MT main.c
> crystal build test.cr

where main.c is defined as:

#include <stdlib.h>   /* malloc, free */
#include <windows.h>  /* WideCharToMultiByte, CP_UTF8 */

extern int main(int argc, char *argv[]);

int __stdcall wmain(int argc, wchar_t *utf16_argv[]) {
    char **argv = malloc(sizeof(char *) * argc);
    for (int i = 0; i < argc; ++i) {
        int utf8_size = WideCharToMultiByte(CP_UTF8, 0, utf16_argv[i], -1, NULL, 0, NULL, NULL);
        argv[i] = malloc(utf8_size);
        WideCharToMultiByte(CP_UTF8, 0, utf16_argv[i], -1, argv[i], utf8_size, NULL, NULL);
    }

    int status = main(argc, argv);

    for (int i = 0; i < argc; ++i)
        free(argv[i]);
    free(argv);

    return status;
}

We then force Crystal to link this object file. The exact directory might differ depending on how we choose to distribute the file, but it would most likely stay somewhere within Crystal::LIBRARY_PATH, and we could search for it the same way as the recently added openssl_VERSION:

# src/crystal/main.cr
# ditto for src/empty.cr

{% if flag?(:win32) %}
  @[Link(ldflags: "#{__DIR__}/../../main.obj /ENTRY:wmainCRTStartup")]
{% end %}
lib LibCrystalMain
  @[Raises]
  fun __crystal_main(argc : Int32, argv : UInt8**)
end

Such a shim written in C or C++ might be necessary in the future because it allows us to catch access violations / stack overflows with a SEH exception handler; to date LLVM doesn't provide any SEH intrinsics.

If we don't need SEH, of course defining both main and wmain in Crystal is doable too:

{% if flag?(:win32) %}
  @[Link(ldflags: "/ENTRY:wmainCRTStartup")]
  lib LibCrystalMain
    @[Raises]
    fun __crystal_main(argc : Int32, argv : UInt8**)
  end

  fun wmain(argc : Int32, utf16_argv : UInt16**) : Int32
    argv = LibC.malloc(sizeof(UInt8*) * argc).as(UInt8**)
    argc.times do |i|
      utf8_size = LibC.WideCharToMultiByte(LibC::CP_UTF8, 0, utf16_argv[i], -1, nil, 0, nil, nil)
      argv[i] = LibC.malloc(utf8_size).as(UInt8*)
      LibC.WideCharToMultiByte(LibC::CP_UTF8, 0, utf16_argv[i], -1, argv[i], utf8_size, nil, nil)
    end

    status = main(argc, argv)

    argc.times do |i|
      LibC.free(argv[i])
    end
    LibC.free(argv)

    status
  end
{% end %}

@HertzDevil HertzDevil changed the title UTF-8 command-line arguments are mangled on Windows Non-ASCII command-line arguments are mangled on Windows Dec 9, 2021
HertzDevil commented Dec 10, 2021

We do not need a C shim since I have confirmed that vectored exception handlers written in Crystal will work: #11570
