I'm building a command-line tool in Lua, and users may call my script with UTF-8 arguments.

Programming in Lua 4th edition says:

Several things in Lua “just work” for UTF-8 strings. Because Lua is 8-bit clean, it can read, write, and store UTF-8 strings just like other strings.

but this does not seem to hold for command-line arguments. Here is a small test:

test.lua contains:

io.write(arg[1])

I run it like this:

lua test.lua かسжГ > test.txt

I get

????

and I get the same result with:

io.open("test.txt", "wb"):write(arg[1])

Tested with lua-5.4.8_Win32 on Windows 7 x64.

How can I solve this? Is there a workaround?

Update

I don't think this is a duplicate of "How can I use Unicode characters on the Windows command line?".

That question discusses chcp 65001, which I already tested with the same result. chcp only changes the console's code page; it does not force applications launched from that CMD session to operate in UTF-8 internally, the way setting LC_ALL does on Linux.

Many older Windows applications and even parts of the Windows API (often referred to as "ANSI" APIs) still rely on the system's default ANSI code page. If these applications don't explicitly use Unicode (UTF-16) APIs, they might still misinterpret or mangle UTF-8 data, even if the console is set to 65001.

One of the answers to that question says the same thing:

I see several answers here, but they don't seem to address the question—the user wants to get Unicode input from the command line.

Windows uses UTF-16 for encoding in two byte strings, so you need to get these from the OS in your program. There are two ways to do this—

  1. Microsoft has an extension that allows main to take a wide character array: int wmain(int argc, wchar_t *argv[]); https://msdn.microsoft.com/en-us/library/6wd819wh.aspx

  2. Call the Windows API to get the Unicode version of the command line: wchar_t **win_argv = (wchar_t **)CommandLineToArgvW(GetCommandLineW(), &nargs); see CommandLineToArgvW function (shellapi.h)

Read UTF-8 Everywhere for detailed information, particularly if you are supporting other operating systems.

I even tested in Cygwin and got the same result.

This is because Lua itself does not call GetCommandLineW (I searched the source code and could not find it), so no shell or console setting will fix it, however you force it. The fix has to happen inside Lua, and I am afraid the only options are to hack lua.c or to create a DLL that uses GetCommandLineW. I'm new to Lua and have only basic experience with C, so I wanted to know whether there is an easier way. I searched and found no one discussing this problem, so I assumed the problem was in my code, but it seems to be in Lua (I hope I'm wrong).

  • Works fine on my machine (GNU 9.4 / Linux 6.12.31, xfce4-terminal 1.1.5, lua 5.4.6, nvim 0.11.2), as expected. Is the character encoding(s) for the program(s) you use to run your command / executable and read the resulting file set correctly (i.e., your terminal emulator)? Commented Jun 13 at 4:59
  • @Oka please see the edit Commented Jun 13 at 7:28
  • @thebusybee please see the edit at the end of my question Commented Jun 13 at 7:28

1 Answer


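This is a known limitation on Windows. Lua is not a Unicode program (it uses the narrow main entry point), so it receives its command-line arguments already converted to the system's legacy (ANSI) code page, regardless of the console's active code page. Characters that the code page cannot represent are replaced with ?.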
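To confirm where the characters are lost, it can help to look at the raw bytes a program actually receives in argv. The following is a minimal sketch in portable C. The helper name `hex_dump` is hypothetical (it is not part of Lua or the Windows API); it only formats a string's bytes as hex.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Lua or the Windows API): writes the bytes
   of `s` into `out` as uppercase hex pairs separated by spaces, e.g. "3F 3F".
   Returns the number of bytes dumped. The dump is truncated if `out` is too
   small; 3 * strlen(s) bytes are enough to hold everything. */
size_t hex_dump(const char *s, char *out, size_t out_size) {
  assert(s != NULL && out != NULL && out_size > 0);
  size_t len = strlen(s);
  size_t pos = 0, n = 0;
  out[0] = '\0';
  /* each byte needs up to 3 chars (" XX") plus the terminating NUL */
  for (; n < len && pos + 4 <= out_size; n++) {
    pos += (size_t)snprintf(out + pos, out_size - pos,
                            n ? " %02X" : "%02X", (unsigned char)s[n]);
  }
  return n;
}
```

Calling `hex_dump(argv[1], buf, sizeof buf)` from a small test program and printing `buf` should show `3F 3F 3F 3F` when the argument has already been replaced by question marks (as in the asker's setup), versus `E3 81 8B D8 B3 D0 B6 D0 93` if the UTF-8 bytes of かسжГ arrive intact. In the first case the substitution happened before the program's own code ran, so nothing on the Lua side can recover the text.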

Workaround 1 is to switch the system code page to UTF-8: https://superuser.com/a/1435645/995824. Note that this is a system-wide setting.

Workaround 2 is to read a UTF-8 encoded file instead of passing the text through the command line. Here is an automated version: it creates a temp file and passes the argument through redirection (surprise: Windows preserves the encoding across redirection):

run.cmd:

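@ECHO OFF
REM Switch the console to UTF-8; hide the "Active code page" message
CHCP 65001 >NUL
REM Write the first argument to a temp file (ECHO appends a space and CRLF)
ECHO %1 > temp.txt
lua.exe test.lua < temp.txt
DEL temp.txt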

test.lua:

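local txt = io.stdin:read('l')
-- note: txt may end with stray whitespace (the space ECHO writes before the
-- redirection and, depending on the setup, a trailing \r), so trim it
txt = txt:gsub('%s+$', '')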

run:

run.cmd かسжГ > test.txt

Workaround 3 is to update the main function in lua.c to use CommandLineToArgvW and recompile Lua.

Remove the main function and replace it with:


// Formerly: int main (int argc, char **argv) {
// real_main() is static to limit its scope to this .c file: it is a private
// helper, not a global function.
static int real_main(int argc, char **argv) {

  int status, result;
  lua_State *L = luaL_newstate();  /* create state */
  if (L == NULL) {
    l_message(argv[0], "cannot create state: not enough memory");
    return EXIT_FAILURE;
  }
  lua_gc(L, LUA_GCSTOP);  /* stop GC while building state */
  lua_pushcfunction(L, &pmain);  /* to call 'pmain' in protected mode */
  lua_pushinteger(L, argc);  /* 1st argument */
  lua_pushlightuserdata(L, argv); /* 2nd argument */
  status = lua_pcall(L, 2, 1, 0);  /* do the call */
  result = lua_toboolean(L, -1);  /* get result */
  report(L, status);
  lua_close(L);
  return (result && status == LUA_OK) ? EXIT_SUCCESS : EXIT_FAILURE;
}

// WIN32_LEAN_AND_MEAN must be defined before including <windows.h>. By default
// that header drags in a huge set of APIs (graphics, networking, multimedia,
// COM, etc.), which slows down compilation, pollutes the global namespace with
// macros and typedefs, and can lead to name clashes or unexpected dependencies.
// The macro skips the least-used parts of the API. It does not change
// functionality; it only makes the build cleaner and faster.
#define WIN32_LEAN_AND_MEAN

#include <windows.h>    // for GetCommandLineW, WideCharToMultiByte, LocalFree
#include <shellapi.h>   // for CommandLineToArgvW (skipped by WIN32_LEAN_AND_MEAN)
#include <stdlib.h>     // for malloc/free, EXIT_FAILURE


// Our single entry point: always a console 'main', no subsystem tricks.
int main(void) {
  // 1) Grab the *true* Unicode (UTF-16) command line from the OS.
  //    The CRT's narrow argv has already been converted to the legacy
  //    code page (losing characters), so we ask the Windows API directly.
  int wargc;
  wchar_t **wargv = CommandLineToArgvW(GetCommandLineW(), &wargc);
  if (!wargv) return EXIT_FAILURE;

  // 2) Build a parallel UTF-8 argv[] array of char*.
  //    We'll pass this to the Lua engine.
  char **argv = malloc((wargc + 1) * sizeof(char*));
  if (!argv) {
    LocalFree(wargv);
    return EXIT_FAILURE;
  }

  for (int i = 0; i < wargc; i++) {
    // Figure out how many bytes we need in UTF-8 (including the '\0').
    int need = WideCharToMultiByte(
                  CP_UTF8,                // convert *to* UTF-8
                  0,                      // default flags
                  wargv[i], -1,           // input wchar_t*
                  NULL, 0,                // output buffer = NULL → length only
                  NULL, NULL              // no fallback chars
               );
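    if (need <= 0) {
      // Conversion failed: clean up and bail out. A NULL entry in the
      // middle of argv would look like the end of the argument list.
      for (int j = 0; j < i; j++) free(argv[j]);
      free(argv);
      LocalFree(wargv);
      return EXIT_FAILURE;
    }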

    // Allocate & convert
    argv[i] = malloc(need);
    if (!argv[i]) {
      // on malloc failure, clean up what we already did
      for (int j = 0; j < i; j++) free(argv[j]);
      free(argv);
      LocalFree(wargv);
      return EXIT_FAILURE;
    }
    WideCharToMultiByte(
      CP_UTF8, 0, wargv[i], -1,
      argv[i], need,
      NULL, NULL
    );
  }
  argv[wargc] = NULL;  // null terminate the list

  // 3) Call the original Lua startup, passing our UTF-8 args.
  int result = real_main(wargc, argv);

  // 4) Free everything
  for (int i = 0; i < wargc; i++) free(argv[i]);
  free(argv);
  LocalFree(wargv);

  return result;
}

Then edit the Makefile inside src to add -lshell32 to the gcc command, so that
$(CC) -shared -o
becomes
$(CC) -shared -lshell32 -o

Finally, run chcp 65001 before you call lua.
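As a cross-check of what the WideCharToMultiByte(CP_UTF8, ...) step in Workaround 3 is expected to produce, here is a minimal portable sketch of a UTF-16 to UTF-8 conversion. The helper name `utf16_to_utf8` is hypothetical, and `uint16_t` stands in for the 16-bit `wchar_t` used on Windows. This sketch rejects unpaired surrogates, which may differ from how the Windows call treats them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes `n` UTF-16 code units from `in` as a
   NUL-terminated UTF-8 string in `out`. Returns the number of bytes written
   (excluding the NUL), or (size_t)-1 on an unpaired surrogate or if `out`
   is too small. */
size_t utf16_to_utf8(const uint16_t *in, size_t n, char *out, size_t out_size) {
  assert(in != NULL && out != NULL);
  if (out_size == 0) return (size_t)-1;
  size_t pos = 0;
  for (size_t i = 0; i < n; i++) {
    uint32_t cp = in[i];
    if (cp >= 0xD800 && cp <= 0xDBFF) {            /* high surrogate */
      if (i + 1 >= n || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
        return (size_t)-1;                          /* no low surrogate follows */
      cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(in[i + 1] - 0xDC00);
      i++;                                          /* consumed two code units */
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      return (size_t)-1;                            /* lone low surrogate */
    }
    size_t len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    if (pos + len + 1 > out_size) return (size_t)-1; /* keep room for the NUL */
    switch (len) {
      case 1:
        out[pos++] = (char)cp;
        break;
      case 2:
        out[pos++] = (char)(0xC0 | (cp >> 6));
        out[pos++] = (char)(0x80 | (cp & 0x3F));
        break;
      case 3:
        out[pos++] = (char)(0xE0 | (cp >> 12));
        out[pos++] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[pos++] = (char)(0x80 | (cp & 0x3F));
        break;
      default:
        out[pos++] = (char)(0xF0 | (cp >> 18));
        out[pos++] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[pos++] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[pos++] = (char)(0x80 | (cp & 0x3F));
        break;
    }
  }
  out[pos] = '\0';
  return pos;
}
```

For the test argument かسжГ (U+304B, U+0633, U+0436, U+0413) this yields nine bytes, `E3 81 8B D8 B3 D0 B6 D0 93`, which is what test.txt should contain if the patched lua.exe passes the argument through unchanged.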


5 Comments

I can't tell users of this tool to change their system encoding to UTF-8 just to use it, and that option exists only in newer Windows 10 versions. Workaround 2 works because pipes don't have this problem (it only affects reading CLI args), but I can't ask users to go through that workaround either; nobody would use my tool :/ . I have to solve this myself so that my tool can be used like any other shell tool.
You can put the script along with the tool and tell the users to use the script to start, just like many tools do.
What a cool idea, I hadn't thought of that. I upvoted your answer but haven't accepted it, hoping a better solution comes along. The tool works like sed using PCRE2, so the args may contain any metacharacter, even double quotes, which would be error-prone to handle in batch. And if this works, I'd have to parse the full argument string in Lua, which would be a nightmare because it's a regex. Also, the tool will be bundled as an exe, and I don't know whether the batch file can be the entry point instead of Lua; I've never bundled a Lua script, so I don't know how that works yet. Hacking lua.c will be easier.
I added a method that hacks lua.c.
Workaround 3 didn't work; _tmain with MinGW-w64 causes issues, as seen here: reddit.com/r/cpp_questions/comments/1j20wi2/… . They suggest using CommandLineToArgvW (AI said the same). I've implemented a working solution using it (with the help of AI), and it worked perfectly. I replaced your answer with this working one. Can I update your answer for others' benefit, or should I delete my edit?
