I'm building a command-line tool in Lua, and users may call my script with UTF-8 arguments.
Programming in Lua 4th edition says:
Several things in Lua “just work” for UTF-8 strings. Because Lua is 8-bit clean, it can read, write, and store UTF-8 strings just like other strings.
However, this does not seem to hold for command-line arguments. Here is a small test:
test.lua contains:
io.write(arg[1])
I run it like this:
lua test.lua かسжГ > test.txt
I get
????
and I get the same result with:
io.open("test.txt", "wb"):write(arg[1])
Tested with lua-5.4.8_Win32 on Windows 7 x64.
How can I solve this? Is there a workaround?
Update
I don't think this is a duplicate of How can I use Unicode characters on the Windows command line?.
That link recommends chcp 65001, which I already tested with the same result. chcp only changes the console's code page; it does not force every application launched from that CMD session to operate in UTF-8 internally, the way LC_ALL does on Linux.
Many older Windows applications and even parts of the Windows API (often referred to as "ANSI" APIs) still rely on the system's default ANSI code page. If these applications don't explicitly use Unicode (UTF-16) APIs, they might still misinterpret or mangle UTF-8 data, even if the console is set to 65001.
One of the answers in that link says the same thing:
I see several answers here, but they don't seem to address the question—the user wants to get Unicode input from the command line.
Windows uses UTF-16 for encoding in two byte strings, so you need to get these from the OS in your program. There are two ways to do this—
Microsoft has an extension that allows main to take a wide character array: int wmain(int argc, wchar_t *argv[]); https://msdn.microsoft.com/en-us/library/6wd819wh.aspx
Call the Windows API to get the Unicode version of the command line: wchar_t **win_argv = (wchar_t **)CommandLineToArgvW(GetCommandLineW(), &nargs); (see CommandLineToArgvW function (shellapi.h))
Read UTF-8 Everywhere for detailed information, particularly if you are supporting other operating systems.
I even tested in Cygwin and got the same result.
This is because Lua itself does not use GetCommandLineW (I searched the source code and could not find it), so no shell or console setting will fix it. The fix has to happen inside Lua. I am afraid the only solution is to patch lua.c or to write a DLL that calls GetCommandLineW, but I'm new to Lua and have only basic experience with C, so I would like to know whether there is an easier way. I searched and found no one discussing this problem, so at first I assumed the problem was in my code, but it seems to be in Lua itself (I hope I am wrong).
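To make the conversion step concrete, here is a minimal C sketch of what a patched lua.c or a helper DLL would have to do with the wide arguments before handing them to Lua: re-encode UTF-16 code units as UTF-8. The function name utf16_to_utf8 and its signature are hypothetical (they are not part of Lua or of the Windows API), and substituting U+FFFD for unpaired surrogates is an assumption of this sketch. On Windows, WideCharToMultiByte with CP_UTF8 should be able to do the same job.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Lua or the Windows API.
 * Encodes `len` UTF-16 code units from `in` as UTF-8 into `out`
 * (capacity `cap` bytes, including the terminating NUL).
 * Returns the number of bytes written, excluding the NUL,
 * or (size_t)-1 if `out` is too small.
 * Assumption of this sketch: unpaired surrogates become U+FFFD. */
size_t utf16_to_utf8(const uint16_t *in, size_t len, char *out, size_t cap)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        uint32_t cp = in[i];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: combine with the following low surrogate. */
            if (i + 1 < len && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                i++;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            /* Low surrogate without a preceding high surrogate. */
            cp = 0xFFFD;
        }

        unsigned char buf[4];
        size_t k;
        if (cp < 0x80) {
            buf[0] = (unsigned char)cp;
            k = 1;
        } else if (cp < 0x800) {
            buf[0] = (unsigned char)(0xC0 | (cp >> 6));
            buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
            k = 2;
        } else if (cp < 0x10000) {
            buf[0] = (unsigned char)(0xE0 | (cp >> 12));
            buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
            k = 3;
        } else {
            buf[0] = (unsigned char)(0xF0 | (cp >> 18));
            buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
            k = 4;
        }

        /* Keep one byte free for the terminating NUL. */
        if (n + k + 1 > cap)
            return (size_t)-1;
        for (size_t j = 0; j < k; j++)
            out[n++] = (char)buf[j];
    }

    if (n + 1 > cap)
        return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

In a wrapper, each wide argument obtained from wmain or CommandLineToArgvW would go through a conversion like this before being placed in Lua's arg table, so that io.write(arg[1]) writes the original UTF-8 bytes instead of question marks. Building that wrapper against the Lua C API is not shown here.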

GNU 9.4 / Linux 6.12.31, xfce4-terminal 1.1.5, lua 5.4.6, nvim 0.11.2), as expected. Is the character encoding set correctly for the program(s) you use to run your command / executable and read the resulting file (i.e., your terminal emulator)?