2

I need to merge all the .txt files in a certain folder on my computer. There are hundreds of them and they all have different names, so any code that requires typing each file name by hand won't work for me. The files are UTF-8 encoded and contain emojis and characters from different scripts (such as Cyrillic) as well as accented characters (e.g. é, ü, à). A fellow Stack Overflow user kindly gave me the following code to run in PowerShell:

(gc *.txt) | out-file newfile.txt -encoding utf8

It works wonderfully for merging the files. However, it gives me a txt file encoded as "UTF-8 with BOM" instead of plain "UTF-8". Furthermore, all emojis and special characters have been replaced by others, such as "Ã¼" instead of "ü". It is very important for what I am doing that these emojis and special characters are preserved.

Could someone help me with tweaking this code (or suggesting a different one) so it gives me a merged txt-file with "UTF-8"-encoding that still contains all of the special characters? Please keep in mind that I am a layperson.

Thank you so much in advance for your help and kind regards!

9
  • Have you tried UTF8NoBOM? Get-Content also supports encoding specification, which the sample doesn't utilize. Commented Nov 8, 2019 at 9:56
  • @vonPryz Firstly, thank you for reacting! I tried it out, but (gc *.txt) | out-file newfile.txt -encoding UTF8NoBOM only gives me an error that: Out-File: Cannot validate argument on parameter 'Encoding'. The argument "UTF8NoBOM" does not belong to the set "unknown;string;unicode;bigendianunicode;utf8;utf7;utf32;ascii;default;oem" specified by the ValidateSet attribute. Supply an argument that is in the set and then try the command again. Commented Nov 8, 2019 at 10:03
  • UTF8NoBOM requires PowerShell 6 or later; you have an older version. Anyway, does it help if you specify UTF8 for Get-Content? Also try a workaround via .NET. Commented Nov 8, 2019 at 10:12
  • @vonPryz Oh, that explains my problem at least partially. The code I used was (gc *.txt) | out-file newfile.txt -encoding UTF8. If that is what you mean then unfortunately it didn't work. It always gives me a txt-file with "UTF-8 with BOM". I looked at the work-around (thank you!) you sent me, but there's a lot of information there and I'm not really sure what to use. Commented Nov 8, 2019 at 10:21
  • In PS 5 you at least need (gc *.txt -encoding utf8) if the input files are UTF-8 without a BOM. But PS 5 can't save as UTF-8 without a BOM (maybe via .NET?). Commented Nov 8, 2019 at 13:13

2 Answers

4

In PowerShell < 6.0, the Out-File cmdlet does not have a Utf8NoBOM encoding.
You can, however, write UTF-8 text files without a BOM using .NET:

Common for all methods below

$rootFolder = 'D:\test'  # the folder containing the text files to merge
$outFile    = Join-Path -Path $rootFolder -ChildPath 'newfile.txt'

Method 1

# create a Utf8NoBOM encoding object
$utf8NoBom = New-Object System.Text.UTF8Encoding $false  # $false means NoBOM
Get-Content -Path "$rootFolder\*.txt" -Encoding UTF8 -Raw | ForEach-Object {
    [System.IO.File]::AppendAllText($outFile, $_, $utf8NoBom)
}

Method 2

# create a Utf8NoBOM encoding object
$utf8NoBom = New-Object System.Text.UTF8Encoding $false  # $false means NoBOM
Get-ChildItem -Path $rootFolder -Filter '*.txt' -File | ForEach-Object {
    [System.IO.File]::AppendAllLines($outFile, [string[]]($_ | Get-Content -Encoding UTF8), $utf8NoBom)
}

Method 3

# Create a StreamWriter object which by default writes Utf8 without a BOM.
$sw = New-Object System.IO.StreamWriter $outFile, $true  # $true is for Append
Get-ChildItem -Path $rootFolder -Filter '*.txt' -File | ForEach-Object {
    Get-Content -Path $_.FullName -Encoding UTF8 | ForEach-Object {
        $sw.WriteLine($_)
    }
}
$sw.Dispose()
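
One thing to watch with the methods above is that the output file lives in the same folder and also matches *.txt, so on a second run it could be read as input and appended to itself. The following is a minimal sketch of a variant that avoids this, not a replacement for the methods above. It assumes the output should be overwritten rather than appended to, and that the files should be merged in name order. The function names Merge-TextFiles and Test-Utf8Bom and their parameters are made up for illustration.

```powershell
# Hypothetical helper: merges all *.txt files in a folder into one UTF-8 file without a BOM.
function Merge-TextFiles {
    param(
        [Parameter(Mandatory)] [string] $Folder,
        [string] $OutName = 'newfile.txt'
    )

    $folderPath = (Get-Item -LiteralPath $Folder).FullName
    $outFile    = Join-Path -Path $folderPath -ChildPath $OutName
    $utf8NoBom  = New-Object System.Text.UTF8Encoding $false  # $false means no BOM

    # Collect the input files up front and skip an output file left over from an earlier run.
    $files = @(Get-ChildItem -LiteralPath $folderPath -Filter '*.txt' -File |
        Where-Object { $_.FullName -ne $outFile } |
        Sort-Object Name)

    # The second argument $false means overwrite (the methods above append instead).
    $sw = New-Object System.IO.StreamWriter($outFile, $false, $utf8NoBom)
    try {
        foreach ($file in $files) {
            # Read each file as UTF-8 and write it out line by line.
            foreach ($line in [System.IO.File]::ReadAllLines($file.FullName, $utf8NoBom)) {
                $sw.WriteLine($line)
            }
        }
    }
    finally {
        # Close the file even if reading one of the inputs fails.
        $sw.Dispose()
    }

    $outFile  # return the full path of the merged file
}

# Hypothetical helper: returns $true if the file starts with the UTF-8 BOM bytes EF BB BF.
# Pass a full path, because .NET does not follow PowerShell's current location.
function Test-Utf8Bom {
    param([Parameter(Mandatory)] [string] $Path)

    $bytes = [System.IO.File]::ReadAllBytes($Path)
    return ($bytes.Length -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF)
}

# Example call, using the folder from the answer above:
# $merged = Merge-TextFiles -Folder 'D:\test'
# Test-Utf8Bom -Path $merged    # should be False for a BOM-less file
```

Because the writer is created with an explicit BOM-less encoding and disposed in a finally block, the merged file should come out as plain UTF-8 even if the run is interrupted by an unreadable input file.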

1 Comment

I simply ran the command using the latest version of PowerShell (7.4) and it worked without any issues.
1

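In PS 5, Get-Content (gc) misreads UTF-8 input files that have no BOM unless you pass the -Encoding parameter. The following keeps the special characters intact, although Out-File in PS 5 still writes the result with a BOM: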

(gc -Encoding Utf8 *.txt) | out-file newfile.txt -encoding utf8
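
To illustrate why the -Encoding parameter matters when reading, the sketch below decodes the UTF-8 bytes of "ü" with a single-byte encoding. That is roughly what happens when a BOM-less UTF-8 file is read without an explicit encoding in PS 5, and it produces garbled characters like those described in the question. ISO-8859-1 is used here as a stand-in for the system's ANSI code page, which is an assumption (the real code page depends on the Windows locale). The variable names are illustrative only.

```powershell
# "ü" is U+00FC. It is built from its code point so that the encoding
# of the script file itself cannot affect the demonstration.
$text = [string][char]0x00FC

# Encoded as UTF-8, "ü" becomes the two bytes C3 BC.
$bytes = [System.Text.Encoding]::UTF8.GetBytes($text)

# A single-byte encoding turns every byte into its own character,
# so the two bytes come back as two wrong characters.
$wrong = [System.Text.Encoding]::GetEncoding('iso-8859-1').GetString($bytes)

# Decoding the same bytes as UTF-8 restores the original character.
$right = [System.Text.Encoding]::UTF8.GetString($bytes)

# Print code points instead of the characters to stay independent of the console encoding.
'UTF-8 bytes : ' + (($bytes | ForEach-Object { '{0:X2}' -f $_ }) -join ' ')
'Misread as  : ' + (($wrong.ToCharArray() | ForEach-Object { 'U+{0:X4}' -f [int]$_ }) -join ' ')
'Read as UTF8: ' + (($right.ToCharArray() | ForEach-Object { 'U+{0:X4}' -f [int]$_ }) -join ' ')
```

The misread result consists of U+00C3 and U+00BC, which display as "Ã¼", while the UTF-8 decoding yields the single character U+00FC ("ü").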
