I am wondering why capturing (piping) stdout results in empty output, while not capturing it results in normal output. I don't get an encoding error when writing to the terminal, only when piping the output.

Does the encoding change when outputting to a pipe instead of the terminal? Maybe the terminal can signal supported encodings, while a pipe defaults to ASCII?

Piping the output

$ curl -s 'https://www.sunwind.no/Outlet/'  | html2text | wc
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 212: ordinal not in range(128)
      0       0       0

Not capturing the output

$ curl -s 'https://www.sunwind.no/Outlet/' | html2text
  * [Forsiden](https://www.sunwind.no/)
  * [Logg inn](/login/)
  * [Registrer meg](https://www.sunwind.no/register/)

[![](https://www.sunwind.no/images/flag/NORW0001.GIF)](https://www.sunwind.no
"Klikk her for å gå til forsiden")
[![](https://www.sunwind.no/images/flag/SWDN0001.GIF)](https://www.sunwind.se
"Klikk her for å gå til Sunwinds svenske side")
[![](https://www.sunwind.no/images/flag/FINL0001.GIF)](https://www.sunwind.fi
"Klikk her for å gå til Sunwinds finske side")
[![](https://www.sunwind.no/images/flag/DENM0001.GIF)](https://www.sunwind.no/page/?pid=132
"Klikk her for å gå til Sunwinds danske side")
[![](https://www.sunwind.no/images/flag/UK0001.GIF)](https://www.sunwind.no/en/
"Klikk her for å gå til engelsk side")

[ ![](https://www.sunwind.no/images/Logo.png)](https://www.sunwind.no/)

![](https://www.sunwind.no/images/Enjoy_Brun_logo_text.png)

  * __ 0 **kr 0,-**

Nylig lagt til i handlevognen

[Gå til handlekurven](https://www.sunwind.no/account/basket/)

[ Sunwind.no](https://www.sunwind.no/)

  * Alle produkter __

##### **[KJØKKEN OG GASS](/product/content/show/?cap=7&KJOKKEN-OG-GASS) **

    * [__Gasskomfyr](https://www.sunwind.no/product/category/?cap=13)
    * [__Innbyggingsovn gass](https://www.sunwind.no/product/category/?cap=14)
    * [__Gasstopp](https://www.sunwind.no/product/category/?cap=15)
    * [__Gasskjøleskap](https://www.sunwind.no/product/category/?cap=63)
    * [__Kjøleskap 12 volt](https://www.sunwind.no/product/category/?cap=148)
    * [__Kjøle- og fryseboks](https://www.sunwind.no/product/category/?cap=144)
    * [__Kjøkkenvifte](https://www.sunwind.no/product/category/?cap=65)
    * [__Tilbehør og turutstyr](https://www.sunwind.no/product/category/?cap=97)
    * [__Gassutstyr og monteringsmateriell](https://www.sunwind.no/product/category/?cap=20)


etc

The html2text alias is a one-line python command:

alias html2text='python -c "import sys,html2text;sys.stdout.write(html2text.html2text(sys.stdin.read().decode(\"utf-8\")))"'

This is the first time I have encountered this behaviour. Is my one-liner somehow not handling piped output?

  • Try unsetting LC_ALL before running... unset LC_ALL; curl ... Commented Mar 17, 2018 at 12:50
  • @MarkSetchell Same error. Commented Mar 17, 2018 at 13:25
  • Does your python command run Python 2 or 3? There are issues with text encoding, but some useful defaults were added in Python 3, so maybe you don't have to change your program. Commented Mar 17, 2018 at 14:31
  • @ArndtJonasson You were right that installing Python 3 fixed the problem. So thanks for that. But the question of why the behaviour of python 2 is how it is still remains :-) Commented Mar 17, 2018 at 14:51
  • I'm not an expert, but it seems that googling "python unicode" brings up useful information. Commented Mar 17, 2018 at 15:06

1 Answer

Just my 2 cents: since the input data is properly decoded (.decode(\"utf-8\")), the output data needs to be encoded as well (.encode(\"utf-8\")). The working version of your one-liner is shown below. The question of why would take a long time to study.

alias html2text='python2 -c "import sys,html2text;sys.stdout.write(html2text.html2text(sys.stdin.read().decode(\"utf-8\")).encode(\"utf-8\"))"'
curl -s 'https://www.sunwind.no/Outlet/'  | html2text | wc
    764    1839   29329
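
On the why, as far as I can tell: Python 2 sets sys.stdout.encoding from the locale when stdout is a terminal, but leaves it as None when stdout is a pipe, so writing a unicode object falls back to the default ascii codec. Below is a minimal sketch that mimics that fallback. The helper names effective_encoding and try_encode are hypothetical, made up for illustration, and are not part of Python or html2text.

```python
def effective_encoding(stream_encoding, default="ascii"):
    """Pick the codec used for an implicit encode.

    stream_encoding stands in for sys.stdout.encoding: a locale-derived
    name such as "UTF-8" on a terminal, or None when stdout is a pipe
    (assumed Python 2 behaviour).
    """
    return stream_encoding or default


def try_encode(text, stream_encoding):
    """Encode text as an implicit write would; report a failure as a string."""
    codec = effective_encoding(stream_encoding)
    try:
        return text.encode(codec)
    except UnicodeEncodeError as exc:
        return "failed with %s codec: %s" % (codec, exc.reason)


# \xe5 is the character named in the traceback from the question.
sample = u"Klikk her for \xe5 g\xe5 til forsiden"

terminal_result = try_encode(sample, "UTF-8")  # terminal case: encoding known
pipe_result = try_encode(sample, None)         # pipe case: falls back to ascii

print(repr(terminal_result))
print(pipe_result)
```

Calling .encode(\"utf-8\") explicitly, as in the alias above, sidesteps the fallback because the stream then receives bytes and never has to pick a codec.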

4 Comments

Thanks! I added the encoding to the wrong part when I tried earlier. But why does it work when not piping? Does the encoding change when outputting to a pipe instead of the terminal? Maybe the terminal can signal supported encodings, while a pipe defaults to ASCII?
It works with piping. I tried curl -s 'https://www.sunwind.no/Outlet/' | html2text | grep __ and found that grep can handle the output from html2text. I think you are correct that the terminal can support the encoding used by html2text. I also think the pipe is just a stream of bytes, and the programs in a pipeline need to share the same understanding (encoding/decoding method) of that byte stream.
Sorry, writing on the internet is hard. 'it' referred to the original version without explicit encoding of the resulting stream. I know your version works with piping, while mine didn't.
Sorry that I mistakenly piped through grep without using the original version. Have a good day. :)
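
To check the "pipe is just a stream of bytes" point, here is a rough sketch that runs a child interpreter with its stdout attached to a pipe and forces the child's stream encoding through the PYTHONIOENCODING environment variable. The run_child helper and the CHILD snippet are made up for illustration. The assumption is that the child fails with the ascii codec, as in the question, and succeeds with utf-8.

```python
import os
import subprocess
import sys

# Child program: writes one non-ASCII character (the \xe5 from the traceback).
CHILD = r"import sys; sys.stdout.write(u'\xe5')"


def run_child(io_encoding):
    """Run CHILD with stdout piped and the given stdout encoding forced."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        env=env,
    )
    out, err = proc.communicate()
    return proc.returncode, out, err


utf8_status, utf8_bytes, _ = run_child("utf-8")
ascii_status, ascii_bytes, ascii_err = run_child("ascii")

# The reader of the pipe only sees raw bytes and must decode them itself.
print(utf8_status, repr(utf8_bytes))
print(ascii_status, repr(ascii_bytes))
```

If that assumption holds, prefixing the original command with PYTHONIOENCODING=utf-8 may be another way to make the unmodified one-liner work when piped.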