
I know that the first line of a Python file can declare the file's encoding.

But since that first line's characters are themselves stored in some specific encoding, how does an editor (or the interpreter) know the correct encoding to read that first line in the first place?

Thanks for your reply.

  • A file is just a bunch of bytes. The interpreter checks whether the first few bytes contain encoding info. See utf-8 for Python 3.x. Commented Sep 27, 2020 at 9:35
  • @heLomaN I think the OP's question is how the interpreter decodes the first few bytes to get the encoding info without knowing the encoding of those first few bytes. Commented Sep 27, 2020 at 11:28
  • what do you mean (exactly) by "editor"? Commented Oct 30, 2020 at 16:02
  • Very strongly related: What's the difference between 'coding=utf8' and '-*- coding: utf-8 -*-'? Commented Nov 4, 2020 at 8:24
  • Basically: how an editor interprets those lines is up to each individual editor. With most editors these days defaulting to UTF-8, it is easier to just ignore the whole issue, but the PEP 263 format comment standard is specifically designed to support whatever your editor might support. Commented Nov 4, 2020 at 8:25

4 Answers


This is mostly a ramble, because codec handling in Python is a bit of a ramble.

First, the encoding line is handled through the standard Python library's codec machinery. It's an odd adapter pattern:

  • Odd complications around recognizing various codecs named 'utf-*'
  • The idea of 'Stream' versus 'Incremental' versus basic encoders/decoders
  • Explicit getregentry() and register() functions instead of metadata (a minimal register() sketch follows this list)
  • Poor documentation, and lots of implementation-specific tricks
  • You can start by looking at cpython/Python/codecs.c (the CPython source), which is more accurate than the documentation
  • This is an area where you might find incompatibilities between CPython, Jython, PyPy, and other implementations
  • Here there be dragons
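
As a taste of that register() adapter pattern, here is a minimal custom codec sketch (the codec name my-rot13 is made up for illustration; it just reuses the stdlib's rot13 codec). Note the name normalization that codecs.lookup() applies before calling your search function:

    import codecs

    def search(name):
        # lookup() normalizes names first: lower case, with hyphens and
        # spaces converted to underscores -- so we must match "my_rot13".
        if name == "my_rot13":
            base = codecs.lookup("rot13")  # reuse the stdlib rot13 codec
            return codecs.CodecInfo(base.encode, base.decode, name="my-rot13")
        return None  # let other registered search functions try

    codecs.register(search)
    print(codecs.encode("hello", "my-rot13"))  # -> 'uryyb'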

More specifically, the encoding line is defined by PEP 263. Because every character in the declaration is in the low (ASCII) range, it works under ASCII-compatible encodings like UTF-8, ISO-8859-1, and others. It's a bit like the old Hayes modem "AT" command prefix: two letters that happened to work regardless of parity and byte-size settings. The most common exception is UTF-16 and its variants, which carry a BOM instead.
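
PEP 263 pins this down with a regular expression that must match on the first or second line. Here is a minimal sketch of that check, done on raw bytes so it works under any ASCII-compatible source encoding (the stdlib's tokenize module keeps an equivalent pattern internally):

    import re

    # Pattern taken from PEP 263; every byte it can match is plain ASCII.
    cookie_re = re.compile(rb'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')

    def find_coding_cookie(source_bytes):
        # Only the first two lines may carry the declaration.
        for line in source_bytes.splitlines()[:2]:
            match = cookie_re.match(line)
            if match:
                return match.group(1).decode("ascii")
        return None

    print(find_coding_cookie(b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\n"))
    # -> 'latin-1'

(The sketch skips one rule from the PEP: the second line only counts if the first line is a comment or blank.)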

You might also look at cpython/Parser/tokenizer.c:check_coding_spec(), cpython/Parser/pegen.c:1172 calling PyTokenizer_FromFile(), and others. It's a bit of a rabbit hole, and you will understand too much of Python's tokenizer before you are done.

The short answer: Python initially opens the file as bytes; everything is UTF-8 by the time it leaves the tokenizer. The tokenizer checks for a BOM (Byte Order Mark), does some magic with the codec machinery to read the encoding line, and then uses that encoding. It's messy, but it works in enough variants that people are satisfied.
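
In fact the standard library exposes that exact logic: tokenize.detect_encoding() performs the same BOM check and cookie search the tokenizer does, so you can watch it work:

    import io
    import tokenize

    source = b'# -*- coding: latin-1 -*-\nname = "caf\xe9"\n'

    # detect_encoding reads at most two lines: it checks for a UTF-8 BOM,
    # then searches for a PEP 263 cookie, and falls back to 'utf-8'.
    encoding, lines_read = tokenize.detect_encoding(io.BytesIO(source).readline)
    print(encoding)  # -> 'iso-8859-1' (tokenize normalizes 'latin-1' to this)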

I hope this answers your question.


4 Comments

The handling in Python reflects the long and storied history of the tokenizer. PEP 263 is pretty simple and clear, and any incompatibilities between CPython, Jython, PyPy, and other implementations would be PEP violations. There are no dragons here.
Note that nothing in this answer addresses the actual question: how an editor might interpret those lines, if at all.
There are dragons in the implementation. Writing custom encoders has shown me that.
True. For an editor, it's editor-specific. Almost all of them just set the encoding, with a warning if the BOM marks don't match. Some just crash if the Unicode won't parse.

Each editor has its own built-in algorithms, which depend on the byte content and sometimes the file extension, to determine the encoding. For most file extensions, if the editor cannot determine the encoding, it falls back to a common default, usually UTF-8 for text files, since UTF-8 supports a large character repertoire and is widely used.
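
As an illustration (a toy sketch, not any particular editor's algorithm), a detector that looks for a BOM signature in the leading bytes and otherwise falls back to a default might look like this:

    import codecs

    # Order matters: UTF-32 LE's BOM begins with UTF-16 LE's two bytes,
    # so the longer signatures must be checked first.
    BOMS = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]

    def sniff_encoding(raw, fallback="utf-8"):
        for bom, name in BOMS:
            if raw.startswith(bom):
                return name
        return fallback  # no signature: use the common default

    print(sniff_encoding(b"\xef\xbb\xbfhello"))  # -> 'utf-8-sig'
    print(sniff_encoding(b"plain bytes"))        # -> 'utf-8'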

Take Python itself, for example. During the Python 2 era, the default/fallback encoding for source code was ASCII, so the first few lines where you declare your encoding had to be valid ASCII for Python 2 to process them. In Python 3, this default switched to UTF-8: the interpreter reads the first few lines as UTF-8 and then re-reads the file with whatever custom encoding you declared.
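
You can watch the Python 3 default in action with the stdlib's tokenize.detect_encoding(), which mirrors the interpreter's rules: with no BOM and no declaration, it reports UTF-8.

    import io
    import tokenize

    # No BOM and no PEP 263 cookie: Python 3 assumes UTF-8.
    enc, _ = tokenize.detect_encoding(io.BytesIO(b"x = 1\n").readline)
    print(enc)  # -> 'utf-8'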

6 Comments

"So your first few lines where you mention your encoding should be valid ASCII for Python2 to process it. In Python 3, this has been switched to UTF-8. " <- this is the crucial part I think. So for example if the file is encoded with an encoding FOO which happens to not be a superset of ASCII, Python has no chance of interpreting the # -*- coding: FOO -*- line?
@timgeb Encodings that are not a superset of ASCII are not supported by Python.
@MartijnPieters Thanks, got a link?
@timgeb: PEP 263 is the official reference here: Any encoding which allows processing the first two lines in the way indicated above is allowed as source code encoding, this includes ASCII compatible encodings as well as certain multi-byte encodings such as Shift_JIS. It does not include encodings which use two or more bytes for all characters like e.g. UTF-16. Note that Shift JIS is basically a superset of ASCII here (only 0x5C and 0x7E differ, but valid encoding names never use \ or ~).
@timgeb: hrm, this was all in response to your bounty? The question here is rather vague, it doesn't really narrow down if this was about an editor interpreting the Python comment or some other form of file encoding detection. If this is about the editor also interpreting the PEP 263 comment then that's up to each editor; this older answer of mine references the Emacs and VI documentation for these, but Gedit and Kate have similar support, and other editors have plugins that add modeline support.

I don't believe there is any foolproof way of knowing a file's encoding other than guessing an encoding and then trying to decode with it.

The editor might assume, for example, UTF-8: a very common encoding capable of representing any Unicode character. If the file decodes without errors, there is nothing else to do. Otherwise, the editor presumably has a strategy of trying certain other encodings until one succeeds without a decoding error, or it finally gives up. An editor that understands content might additionally check, even when the file decodes cleanly, whether the result is representative of what the file type implies.
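
A minimal sketch of that guess-and-retry strategy (the candidate list here is just an assumption; real editors use their own heuristics):

    def guess_encoding(raw, candidates=("utf-8", "utf-16", "latin-1")):
        # Try each candidate in turn; return the first that decodes cleanly.
        # Weakness: latin-1 decodes *any* byte sequence, so it only makes
        # sense as a last resort, and success does not prove correctness.
        for enc in candidates:
            try:
                raw.decode(enc)
                return enc
            except UnicodeDecodeError:
                continue
        return None

    print(guess_encoding("héllo".encode("utf-8")))  # -> 'utf-8'
    print(guess_encoding(b"caf\xe9s"))              # -> 'latin-1'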


I'm not sure I understood your question. However, all IDEs have a default encoding, which for Python IDEs is UTF-8. The editor first checks whether byte values are below 128 or not; from that it gets a hint about whether characters occupy one byte or several (and therefore whether the file looks like UTF-8, UTF-16, and so on).

Another reason the default encoding is UTF-8 is that UTF-8 can handle any Unicode code point.

You can find more info from here: https://docs.python.org/3/howto/unicode.html
