
I'm studying how tokenization works in C programming, especially for examination purposes. I want to understand how many tokens are present in the following C code and how preprocessor directives are counted as tokens.

Here’s the code I'm analyzing:

#include <stdio.h> 
#define PI 3.14159
#define SQUARE(x) ((x) * (x))

int main() {
    float radius = 5.0;
    float area = PI * SQUARE(radius);
    printf("Area of circle: %.2f\n", area);
    return 0;
}

I'm confused about the following:

  1. Are preprocessor directives like #include and #define counted when tokenizing in the C language (when asked as a question in an exam paper)?
  2. If these count as tokens, is <stdio.h> considered a single token or separate tokens like <, stdio.h, and >?
  3. Should we count macro parameters like (x) and macro body tokens?
  4. How many total tokens are there in this code from a C compiler's perspective?
  5. What is the exact rule or reference from the C standard or GCC documentation that explains this clearly?
  • It used to be that the preprocessor was a separate program, with its own language, which took source text as input and produced a pure C file as output. It has been very different for the last couple of decades, where the preprocessor is part of the compiler itself. C compilers also haven't had a separate tokenization phase for a long time, so the answer is actually not trivial. Commented May 28 at 8:35
  • The standard defines both pre-processing tokens and tokens. The pre-processing tokens get converted to tokens during translation phase 7, and there are some steps before translation phase 7, such as concatenation of adjacent string literal tokens at translation phase 6. Commented May 28 at 9:15
  • The first phase of compilation is the preprocessor, which converts the raw source into C source. It does this by replacing all #include directives with the actual content of the files named (and any files #included within those files). It also replaces any macro references with their definitions (e.g. PI will be replaced by its defined value: 3.14159). This newly created source file is then passed to the actual compiler to be turned into object code. So which tokens are you referring to? Most compilers have an option to save the preprocessed source to a file so you can examine it. Commented May 28 at 9:20
  • See, as an example, Microsoft's documentation of the C lexical grammar for how to break down source code into tokens. Commented May 28 at 9:42
  • Your question is confused and can't be answered as asked, since the formal C grammar separates preprocessing tokens from tokens. These don't exist at the same time - rather, the former are translated into the latter, and there are different rules for "tokenization" in each. You have to ask separate questions about preprocessing and about compilation - as things stand, it isn't possible to answer. Also, these are not beginner topics and are probably much too broad for the Q&A format. Commented May 28 at 10:35

2 Answers

  1. Are preprocessor directives like #include and #define counted when tokenizing in the C language (when asked as a question in an exam paper)?

The C grammar and analysis is specified in the C standard as parsing the source text into preprocessing tokens, which include # and include as preprocessing tokens. Then preprocessing is performed. After that, all preprocessing tokens are converted to tokens, and the main compilation occurs. (This is a conceptual order in the C standard, not necessarily the actual order used in a compiler.)

You will have to determine whether you want to count preprocessing tokens before preprocessing or count tokens after preprocessing.

#include and #define are not tokens (neither preprocessing tokens nor tokens). The preprocessing tokens are #, include, #, and define.

  2. If these count as tokens, is <stdio.h> considered a single token or separate tokens like <, stdio.h, and >?

The tokens in #include <stdio.h> are #, include, and <stdio.h>. There is a special sub-grammar for header names that results in <stdio.h> being a single token. Outside of an #include directive and certain other places where a header name is expected, <stdio.h> would be multiple tokens: <, stdio, ., h, and >. This means any parser that counts tokens must be context-sensitive.

  3. Should we count macro parameters like (x) and macro body tokens?

A token is a token: macro parameters and the tokens in the macro body are counted like any other preprocessing tokens.

  4. How many total tokens are there in this code from a C compiler's perspective?

I manually counted 53 preprocessing tokens but easily could have made a mistake.

  5. What is the exact rule or reference from the C standard or GCC documentation that explains this clearly?

There is no single rule. The grammar for a preprocessing token starts in C 2024 6.4.1 where it defines preprocessing-token to be one of header-name, identifier, pp-number, character-constant, string-literal, punctuator, a universal character name that cannot be one of the aforementioned, or a non-white-space character that cannot be one of the aforementioned. Definitions of those continue in other parts of the C standard. To count tokens, you will have to parse at least the preprocessing grammar of the C standard.




I was taught to treat preprocessor statements as effectively "word processing" instructions. They're not "tokens" at all in the sense of "what is compiled", though they may generate tokens.

It's quite interesting playing around with a compiler (such as gcc) that can be told to emit the source code after the preprocessor has done its work. What you get is C with the contents of all the header files inserted at the top, all the uses of #define macros replaced with their results in a purely textual manner, etc.

This is what the compiler itself then tokenises, assesses, and reports errors on. It shows just how deeply unaware of modules the C language (outside of its preprocessor directives) actually is.

Because this is all done as "text" and not "tokens", this is why one can get into deep trouble with macros such as #define SQUARE(x) x * x, and why it's better to have #define SQUARE(x) ((x) * (x)). The former simply plonks down x * x wherever it sees SQUARE(x), being just smart enough to swap x for whatever was put in the brackets where the macro was used.

It's also an illustration of where preprocessor directives came from. Once upon a time, in the very early stages of its development, C wouldn't have had #include, and all K&R could do was compile a single source file into a complete program. To make it possible to have modules, compile them, and then link them into a complete program, they needed a way of importing the contents of one module into another at the source-code level; #include, extern variables, and function prototypes were the simple (and, for the day, very effective) way of doing it. Arguably, some of the other preprocessor directives went too far, giving the language some seriously deep pits into which the unwary could fall...

Personally speaking, I'd assess that piece of code's token count after it has been preprocessed. The trouble with that answer is that #include <stdio.h> is opaque (in the context of the question asked); one cannot see how many tokens it is introducing. So I would say that the proper answer is "it depends". I would justify it on the basis that the text as presented comprises two separate languages - C, and C-preprocessor directives - with "tokens" being a C thing and that text not being purely C.

