Verilog parsing ambiguity

Question

I am struggling with a couple slightly strange conflicting conventions in Verilog. I have written my own parser, but I am uncertain how to resolve a few things. Looking at the Verilog Spec, I am still quite confused.

The statement a && b could be parsed as either:

Logical AND of a and b (canonical interpretation)
Reduction AND of b, followed by a bitwise AND with a, i.e. a & (&b)

There seems to be the following convention:

When using unary reduction operators (e.g., &a, |b, ^c), whitespace is not allowed between the operator and the operand.
To avoid ambiguity, whitespace is sometimes necessary

This all gets a little more hairy with the following sets of expressions:

a &&&b
a &&&&b
a &&&&&b
...

Also, the entire above conversation can be repeated around other unary operators | and ^~ I believe as well...

what is your specific question? ... please add a focused, answerable question to your post
– jsotola, Commented Jan 22 at 3:08

Tom Carpenter · Accepted Answer · 2025-01-23 00:34:05Z

From the spec you link to, we have:

3.1 Lexical tokens

Verilog HDL source text files shall be a stream of lexical tokens. A lexical token shall consist of one or more characters. The layout of tokens in a source file shall be free format; that is, spaces and newlines shall not be syntactically significant other than being token separators

So a Verilog file is just a free-format arrangement of tokens. The question is how the text is split into them. There is basically no looking back: the source is simply a stream that is split into tokens one by one.

Let's take an example without whitespace:

a&&b

The parser sees a, then finds a character that cannot be part of that token (a special character cannot appear in a name), so it starts the next token. It runs through both & characters, because the second one can extend the first into the valid token &&. It then reaches b, which is not a special character, so it starts the next token.
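The scanning procedure described above can be sketched in a few lines of Python. This is only an illustration, not how any particular tool is implemented: the name `tokenize` and the small `OPERATORS` table are assumptions made for the example, and the table covers only a handful of Verilog operators.

```python
# Hypothetical operator table: a small illustrative subset, not the full Verilog set.
OPERATORS = {"&", "&&", "|", "||", "~", "~&", "~|", "^", "^~", "~^", "="}


def tokenize(src):
    """Split src into tokens, extending each token for as long as it stays valid."""
    tokens = []
    i = 0
    while i < len(src):
        ch = src[i]
        if ch.isspace():
            i += 1  # whitespace only separates tokens, it is never a token itself
            continue
        j = i + 1
        if ch.isalpha() or ch == "_":
            # Identifier: keep consuming letters, digits, '_' and '$'.
            while j < len(src) and (src[j].isalnum() or src[j] in "_$"):
                j += 1
        elif ch in OPERATORS:
            # Operator: keep consuming while the longer string is still an operator.
            while j < len(src) and src[i:j + 1] in OPERATORS:
                j += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at offset {i}")
        tokens.append(src[i:j])
        i = j
    return tokens


print(tokenize("a&&b"))  # ['a', '&&', 'b']
```

Extending one character at a time works here because every prefix of each multi-character operator in the table is itself an operator; a table without that property would need a longest-match lookup instead.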

With whitespace, from the spec you link:

3.2 White space

White space shall contain the characters for spaces, tabs, newlines, and formfeeds. These characters shall be ignored except when they serve to separate other lexical tokens.

Whitespace is ignored except for forcefully separating tokens, so your suggestion that:

When using unary reduction operators whitespace is not allowed between the operator and the operand.

does not hold. The following works perfectly well:

b = & a;

The whitespace has no impact on the token splitting. Context will infer it to be unary reduction. The only reason to not include the whitespace is purely stylistic - making it easier for us to read.

For your example:

a && b

The whitespace is unnecessary - it will force splitting the a, &&, and b as the tokens, but is not required. All three are unambiguous anyway. See later for why && itself is unambiguous.

If you did:

a & & b

Now the whitespace comes into force. It separates the two & into two different tokens. The parser can then work out based on the context that the first is bitwise and the second is reduction. Operator precedence means the unary operation occurs first.
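How context can decide between the binary and the unary reading of the same & token can be sketched as follows. This is a simplified assumption about how a parser might do it (the function name `classify` and the labels are invented for the example): an operator that follows an operand or a closing parenthesis is taken as binary, anything else as unary.

```python
def classify(tokens):
    """Label each token as operand, open, close, binary, or unary from its left context."""
    labels = []
    prev_kind = None  # kind of the previous token, None at the start of the expression
    for tok in tokens:
        if tok[0].isalnum() or tok[0] == "_":
            kind = "operand"
        elif tok == "(":
            kind = "open"
        elif tok == ")":
            kind = "close"
        elif prev_kind in ("operand", "close"):
            kind = "binary"  # an operator right after a value combines two values
        else:
            kind = "unary"  # an operator with no value to its left applies to what follows
        labels.append((tok, kind))
        prev_kind = kind
    return labels


print(classify(["a", "&", "&", "b"]))
# [('a', 'operand'), ('&', 'binary'), ('&', 'unary'), ('b', 'operand')]
```

The sketch only labels tokens; it does not check that a token labelled unary is actually allowed as a unary operator, which a real parser would have to do.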

An interesting one is:

a &&&& b

3.4 Operators

Operators are single-, double-, or triple-character sequences and are used in expressions. Clause 5 discusses the use of operators in expressions

However, there appears to be a certain amount of sanity checking in the splitting, for which I can't find a reference yet: the tokeniser checks the validity of tokens as it splits.

So the token splitting for those operators might theoretically result in the tokens &&& and & because they can't be more than three characters. Let's look further.

Theoretically this should work for:

a &&& b

The middle token &&& is slightly ambiguous: in newer versions of Verilog it is a valid token, so a supporting parser will not try to decompose it any further. Older parsers may well not know about it, so they split it after the second ampersand into && and &, a logical AND followed by a unary AND.

So your four ampersands may decode into: a, &&, &&, and b on older parsers that didn't know about the triple ampersand operator. This would result in a syntax error because you can't have two logical ANDs with no other token between.

If we look at the case:

a ||| b

The sanity checks will decompose this into || and |, because ||| is not a valid token anywhere AFAIK, so it stops after || and moves on to the next.

For this stranger example:

a &&~| b

After && the next character ~ would result in an invalid token, so it stops and starts a new token, resulting in the decomposition into tokens: a, &&, ~|, and b. A logical AND followed by unary NOR.

The key thing is the tokeniser will not split until it either reaches whitespace or an invalid character for the current token, so you will not see for example && ever being split because the double ampersand is a valid token - unless you use whitespace to force it to happen.
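The effect of the operator table on the splits discussed above can be checked with a small longest-match sketch. The two tables and the name `split_operators` are assumptions for illustration: one table stands in for a parser that does not know &&&, the other for one that does, and both list only a few operators.

```python
# Hypothetical tables: a parser that does not know '&&&' and one that does.
OPS_WITHOUT_TRIPLE_AMP = {"&", "&&", "|", "||", "~", "~&", "~|", "^", "^~", "~^"}
OPS_WITH_TRIPLE_AMP = OPS_WITHOUT_TRIPLE_AMP | {"&&&"}


def split_operators(run, operators):
    """Split a whitespace-free run of operator characters by longest match."""
    max_len = max(len(op) for op in operators)
    out = []
    i = 0
    while i < len(run):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(run) - i), 0, -1):
            if run[i:i + n] in operators:
                out.append(run[i:i + n])
                i += n
                break
        else:
            raise ValueError(f"no operator matches at offset {i} in {run!r}")
    return out


print(split_operators("&&&&", OPS_WITHOUT_TRIPLE_AMP))  # ['&&', '&&']
print(split_operators("&&&&", OPS_WITH_TRIPLE_AMP))     # ['&&&', '&']
print(split_operators("|||", OPS_WITH_TRIPLE_AMP))      # ['||', '|']
print(split_operators("&&~|", OPS_WITH_TRIPLE_AMP))     # ['&&', '~|']
```

Whether the resulting token sequence is a legal expression is a separate question for the parser; the splitter only decides where the token boundaries fall.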

Another example based on your reply in the comments:

b = &&a;

The token parsing follows exactly the same rules as above: run forward, splitting whenever you reach either an invalid operator or a character that is not valid for the current token.

You will get four tokens: b, =, && and a. The double && is still not split up, because && is a valid token and there was no white space forcing a new token to begin.

The result is your synthesis tool issues a syntax error during compile because the logical AND (&&) is not a unary operator.
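Why the token sequence &&, a is rejected can be shown with a tiny syntax checker. This is a sketch under stated assumptions, not a real Verilog front end: it assumes a simplified grammar in which a unary operator must be followed directly by a value (an identifier or a parenthesised expression), it works on an already tokenised expression, and the names `check_expression`, `VerilogSyntaxError`, `UNARY_OPS`, and `BINARY_OPS` are invented for the example.

```python
# Hypothetical, incomplete operator sets for illustration only.
UNARY_OPS = {"&", "~&", "|", "~|", "^", "~^", "^~", "~", "!", "+", "-"}
BINARY_OPS = {"&", "&&", "|", "||", "^", "~^", "^~", "+", "-"}


class VerilogSyntaxError(Exception):
    pass


def check_expression(tokens):
    """Raise VerilogSyntaxError unless tokens form: operand (binary_op operand)*."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def primary():
        # A value: an identifier or a parenthesised expression.
        tok = peek()
        if tok == "(":
            advance()
            expression()
            if peek() != ")":
                raise VerilogSyntaxError("expected ')'")
            advance()
        elif tok is not None and (tok[0].isalpha() or tok[0] == "_"):
            advance()
        else:
            raise VerilogSyntaxError(f"expected a value, found {tok!r}")

    def operand():
        # At most one unary operator, then a value.
        if peek() in UNARY_OPS:
            advance()
        primary()

    def expression():
        operand()
        while peek() in BINARY_OPS:
            advance()
            operand()

    expression()
    if pos != len(tokens):
        raise VerilogSyntaxError(f"unexpected token {tokens[pos]!r}")


check_expression(["&", "a"])                 # accepted: reduction AND of a
check_expression(["a", "&", "&", "b"])       # accepted: a & (&b)
check_expression(["&", "(", "&", "a", ")"])  # accepted: &(&a)
try:
    check_expression(["&&", "a"])            # '&&' is not a unary operator
except VerilogSyntaxError as err:
    print("rejected:", err)
```

Under this simplified grammar the right-hand side of b = &&a fails because && appears where a value or a unary operator is expected.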

I think this has a lot of good insights, but the && will be split in the case of a leading double unary like &&a. I bring this all up because I have been writing my own parser which I have run against a large number of things from real world test cases... It isn't clear to me there is an unambiguous way to tokenize more generally speaking...
– meawoppl, Commented Jan 22 at 18:08
@meawoppl in the example &&a, this will not be split up into two unary operators at all. Parse the tokens in order as a stream without memory of previous tokens. First character is & so we are in an operator. Next is & - but && is a valid token, so keep going (don't start a new token!). Next character is a, which is not part of an operator, so only then do we start a new token. The result is you will parse into two tokens: && followed by a. You now issue a syntax error during the synthesis phase because && is not a unary operation.
– Tom Carpenter, Commented Jan 23 at 0:35
In fact technically even b = & &a should issue a syntax error even though you do get the tokens &, &, and a. This is because you can't apply a unary operator to an operator - unary must be <operator> [value], so <&> [&] is a syntax error because the second token is not a value. &(&a) would be valid because the () are special separators that ensure the internal tokens are computed first, resulting in a value.
– Tom Carpenter, Commented Jan 23 at 0:45

toolic · Accepted Answer · 2025-01-21 23:50:51Z

There is no ambiguity in the IEEE Std 1800-2023 regarding &&. It is unambiguous; it is the Logical AND. It is not the reduction-bitwise AND as you stated.

Keep in mind that the 1364-2005 spec that you linked has been superseded by 1800.

a &&& b is a compile error on multiple simulators. Of course, this depends on context since SystemVerilog did introduce the &&& operator in some contexts.

For a collection of free Verilog simulators, see the EDA Playground site.

When using unary reduction operators, whitespace is allowed between the operator and the operand. Simulators treat |a and | a the same way.
