In a survey dataset I have a string variable (type: str244) with qualitative responses. I want to count the number of characters in each response/string and generate a new variable containing this number.

Using the egenmore package, I have already counted the number of words with nwords(), but I cannot find a counterpart for counting characters.

EXAMPLE:

egen countvar = nwords(stringvar)

where countvar is the new variable name and stringvar is the string variable.

Does such an egen function exist for counting characters?

  • The function wordcount() in Stata makes the older add-on nwords() redundant. Note egenmore is downloaded using ssc inst egenmore. Commented Aug 5, 2015 at 18:06
  • The help for egenmore does point to wordcount(). N.B. nwords() (written for Stata 6) is very slow. Commented Aug 5, 2015 at 18:14
  • Thank you for mentioning this. gen countvar = wordcount(stringvar) works like a charm. I wasn't aware that wordcount was used with gen, not egen. Perfect! Commented Aug 5, 2015 at 19:10

2 Answers

11

There is no egen function because there has long [sic] been a function, in the strict sense, to do this. In recent versions of Stata the function is called strlen(), but the older name length() continues to work:

. sysuse auto
(1978 Automobile Data)

. gen l1 = length(make)

. gen l2 = strlen(make)

. su l?

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          l1 |         74    11.77027    2.155257          6         17
          l2 |         74    11.77027    2.155257          6         17

See help functions and (e.g.) this tutorial column.
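Applied to the situation in the question, the same idea can be sketched as follows. The variable name response and the toy data are made up for illustration. One assumption worth flagging: from Stata 14 on, strlen() counts bytes, so for text with non-ASCII characters ustrlen() (which counts Unicode characters) may be what you want.

```stata
* Minimal sketch: character and word counts for a string variable.
* "response" is a hypothetical stand-in for the survey variable.
clear
input str40 response
"I like it"
"No opinion"
"Too expensive for what it offers"
end

generate nchars = strlen(response)      // length of each string (bytes in Stata 14+)
generate nwords = wordcount(response)   // function used with generate, not egen

* Stata 14 or later only: count Unicode characters rather than bytes
generate nchars_u = ustrlen(response)

list
```

For plain ASCII responses nchars and nchars_u should agree; they differ only when a response contains multibyte characters.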


4 Comments

What about for counting digits in a numeric variable?
That's a new question really as there are subtle differences. Do you mean integers or you include decimal parts? If you mean integers, log10(x) + 1 is a good start. If you include numbers with decimal parts, the question is a lot messier without knowing a display format.
log10(x)+1 breaks down with larger numbers (at least it does in the console). You need to wrap it in floor() before adding 1. Try these with two 9-digit numbers:
    di log10(999999999)+1
    di log10(999099999)+1
    di floor(log10(999999999))+1
    di floor(log10(999099999))+1
@ChrisD That's the kind of detail that made me say "a good start".
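Putting the comments above together, a digit count for a variable holding positive integers could be sketched like this. The variable x and its values are hypothetical, and the approach assumes every value is a positive integer: log10() returns missing for zero and negative values, so those observations would get a missing count.

```stata
* Sketch: number of digits in positive integers, using floor(log10()) + 1.
* "x" is a hypothetical variable; values are illustrative only.
clear
input double x
7
42
15906
999099999
999999999
end

* floor() discards the fractional part of log10(x) before 1 is added
generate ndigits = floor(log10(x)) + 1

list
```

Storing x as double matters for large integers, since float cannot hold all 9-digit integers exactly. Exact powers of 10 are worth checking separately if your data contain them, as they sit right on the boundary between two digit counts.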
-1
. sysuse auto,clear
(1978 Automobile Data)

. tostring price, gen(price1)
price1 generated as str5

. gen l3=length(price1)

. sum l3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          l3 |         74    4.135135    .3442015          4          5

2 Comments

In case you want the number of digits in a numeric variable.
This has to seem naive. See my comment underneath my answer. The "length" of a numeric variable is well defined only in certain cases. In your example, price is reported as a positive integer, and for that you don't need to convert to a string variable. You just need to push the maximum value through floor(log10()) + 1. Your code could be problematic for variables in which any numeric value was negative or contained fractional parts, depending on precision issues and what you want precisely.
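To see why the string-conversion route needs care, here is a small sketch with made-up values. It uses the string() function with its default display format (rather than tostring) purely to keep the example short; the variable names are hypothetical.

```stata
* Sketch: string length is not the same thing as "number of digits"
* once values are negative or have fractional parts.
clear
input double x
5
-5
2.5
end

* string() converts using a default display format;
* the minus sign and the decimal point each count as a character
generate str20 xs = string(x)
generate len = strlen(xs)

list
```

Here -5 and 2.5 come out longer than 5 even though one could argue each has a single significant digit or two, which is the ambiguity the comment above points to. With fractional values the result also depends on the display format chosen for the conversion.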
