
I want to create a new variable in Stata that is a function of 3 different variables, X, Y and Z, like:

gen new_var = (((X)*3) + ((Y)*2) + ((Z)*4))/7

All observations have missing values for one or two of the variables.

When I run the aforementioned command, all it generates are missing values, because no observation has values for all 3 of the variables. I would like Stata to complete the function ignoring the missing variables.

I tried the following commands without success:

gen new_var= (cond(missing(X*3),., X) + cond(missing(Y*2),., Y))/7 
gen new_var= (!missing(X*3+Y*2+Z*4)/7)
gen new_var=  (max(X , Y, Z)/7) if missing(X , Y, Z) 

The egen command does not allow complicated functions; otherwise rowtotal() could work.


EDIT:

To clarify, "ignoring missing variables" means that even if any one of the component variables is not missing, then apply the function to only that variable and produce a value for the new variable. The new variable should have missing values only when all three component variables are missing.

  • egen does often allow quite complicated arguments. The limitation here is specific to rowtotal() which takes only a varlist. Commented Feb 23, 2019 at 14:17
  • Welcome to Stack Overflow. It is always best to provide some example data and the expected output. This will maximise your chances to get a helpful answer. For tips on how to improve your future questions please read How to create high quality reproducible examples in Stata. Commented Feb 23, 2019 at 14:45
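
To illustrate the point about rowtotal() taking only a varlist, here is a minimal sketch that builds the weighted terms first and then passes them to rowtotal(). The helper variables wX, wY, wZ and total are hypothetical names, and the three-row dataset is made up purely for illustration. The missing option of rowtotal() returns missing when every variable in the list is missing.

```stata
clear
input X Y Z
7 . .
. 7 7
. . .
end

* Hypothetical helper variables holding the weighted terms
generate wX = 3 * X
generate wY = 2 * Y
generate wZ = 4 * Z

* rowtotal() skips missing terms; with the missing option,
* a row whose terms are all missing gets a missing total
egen total = rowtotal(wX wY wZ), missing
generate new_var = total / 7
drop wX wY wZ total

assert new_var == 3 in 1
assert new_var == 6 in 2
assert missing(new_var) in 3
```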

2 Answers

2

I am going to guess that "ignoring missing values" means "treating them as zeros". If you have some other idea, you should make it explicit.

That could be

gen new_var = (cond(missing(X), 0, 3 * X) ///
+ cond(missing(Y), 0, 2 * Y) ///
+ cond(missing(Z), 0, 4 * Z)) / 7 
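
As a quick check of what "treating them as zeros" implies, the sketch below runs the same command on a small made-up dataset (the values are assumptions chosen so the arithmetic is easy to follow). Note that the row with all three variables missing ends up as 0, not missing. The /// line continuation assumes the code is run from a do-file.

```stata
clear
input X Y Z
7 . .
. 7 .
. . 7
7 7 .
. . .
end

generate new_var = (cond(missing(X), 0, 3 * X) ///
    + cond(missing(Y), 0, 2 * Y) ///
    + cond(missing(Z), 0, 4 * Z)) / 7

list

* Each missing term contributes 0 to the sum
assert new_var == 3 in 1
assert new_var == 2 in 2
assert new_var == 4 in 3
assert new_var == 5 in 4
* All three missing: the sum is 0, not missing
assert new_var == 0 in 5
```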

Let's look at your solutions and explain why they are all wrong either in general or usually.

(cond(missing(X*3),., X) + cond(missing(Y*2),., Y))/7 

It is sufficient to note that if it's true that X is missing, then cond() yields missing, as then X * 3 is missing too. The same kind of remark applies to terms involving Y and Z. So you're replacing any missing values by missing values, which is no gain.
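
A small sketch of that point, using a made-up two-row dataset and a hypothetical variable name attempt: the cond() expression hands back X unchanged, missing included.

```stata
clear
input X
2
.
end

* missing(X * 3) is true exactly when missing(X) is true
assert missing(X * 3) == missing(X)

* So this cond() returns missing where X is missing and X otherwise,
* which is just X again
generate attempt = cond(missing(X * 3), ., X)
assert attempt == X
```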

!missing(X*3+Y*2+Z*4)/7

Given that at least one of X, Y, Z is always missing, this always evaluates to 0/7, or 0. Even if X, Y, Z were all non-missing, it would evaluate to 1/7. That is a long way from the sum you want: missing() always yields 1 or 0, and its negation thus 0 or 1.
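
The two possible results can be checked directly with display, using literal values in place of the sum (the literals are just stand-ins for a missing and a non-missing total):

```stata
* Missing argument: missing() is 1, its negation is 0, so this is 0/7
display !missing(.) / 7

* Non-missing argument: missing() is 0, its negation is 1, so this is 1/7
display !missing(5) / 7
```

The negation is applied before the division, so the expression is (0 or 1) divided by 7 regardless of the values of X, Y and Z.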

(max(X, Y, Z)/7) if missing(X , Y, Z) 

The maximum of X, Y, Z will be the right answer if and only if one of the values is not missing and the other two are missing. max() ignores missings to the extent possible (even though in other contexts missings are treated as if arbitrarily large positive numbers).
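
A short sketch of how max() handles missings, again with literal values chosen only for illustration:

```stata
* max() skips missing arguments when at least one is non-missing: shows 3
display max(., 3, .)

* With every argument missing, the result is missing: shows .
display max(., ., .)

* In comparisons, by contrast, missing counts as larger than any number: shows 1
display (. > 1000)
```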


3 Comments

Thanks Nick. I am very new to Stata, so all your explanations are really helpful to understand the mechanism behind the commands. I am searching similar queries as mine on different groups and trying the commands suggested there, but without an understanding of the logic behind them. In this calculation, I don't want to treat missing values as zeros. The new variable should have missing values only if all three component variables are missing. If any one of them is not missing, the function should be applied to only that variable to calculate the new value.
Thanks Nick and Pearly. I am going to use the first command with an additional replace command to convert all zeroes to missing values.
The calculation seems bizarre, but that is your side. replace new_var = . if missing(X) & missing(Y) & missing(Z) seems to be a way to do what you also want.
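
Putting the two steps together, a minimal sketch on a made-up three-row dataset might look like this. It resets only the rows where all three variables are missing, rather than every zero, since a row could legitimately sum to zero. The /// line continuation assumes a do-file.

```stata
clear
input X Y Z
7 . .
. 7 7
. . .
end

* Step 1: sum the available terms, counting missing terms as 0
generate new_var = (cond(missing(X), 0, 3 * X) ///
    + cond(missing(Y), 0, 2 * Y) ///
    + cond(missing(Z), 0, 4 * Z)) / 7

* Step 2: rows with no information at all go back to missing
replace new_var = . if missing(X) & missing(Y) & missing(Z)

assert new_var == 3 in 1
assert new_var == 6 in 2
assert missing(new_var) in 3
```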
1

If you just want to "ignore missing values" without "treating them as zeros", the following will work:

clear
set obs 10

generate X = rnormal(5, 2)
generate Y = rnormal(10, 5)
generate Z = rnormal(1, 10)

replace X = . in 2
replace Y = . in 5
replace Z = . in 9

generate new_var = (((X)*3) + ((Y)*2) + ((Z)*4)) / 7 if X != . | Y != . | Z != .

list

     +---------------------------------------------+
     |        X          Y           Z     new_var |
     |---------------------------------------------|
  1. | 3.651024    3.48609    -24.1695   -11.25039 |
  2. |        .   14.14995    8.232919           . |
  3. | 3.689442   9.812483    1.154064    5.044221 |
  4. | 2.500493   13.02909     5.25539    7.797317 |
  5. |  4.19431          .    6.584174           . |
  6. | 7.221717   13.92533    5.045283    9.956708 |
  7. | 5.746871   14.26329    3.828253    8.725744 |
  8. | 1.396223    16.2358    19.01479    16.10277 |
  9. | 4.633088   13.95751           .           . |
 10. | 2.521546   4.490258   -3.396854     .422534 |
     +---------------------------------------------+

Alternatively, you could also use the inlist() function. Note that this condition is stricter: it keeps only observations where none of X, Y, Z is missing:

generate new_var = (((X)*3) + ((Y)*2) + ((Z)*4)) / 7 if !inlist(., X, Y, Z) 
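
To make the difference between the two if conditions concrete, here is a small sketch on made-up data. The indicator names any_present and all_present are hypothetical. Either way, the arithmetic itself still returns missing whenever any one term is missing.

```stata
clear
input X Y Z
1 2 3
. 2 3
. . .
end

* True when at least one of the three is non-missing
generate byte any_present = X != . | Y != . | Z != .

* True only when none of the three equals the missing value .
generate byte all_present = !inlist(., X, Y, Z)

list

* The partially missing row passes the first condition but not the second
assert any_present == 1 & all_present == 0 in 2

* X * 3 + Y * 2 + Z * 4 is missing for that row regardless
assert missing((X * 3 + Y * 2 + Z * 4) / 7) in 2
```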

3 Comments

Thanks Pearly. The problem is there are still missing values in the new var even when at least one variable is not missing. For example, if X is missing I want the new var to run the function for the values in Y and Z and produce a value instead of considering it missing. New var should only have missing values when all three X Y Z are missing. Do you have any suggestion on how that might be achieved?
It can be achieved by using the first command in @NickCox's answer (which you should also accept using the check-mark if it solves your problem).
It does, only that it treats the missing values as zero. I suppose I can add a replace command to change them to missing.
