How to create multiple files from a single file programmatically

Question

I have a dataset with observations about households; within each household there are individuals. The number of individuals per household differs. Households are identified with an id and members of the household are identified according to the order they were interviewed. So if household 1 had 4 members, the variable id is the same across all of them, but variable order goes from 1 to 4. The problem I have is that, for some variables, only the first member of the household answered for the rest of the members; therefore I have a mixture of long and wide format within my dataset.

What I need to do is assign to the corresponding members of the household the values that were answered by the first member of the household. To explain the structure of my data further, I'll give the following toy example:

clear
input    ///
    id order age o_a1 o_a2  v1a1 v1a2  v2a1   v2a2  o_b1 o_b2  v1b1 v1b2 v2b1 v2b2 
      1  1   54    1     .   50    .    100      .   .     .     .    .    .    .
      1  2   50    .     .    .    .      .      .   .     .     .    .    .    .
      1  3   27    .     .    .    .      .      .   .     .     .    .    .    .
      1  4   18    .     .    .    .      .      .   .     .     .    .    .    .
      2  1   60    3     4   70   23     10     15   2     5    80   90  100  140
      2  2   72    .     .    .    .      .      .   .     .     .    .    .    .
      2  3   58    .     .    .    .      .      .   .     .     .    .    .    .
      2  4   20    .     .    .    .      .      .   .     .     .    .    .    .
      2  5   23    .     .    .    .      .      .   .     .     .    .    .    .
end

In the toy example above, I have a household-level variable id and individual-level variables: order corresponds to the order of the individual in the household; age is their age. The other variables correspond to debts. A household can report at most two debts for each type of debt. In this case there are two types of debt, debt a and debt b.

o_a1 gives the order of the member of the household with the first debt of type a. If we look at row 5 in the dataset, o_a1 is 3, meaning that the individual of the household with that debt is the third individual of the household, that is row 7, the one aged 58. Similarly, o_a2 indicates the order of the individual with the second debt.

v1a1 and v1a2 correspond to the first variable of debt a, for example the size of the debt in dollars. This means that member 3 of household 2 would be in debt for $70 and member 4 of household 2 would be in debt for $23. Variables v2a1 and v2a2 correspond to a second variable of the debt, and so on.

Then we have another type of debt, debt b, and the logic is the same as before.

In reality the data contain many variables for each debt and many more types of debt (educational, housing, credit, credit cards, etc.). As I'm still not sure which debts I'm going to study, I want to store the information from each type of debt in a different dataset, and then merge the data of interest using the variables id and order as identifiers. So I want to have a table for each debt, and keep the variables of the individual (in this case their age) in other tables as well. In the actual dataset other variables include sex, educational level, etc.

I already managed to do this for a couple of debts, but since there are many, I would like to know if there is a way to do this programmatically.

I'll show what I did for each type of debt.

1) I kept the variables for a certain debt and only for order == 1. In the case of debt a, I only kept variables o_a1 o_a2 v1a1 v1a2 v2a1 v2a2 and the identifiers id and order.

drop age
keep if order==1
keep id order *a*

2) I reshaped the data from wide to long to obtain the order of the individual of each household in long format, so that each debt was in the row of its corresponding debtor.

reshape long o_a , i(id) j(ncred)

3) I saved the reshaped data in a new file.

save "debt_a.dta", replace

4) I created a dataset for each credit of type a.

4.1) I created a dataset for debt a1 and dropped the observations that were missing in the newly created variable o_a.

use "debt_a.dta", clear
drop if o_a  == .

Next, I dropped the variables corresponding to debt a2, and only kept the rows belonging to credit a1 (ncred == 1).

drop *a2
keep if ncred==1

To be able to append the datasets for debt a1 and debt a2 in step 5, I removed the substring a1 from the names of the a1 debt variables. I did the same in step 4.2 for debt a2.

foreach var of varlist * {
    local newname : subinstr local var "a1" "", all
    if "`newname'" != "`var'" {
        rename `var' `newname'
    }
}
save "debt_a1.dta", replace

4.2) The same as step 4.1 but for debt a2.

use "debt_a.dta", clear
drop if o_a  == .  
drop *a1
keep if ncred==2

foreach var of varlist * {
    local newname : subinstr local var "a2" "", all
    if "`newname'" != "`var'" {
        rename `var' `newname'
    }
}
save "debt_a2.dta", replace

5) Then I appended both datasets.

use "debt_a1.dta", clear 
append using "debt_a2.dta"
drop ncred 
replace order = o_a 
drop o_a
sort id order
save "debts_a_long.dta", replace

So I ended up with the following dataset:

id  order  v1   v2
 1      1  50  100
 2      3  70   10
 2      4  23   15

Therefore, now I can merge the individual debt data with other tables. Let's assume that the individual data table looked like this:

clear 
input ///
  id order age sex years_education
   1  1   54   1     12
   1  2   50   1     14 
   1  3   27   0      8
   1  4   18   1     12 
   2  1   60   0      6
   2  2   72   1      8
   2  3   58   0     12
   2  4   20   0     14
   2  5   23   1     17
end
save "individual.dta", replace

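Hence, instead of having information only on age, I also have information on sex and years of education.

Now I can merge the debt data with the individual 'sociodemographic' data. In this case, the result is the same as a left join in SQL, because every debt row has a matching individual (by default, joinby keeps only matched observations).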

use "debts_a_long.dta", clear
joinby id order  using "individual.dta"

Which results in:

id  order  v1   v2  age  sex  years_education
 1      1  50  100   54    1               12
 2      3  70   10   58    0               12
 2      4  23   15   20    0               14
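If some debt rows might have no matching individual, a left join in the strict SQL sense needs a little more than plain joinby, which drops unmatched observations unless its unmatched() option is set. One possible sketch uses merge instead. It assumes that "individual.dta" has been saved as above and that id and order uniquely identify its rows; the debt rows are typed in again here only to keep the example self-contained.

```stata
* Debt table in long format (same values as debts_a_long.dta above)
clear
input id order v1 v2
1 1 50 100
2 3 70 10
2 4 23 15
end

* m:1 because one individual could in principle hold several debts,
* while each (id, order) pair appears only once in individual.dta.
* keep(master match) keeps every debt row, matched or not (left join).
merge m:1 id order using "individual.dta", keep(master match) nogenerate

sort id order
list, sepby(id)
```

Dropping the keep() option would instead keep unmatched individuals as well, which corresponds to a full outer join.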

That is why I want to create a table for each type of debt.

Is there a way to do this programmatically in Stata for each debt instead of writing the code several times?

First, I would rename your variables so that each debt is identified by a number (or letter) at the end; see findit renvars to rename them more efficiently. Then you can reshape your data with a variable representing the type of debt, which makes it easier to write a routine for each debt. You should also use tempfile for temporary .dta files such as debt_a1.dta.
– timat, Commented Nov 10, 2016 at 12:44
Thanks for the mention of renvars (I am an author), but rename groups has superseded it as of Stata 12. See the example in my own reply.
– Nick Cox, Commented Nov 10, 2016 at 13:53

Nick Cox · Accepted Answer · 2016-11-10 13:49:02Z

This is clearly explained but equally a lot to grasp at once. It's best to ask one question at a time!

I'll make some initial comments and may extend this answer as time permits, unless naturally others post complete or better answers.

As a strategic comment: The impulse to subdivide the dataset into different datasets seems misguided here. Sure, you want to give analyses for different parts of the data, but I see no over-arching reason why multiple datasets will make that easier.

Similarly, there is a lot of emphasis on merge, joinby, and so forth, but I don't get the impression that they are required.

Here is token code to show how to copy information from the individual with order 1 to other observations:

clear
input id order age o_a1 o_a2 v1a1 v1a2 v2a1 v2a2 o_b1 o_b2 v1b1 v1b2 v2b1 v2b2 
1 1 54 1 . 50 . 100 . . . . . . .
1 2 50 . . . . . . . . . . . .
1 3 27 . . . . . . . . . . . .
1 4 18 . . . . . . . . . . . .
2 1 60 3 4 70 23 10 15 2 5 80 90 100 140
2 2 72 . . . . . . . . . . . .
2 3 58 . . . . . . . . . . . .
2 4 20 . . . . . . . . . . . .
2 5 23 . . . . . . . . . . . .
end 

forval j = 1/2 { 
    bysort id (order) : replace v1a`j' = v1a`j'[1] if order == o_a`j'[1] 
} 

list id order age *a1 *a2, sepby(id) 

     +------------------------------------------------------------+
     | id   order   age   o_a1   v1a1   v2a1   o_a2   v1a2   v2a2 |
     |------------------------------------------------------------|
  1. |  1       1    54      1     50    100      .      .      . |
  2. |  1       2    50      .      .      .      .      .      . |
  3. |  1       3    27      .      .      .      .      .      . |
  4. |  1       4    18      .      .      .      .      .      . |
     |------------------------------------------------------------|
  5. |  2       1    60      3     70     10      4     23     15 |
  6. |  2       2    72      .      .      .      .      .      . |
  7. |  2       3    58      .     70      .      .      .      . |
  8. |  2       4    20      .      .      .      .     23      . |
  9. |  2       5    23      .      .      .      .      .      . |
     +------------------------------------------------------------+
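The token code copies only v1a1 and v1a2. A minimal sketch of how the same idea might be extended to every debt type and every debt variable follows. It assumes the toy data are in memory; the lists a b and v1 v2 are taken from the toy example and would have to be adapted to the names in the real dataset.

```stata
* Loop over debt types (a, b), debt slots (1, 2) and debt variables (v1, v2).
* Within each household, the reported values sit in the first observation,
* and o_<type><slot>[1] says which member the debt belongs to.
foreach t in a b {
    forval j = 1/2 {
        foreach v in v1 v2 {
            bysort id (order) : replace `v'`t'`j' = `v'`t'`j'[1] if order == o_`t'`j'[1]
        }
    }
}

list id order v1* v2*, sepby(id)
```

As in the listing above, the first member's row still carries the values as reported, so those may need to be set to missing wherever the debtor is another member of the household.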

Note: To remove a suffix from a bundle of variable names, you don't need a loop.

rename *a1 *

strips the suffix a1 wherever it exists (subject to that request not being problematic otherwise).
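If one file per debt type is still wanted, steps 1 to 5 of the question could probably be collapsed into a single reshape inside a loop. The following is only a sketch under stated assumptions: the toy data are in memory, the debt types are a and b, each debt has the variables v1 and v2, and the output file names follow the debts_a_long.dta pattern used in the question.

```stata
foreach t in a b {
    preserve

    * Only the first member of each household reports the debts
    keep if order == 1
    keep id o_`t'? v?`t'?

    * One row per household and debt slot (ncred = 1, 2)
    reshape long o_`t' v1`t' v2`t', i(id) j(ncred)
    drop if missing(o_`t')

    * o_<type> holds the order of the debtor, so it becomes the identifier
    rename o_`t' order
    rename (v1`t' v2`t') (v1 v2)
    drop ncred

    sort id order
    save "debts_`t'_long.dta", replace

    restore
}
```

Because reshape handles both debt slots at once, the intermediate files debt_a1.dta and debt_a2.dta and the append step are no longer needed. For debt a, the saved file should contain the same three rows as the table in the question.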

Thanks, this is definitely a lot simpler than my approach. I suppose I wanted to follow an SQL approach since I had just learned about it and wanted to practice it.
Conversely, I don't know SQL terminology at all, although I may know the Stata syntax equivalents for some operations.
