I have a dataset with observations about households; within each household there are individuals. The number of individuals per household differs. Households are identified with an id and members of the household are identified according to the order they were interviewed. So if household 1 had 4 members, the variable id is the same across all of them, but variable order goes from 1 to 4. The problem I have is that, for some variables, only the first member of the household answered for the rest of the members; therefore I have a mixture of long and wide format within my dataset.
What I need to do is assign to the corresponding members of the household the values that were answered by the first member of the household. To explain the structure of my data further, I'll give the following toy example:
clear
input ///
id order age o_a1 o_a2 v1a1 v1a2 v2a1 v2a2 o_b1 o_b2 v1b1 v1b2 v2b1 v2b2
1 1 54 1 . 50 . 100 . . . . . . .
1 2 50 . . . . . . . . . . . .
1 3 27 . . . . . . . . . . . .
1 4 18 . . . . . . . . . . . .
2 1 60 3 4 70 23 10 15 2 5 80 90 100 140
2 2 72 . . . . . . . . . . . .
2 3 58 . . . . . . . . . . . .
2 4 20 . . . . . . . . . . . .
2 5 23 . . . . . . . . . . . .
end
In the toy example above, I have a household-level variable id and individual-level variables: order corresponds to the order of the individual in the household; age is their age. The other variables correspond to debts. A household can report at most two debts for each type of debt. In this case there are two types of debt, debt a and debt b.
o_a1 gives the order of the member of the household with the first debt of type a. If we look at row 5 in the dataset, o_a1 is 3, meaning that the individual of the household with that debt is the third individual of the household, that is row 7, the one aged 58. Similarly, o_a2 indicates the order of the individual with the second debt.
v1a1 and v1a2 correspond to the first variable of debt a, for example the size of the debt in dollars. This means that member 3 of household 2 would be in debt for $70 and member 4 of household 2 would be in debt for $23. Variables v2a1 and v2a2 correspond to a second variable of the debt, and so on.
Then we have another type of debt, debt b, and the logic is the same as before.
In reality the data contains many variables for each debt, and many more types of debt (educational, housing, credit, credit cards, etc.). As I'm not yet sure which debts I'm going to study, I want to store the information from each type of debt in a different dataset, and then merge the data of interest using the variables id and order as identifiers. So I want to have a table for each debt, and keep the variables of the individual (in this case their age) in other tables as well. In the actual dataset other variables include sex, educational level, etc.
I already managed to do this for a couple of debts, but since there are many, I would like to know if there is a way to do this programmatically.
I'll show what I did for each type of debt.
1) I kept the variables for a certain debt and only for order == 1. In the case of debt a, I only kept variables o_a1 o_a2 v1a1 v1a2 v2a1 v2a2 and the identifiers id and order.
drop age
keep if order==1
keep id order *a*
2) I reshaped the data from wide to long to obtain the order of the individual of each household in long format, so that each debt was in the row with its corresponding debtor.
reshape long o_a , i(id) j(ncred)
3) I saved the reshaped data in a new file.
save "debt_a.dta", replace
4) I created a dataset for each credit of type a.
4.1) I created a dataset for debt a1 and dropped the observations that were missing in the newly created variable o_a.
use "debt_a.dta", clear
drop if o_a == .
Next, I dropped the variables corresponding to debt a2, and only kept the rows belonging to credit a1 (ncred == 1).
drop *a2
keep if ncred==1
To be able to append the datasets of debt a1 and debt a2 in step 5, I removed the substring a1 from the names of the a1 debt variables. I did the same in step 4.2 for debt a2.
foreach var of varlist * {
    local newname : subinstr local var "a1" "", all
    if "`newname'" != "`var'" {
        rename `var' `newname'
    }
}
save "debt_a1.dta", replace
4.2) The same as step 4.1 but for debt a2.
use "debt_a.dta", clear
drop if o_a == .
drop *a1
keep if ncred==2
foreach var of varlist * {
    local newname : subinstr local var "a2" "", all
    if "`newname'" != "`var'" {
        rename `var' `newname'
    }
}
save "debt_a2.dta", replace
5) Then I appended both datasets.
use "debt_a1.dta", clear
append using "debt_a2.dta"
drop ncred
replace order = o_a
drop o_a
sort id order
save "debts_a_long.dta", replace
So I ended up with the following dataset:
id order v1 v2
1 1 50 100
2 3 70 10
2 4 23 15
Therefore, now I can merge the individual debt data with other tables. Let's assume that the individual data table looked like this:
clear
input ///
id order age sex years_education
1 1 54 1 12
1 2 50 1 14
1 3 27 0 8
1 4 18 1 12
2 1 60 0 6
2 2 72 1 8
2 3 58 0 12
2 4 20 0 14
2 5 23 1 17
end
save "individual.dta", replace
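Hence, instead of having information only on age, I also have information on sex and years of education.
Now I can merge the debt data with the individual 'sociodemographic' data. In this case, the merge gives the same result as a left join in SQL, because every debtor has a match in the individual table (by default, joinby keeps only matched observations).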
use "debts_a_long.dta", clear
joinby id order using "individual.dta"
Which results in:
id order v1 v2 age sex years_education
1 1 50 100 54 1 12
2 3 70 10 58 0 12
2 4 23 15 20 0 14
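For comparison, the same join could probably also be written with merge, which makes the left-join behaviour explicit. This is only a minimal sketch: it assumes that "debts_a_long.dta" and "individual.dta" exist as saved above, and that "individual.dta" has exactly one row per id and order.

```stata
* Sketch only: assumes the two files saved earlier in this post exist
* and that individual.dta is uniquely identified by id and order.
use "debts_a_long.dta", clear

* m:1 because one individual could, in principle, hold several debts;
* keep(master match) keeps every debt row, matched or not (left join).
merge m:1 id order using "individual.dta", keep(master match)

* Inspect how many debt rows found a matching individual, then tidy up.
tabulate _merge
drop _merge
sort id order
list, sepby(id)
```

With the toy data every debt row has a match, so this should give the same three rows as joinby.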
That is why I want to create a table for each type of debt.
Is there a way to do this programmatically in Stata for each debt instead of writing the code several times?
renvars to rename them more efficiently. Then you can reshape your data with a variable representing the type of debt, so it would be easier to do a routine for each debt. You should also use tempfile for temporary .dta files like debt_a1.dta, etc.
renvars (I am an author) but rename groups has superseded it as of Stata 12. See example in my own reply.
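One possible reading of the suggestions in the comments is sketched below; it is not the example from the reply mentioned there. It assumes the toy naming scheme (o_<type><k> and v<j><type><k>), exactly two variables per debt (v1 and v2) and two debt types (a and b). The output name "debts_b_long.dta" is hypothetical and simply mirrors "debts_a_long.dta". The sketch relies on reshape long accepting several stubs at once, so no per-debt intermediate files are needed, and it uses tempfile only to hold the source data.

```stata
* Sketch only: toy data from the question, re-entered so the block runs alone.
clear
input id order age o_a1 o_a2 v1a1 v1a2 v2a1 v2a2 o_b1 o_b2 v1b1 v1b2 v2b1 v2b2
1 1 54 1 . 50 . 100 . . . . . . .
1 2 50 . . . . . . . . . . . .
1 3 27 . . . . . . . . . . . .
1 4 18 . . . . . . . . . . . .
2 1 60 3 4 70 23 10 15 2 5 80 90 100 140
2 2 72 . . . . . . . . . . . .
2 3 58 . . . . . . . . . . . .
2 4 20 . . . . . . . . . . . .
2 5 23 . . . . . . . . . . . .
end

* Hold the source data in a temporary file that Stata deletes afterwards.
tempfile households
save "`households'"

* Assumed debt types; extend this list for the real data.
foreach t in a b {
    use "`households'", clear

    * Only the first member carries the debt answers.
    keep if order == 1
    keep id o_`t'* v1`t'* v2`t'*

    * One reshape with three stubs puts debtor and debt values on one row.
    reshape long o_`t' v1`t' v2`t', i(id) j(ncred)
    drop if missing(o_`t')

    * The debtor's order becomes the identifier; strip the type letter.
    rename o_`t' order
    rename (v1`t' v2`t') (v1 v2)
    drop ncred

    sort id order
    save "debts_`t'_long.dta", replace
}
```

For debt a this should reproduce the three rows shown earlier for "debts_a_long.dta". With more variables per debt, the stub lists in keep, reshape and rename would need to be built for each type rather than typed out.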