Sampling for multiple columns at once and return to aggregated dataframe

Question

I have a dataset that looks like this:

Column 1   Column 2   Column 3    Column 4
  Male        35        USA         DC
  Female      10        USA         NYC

I've aggregated this dataframe to calculate the number of unique values in each column and the respective percentage of the total number of rows.

So my new dataframe looks like this (the values are just illustrative):

  Column Name   Nominal  Percent 
1 Col1             3      1.00
2 Col2          69333    99.51
3 Col3          65766    94.40
4 Col4          60727    87.16

For the second dataframe, I want to add a new column, Sample_1, holding one randomly sampled value from the corresponding column of the original data. Like this:

  Column Name   Nominal  Percent  Sample_1
1 Col1             3       1.00     Male
2 Col2           69333     99.51    25

I can't recall how to pull this off automatically for each column. I don't want to manually type each column-name. Any hints?

newdat$Sample_1 <- sapply(origdat, sample, size=1)? Note that the values will likely be upconverted to character: since at least one of your columns is character, none of them will retain their numeric or integer class.
– r2evans, Commented Aug 28, 2018 at 16:24
@r2evans it took me a while to realize that I had sorted the aggregated dataframe. But it works perfectly :) Thank you!
– Prometheus, Commented Aug 28, 2018 at 16:36
Sorry, yes of course, the ordering is relevant. If that's an issue, then you can generate a temporary data.frame(ColumnName=names(origdat), Sample_1=sapply(origdat, sample, size=1), stringsAsFactors=FALSE), then use merge or any of the joins within dplyr.
– r2evans, Commented Aug 28, 2018 at 16:38

r2evans · Accepted Answer · 2018-08-28 16:43:22Z

For completeness.

Data, slightly modified to make them consistent and R-friendly (no spaces):

origdat <- read.table(header=TRUE, stringsAsFactors=FALSE, text='
Column_1   Column_2   Column_3    Column_4
  Male        35        USA         DC
  Female      10        USA         NYC')

newdat <- read.table(header=TRUE, stringsAsFactors=FALSE, text='
  Column_Name   Nominal  Percent 
1 Column_1          3      1.00
2 Column_2       69333    99.51
3 Column_3       65766    94.40
4 Column_4       60727    87.16')

Verbose method, using a temporary data.frame to store the samplings:

set.seed(2)
tempdat <- data.frame(Column_Name = names(origdat),
                      Sample_1 = sapply(origdat, sample, size=1),
                      stringsAsFactors=FALSE)
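A side note on the sampling step above: base R's sample() treats a single number x >= 1 as shorthand for 1:x, so a numeric column with only one row would not return its own value. The sketch below samples by position instead. The helper name sample_one and the one-row data frame are made up for illustration and are not part of the original data.

```r
# Hypothetical helper (not from the original answer): pick one element by
# position, so a length-1 numeric column returns its own value instead of
# a draw from 1:x.
sample_one <- function(x) x[sample.int(length(x), 1)]

# Made-up one-row data frame, used only to illustrate the edge case
single <- data.frame(age = 35L, country = "USA", stringsAsFactors = FALSE)

set.seed(2)
safe <- sapply(single, sample_one)   # named character vector
safe
```

With the two-row origdat the plain sample call behaves as intended; the helper only matters when a numeric column can have length 1.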

Merging it in with base R:

merge(newdat, tempdat, by="Column_Name", all=TRUE)
#   Column_Name Nominal Percent Sample_1
# 1    Column_1       3    1.00     Male
# 2    Column_2   69333   99.51       10
# 3    Column_3   65766   94.40      USA
# 4    Column_4   60727   87.16       DC

Merging with dplyr:

dplyr::left_join(newdat, tempdat, by="Column_Name")
#   Column_Name Nominal Percent Sample_1
# 1    Column_1       3    1.00     Male
# 2    Column_2   69333   99.51       10
# 3    Column_3   65766   94.40      USA
# 4    Column_4   60727   87.16       DC
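If a temporary data.frame feels like overkill, the samples can also be matched by name, since sapply returns a vector named after the columns of origdat. This is a minimal sketch using the same example data; the re-sorting step is only there to show that the lookup does not depend on row order, and the variable name samp is arbitrary.

```r
# Recreate the example data from the answer
origdat <- read.table(header=TRUE, stringsAsFactors=FALSE, text='
Column_1   Column_2   Column_3    Column_4
  Male        35        USA         DC
  Female      10        USA         NYC')

newdat <- read.table(header=TRUE, stringsAsFactors=FALSE, text='
  Column_Name   Nominal  Percent
1 Column_1          3      1.00
2 Column_2       69333    99.51
3 Column_3       65766    94.40
4 Column_4       60727    87.16')

set.seed(2)
# One sampled value per column; the result is a character vector whose
# names are the column names of origdat
samp <- sapply(origdat, sample, size=1)

# Sort the aggregated data differently to mimic the situation in the comments
newdat <- newdat[order(newdat$Nominal, decreasing=TRUE), ]

# Look each sample up by column name, so row order no longer matters
newdat$Sample_1 <- unname(samp[newdat$Column_Name])
newdat
```

As with the merge approach, every value in Sample_1 ends up as character, because at least one column of origdat is character.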
