
My source data contains 200,000+ observations, and one of the many variables in the data set is "county." My goal is to write a macro that takes this one data set as input and splits it into 58 temporary data sets, one for each California county.

My first question is whether it is possible to specify the 58 counties on the data statement using something like a global reference array defined beforehand.

Second question is, assuming the output data sets have been properly specified on the data statement, is it possible to use a do loop to choose the right data set to write to?

I can get the comparison to work properly, but cannot seem to use an array reference to specify an output data set. This is most likely because I need more experience with the macro environment!

Please see below for the simplistic skeleton framework I have written so far. c_long array contains the names of each of the counties, c_short array contains a 3 letter abbreviation for each of the counties. Thanks in advance!

data splitraw;
    length county_name $15;
    infile "&path/random.csv" dsd firstobs=2;
    input county_name $ number;
run;

%macro _58countysplit(dxtosplit,countycol);
data <need to specify 58 data sets here named something like &dxtosplit_ALA, &dxtosplit_ALP, etc..>;
    set &dxtosplit;
    do i=1 to 58;
        if c_long{i}=&countycol then output &dxtosplit._&c_short{i};
    end;
run;
%mend _58countysplit;

%_58countysplit(splitraw,county_name);
  • In SAS, there's nearly always a better option than splitting one large dataset into lots of little ones - as you've already found, it greatly complicates your code. What are you trying to achieve by doing this? Commented Jan 23, 2015 at 13:36
  • Strictly speaking there is no such thing as a macro variable array. There are ways to use them more or less like arrays, but there is no technical construct as such. Commented Jan 23, 2015 at 15:57
  • Hi everyone, thank you for your amazing replies. The reason I am splitting into 58 files is that these files are to be distributed to 58 different people, each receiving only the records for their particular county. It can be done in Excel via VBA, but that is more complicated than SAS code and is also limited to 1,048k rows. Commented Jan 24, 2015 at 18:12

3 Answers


The code you provided will need to run through the large dataset 58 times, each time writing a small one. I have done it a bit differently. First I create a sample dataset with a variable "county" that contains ten different values:

data large;
  attrib county length=$12;
  do i=1 to 10000;
    county=put(mod(i,10)+1,ROMAN.);
    output;
  end;
run;

First, I start with finding all the unique values and constructing the names of all the different tables I would like to create:

proc sql noprint;
  select distinct compbl("large_"!!county) into :counties separated by " "
  from large;
quit;

Now I have a macro variable "counties" that contains the names of all the datasets I want to create.
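Before going further, it can help to look at what the macro variable actually holds. This is a minimal sketch, assuming the PROC SQL step above has just run; the %put lines are an addition for illustration and not part of the original answer:

```sas
/* Write the generated data set names to the log for a quick check */
%put counties=&counties;

/* SQLOBS is set by PROC SQL to the number of rows the last query returned */
%put distinct values found: &sqlobs;
```

With the sample data above, the log should list ten names, each starting with large_.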

Here I am writing the IF-statements to a file:

filename x temp;
data _null_;
  attrib county length=$12 ds length=$18;
  file x;
  i=1;
  do while(scan("&counties",i," ") ne "");
    ds=scan("&counties",i," ");
    county=scan(ds,-1,"_");
    put "if county=""" county +(-1) """ then output " ds ";";
    i+1;
  end;
run;
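To see the generated statements before including them, the temporary file can be echoed to the log. This is a small sketch, assuming the fileref x from the step above is still assigned:

```sas
/* Echo each generated line of the temporary file X to the log */
data _null_;
  infile x;
  input;          /* read the next line into the input buffer */
  put _infile_;   /* write that line to the log unchanged */
run;
```

An alternative is to write the include as "%inc x / source2;", which prints the included lines in the log as they are read.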

Now I have what I need to create the small datasets:

data &counties;
  set large;
  %inc x;
run;
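
As a sanity check, the row counts of the new data sets can be compared against the source. This is a sketch only, assuming the data sets were written to the WORK library with the large_ prefix used above:

```sas
/* List each generated data set with its observation count */
proc sql;
  select memname, nobs
  from dictionary.tables
  where libname = 'WORK'
    and memname like 'LARGE^_%' escape '^';  /* ^ makes the underscore literal */
quit;
```

The counts should add up to the number of observations in "large" if every county value matched one of the generated IF statements.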

1 Comment

As I read it, his not-quite-finished code was actually intended to work the way you have it above.

I agree with user667489: there is almost always a better way than splitting one large data set into many small data sets. However, if you want to proceed along these lines, there is a view in sashelp called vcolumn that lists all your libraries, their tables, and each column in each table, which should help you. Also, if you want

if c_long{i}=&countycol then output &dxtosplit._&c_short{i};

to resolve you might mean:

if c_long{i}=&countycol then output &&dxtosplit._&c_short{i};
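
For the sashelp.vcolumn pointer, a minimal query might look like the sketch below. It assumes the question's data set splitraw sits in the WORK library; library and member names are stored in upper case in that view:

```sas
/* List the columns of WORK.SPLITRAW with their type and length */
proc sql;
  select memname, name, type, length
  from sashelp.vcolumn
  where libname = 'WORK'
    and memname = 'SPLITRAW';
quit;
```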

1 Comment

I couldn't get SAS to recognize c_short{i} as a "data set name" even when hard-coded (if county="Alameda" then output c_short{1};). This is probably because the SAS array construct is meant to be a reference to variables and not truly used to "name" data sets? Thank you for your pointer to vcolumn, it has proven to be helpful.
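
One way around this is to keep the names in indexed macro variables and let a macro %do loop generate the code, since the names on the DATA and OUTPUT statements are fixed when the step is compiled and cannot come from a DATA step array while it runs. The following is a rough sketch, not a tested solution for the real data: the sample rows, the variables c_long1, c_short1 and so on, n_counties, and the macro name split_counties are all made up for illustration, and the real list would have 58 entries.

```sas
/* Hypothetical sample data standing in for the real file */
data splitraw;
  length county_name $15;
  input county_name $ number;
  datalines;
Alameda 1
Alpine 2
Alameda 3
;
run;

/* Hypothetical indexed macro variables used like an array */
%let n_counties = 2;
%let c_long1 = Alameda;  %let c_short1 = ALA;
%let c_long2 = Alpine;   %let c_short2 = ALP;

%macro split_counties(dxtosplit, countycol);
  %local i;
  /* Generate the list of output data sets, e.g. splitraw_ALA splitraw_ALP */
  data
    %do i = 1 %to &n_counties;
      &dxtosplit._&&c_short&i
    %end;
  ;
    set &dxtosplit;
    /* Generate one IF / ELSE IF branch per county */
    %do i = 1 %to &n_counties;
      %if &i > 1 %then else;
      if &countycol = "&&c_long&i" then output &dxtosplit._&&c_short&i;
    %end;
  run;
%mend split_counties;

%split_counties(splitraw, county_name)
```

The double ampersand makes the macro processor scan twice: &&c_short&i first becomes &c_short1 and then ALA. Running with "options mprint;" shows the generated DATA step in the log.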

It's likely, depending upon what you're actually trying to do, that BY processing is all you need. Nevertheless, here is a simple solution:

    %macro split_by(data=, splitvar=);
        %local dslist iflist;


        proc sql noprint;   
            select distinct cats("&splitvar._", &splitvar) 
            into :dslist separated by ' ' 
            from &data;

            select distinct 
            catt("if &splitvar='", &splitvar, "' then output &splitvar._", &splitvar, ";", '0A'x) 
            into :iflist separated by "else "
            from &data;
        quit;

        data &dslist;
            set &data;
            &iflist
        run;        
    %mend split_by;

Here is some test data to illustrate:

options mprint;

data test;
    length county $1 val $2;  /* $2 so the value 10 is not truncated */
    infile cards;
    input county val;
    datalines;
A 2
B 3
A 5
C 8
C 9
D 10
run;

%split_by(data=test, splitvar=county)

And you can view the log to see how the macro generates the DATA step you want:

 MPRINT(SPLIT_BY):   proc sql noprint;
 MPRINT(SPLIT_BY):   select distinct cats("county_", county) into :dslist separated by ' ' from test;
 MPRINT(SPLIT_BY):   select distinct catt("if county='", county, "' then output county_", county, ";", '0A'x) into :iflist separated 
 by "else " from test;
 MPRINT(SPLIT_BY):   quit;
 NOTE: PROCEDURE SQL used (Total process time):
       real time           0.01 seconds
       cpu time            0.01 seconds


 MPRINT(SPLIT_BY):   data county_A county_B county_C county_D;
 MPRINT(SPLIT_BY):   set test;
 MPRINT(SPLIT_BY):   if county='A' then output county_A;
 MPRINT(SPLIT_BY):   else if county='B' then output county_B;
 MPRINT(SPLIT_BY):   else if county='C' then output county_C;
 MPRINT(SPLIT_BY):   else if county='D' then output county_D;
 MPRINT(SPLIT_BY):   run;

 NOTE: There were 6 observations read from the data set WORK.TEST.
 NOTE: The data set WORK.COUNTY_A has 2 observations and 2 variables.
 NOTE: The data set WORK.COUNTY_B has 1 observations and 2 variables.
 NOTE: The data set WORK.COUNTY_C has 2 observations and 2 variables.
 NOTE: The data set WORK.COUNTY_D has 1 observations and 2 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.03 seconds
       cpu time            0.05 seconds
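
For comparison, the BY processing mentioned at the top of this answer could look something like the sketch below. It reuses the test data set from above; test_sorted and the output file name are hypothetical, and &path is assumed to be defined as in the question.

```sas
/* BY processing needs the data sorted (or indexed) by the BY variable */
proc sort data=test out=test_sorted;
  by county;
run;

/* One report section per county, with no separate data sets */
proc print data=test_sorted noobs;
  by county;
run;

/* A single county can be exported directly with a WHERE= data set option */
proc export data=test_sorted(where=(county="A"))
    outfile="&path/county_A.csv"   /* hypothetical output location */
    dbms=csv replace;
run;
```

Whether this is enough depends on what has to be handed to each recipient, so treat it as a starting point rather than a replacement for the macro.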

1 Comment

Hi matthew, your simple solution does a great job of generating the code that I currently have hard coded, it works great! I had to adjust my county names (compress spaces) to make the names unique. I will use your solution as a base of what functions to study (namely proc sql). Thank you very much!
