Pages

SyntaxHighlighter

Friday, June 25, 2021

Creating Sample Data

Sample data plays important roles in testing and benchmarking applications. We need to be mindful of personally identifiable information or PII to not reveal actual user data. SAS does provide procedures such as surveyselect to create extended sample data from existing data but that can be risky if PII is not properly addressed. The use of SAS supplied test data such as SASHELP.CLASS or SASHELP.BASEBALL and the like are just too small to be of any value, especially when benchmarking.

As a result, I have created a macro named sampledata that creates fictious data of any size you like in terms of rows. You can modify the code to your desires to expand the number of columns but this gives you a very good start and explains the use of random numbers when applied to the SAS byte() function.

It is important to understand the American Standard Code for Information Interchange or ASCII character set. As it turns out my very first blog post from 10 years ago was on the creation of the printable ASCII values - see below (click to see a clearer image).

As you can see, numeric values 0 - 9 equate to decimal 48 to 57 in ASCII. If you use the SAS byte() function on ASCII 57 as BYTE(57) it will return a 9. Similarly upper case alphabet characters A - Z are represented as ASCII values 65 - 90. Finally, lowercase a - z are the same as ASCII 97 - 122.

Another critical consideration to create the data is to understand that SAS uses numbers to reprsent dates. SAS dates start with zero (0) on January 1, 1960 and increase or decrease from that base value. You will see a fake date of birth (dob) value has been created for each row. I chose a 60 year time span from 1930 - 1990 based on the SAS date methodology - see below. A span of 30 years in SAS is 365.25 * 30 = 10957 rounded. In order to expand this to a 60 year time period, I used the RAND() function and if the return value was less than 0.5 the value was multiplied by a positive 1 otherwise it was multiplied by a negative 1.

The most important piece of this code is the use of the RAND() function. CALL steaminit() is used to specify a starting seed value used by the RAND() function. If a positive number is used in STREAMINIT() the sequence is repeatable. The RAND() function will return a fractional value between 0 and 1 that you can then multiply by whatever value you like. You can see that in the case of both the firstname and lastname values that each are 16 characters in length so I used byte(int(97 + 26 * rand('uniform')) to generate lower case characters (the range A - Z is 26) and lower case ASCII values start at 97. An outer loop was used to create a name that can be up to 16 characters in length. I set a minimum length of 3 so the loop will be used anywhere between 3 and 16 times to create the names.

The distribution of sex ('F' / 'M') is set at 50.8% female. This ratio was taken from the Census bureau. The SAS IFC() function came in handy here.

I wanted to add some reality to this fake data and did that by using the SASHELP.ZIPCODE data set and augmented it by adding in an incrementing key value (1 to number of rows). This makes it easy to randomly extract a zip code from the hash object. I used the hash object reference of H along with its num_items attribute as h.num_items to return the total number of rows loaded into the hash object. Once a mathc is made (it has to in this scenario) then the associated satellite values are copied into the program data vector (PDV). The value column can be whatever you like but here it is set to a random value between a low side of 30,000 up to 1,000,000 so it could be the value of a house.

The results of running this code via %sampledata() will generate the output shown below. After the image, you will find the entire source code used to create the output.

%macro sampledata(
    rows   = 1000000
  , outdsn = work.sampledata
  , seed   = 123
);

  /* dsn of zip codes with incrementing key */
  data work.zipcodes;
    key + 1;
    set sashelp.zipcode(keep = zip city statecode areacode);
  run;

  data &outdsn(keep = id -- value);
    attrib
      id        length =   8
      firstname length = $16
      lastname  length = $16
      sex       length =  $1
      city      length = $35
      statecode length =  $2
      zip       length =   8
      phone     length = $10
      dob       length =   4 format = mmddyy10.
      age       length =   3
      value     length =   8 format = dollar12.
      areacode  length =   8
    ;

    dcl hash h(dataset: "work.zipcodes");
    h.definekey('key');
    h.definedata('zip', 'city', 'statecode', 'areacode');
    h.definedone();
      
    /* seed for random function */
    call streaminit(&seed);   

    do id = 1 to &rows;
      /* firstname can be between 3 and 16 characters long */
      t_firstlen = max(3, ceil(16 * rand('uniform')));
      do t_i = 1 to t_firstlen;
        if t_i = 1 then firstname = byte(int(65 + 26 * rand('uniform')));
        else substr(firstname, t_i, 1) = byte(int(97 + 26 * rand('uniform')));
      end;
      
      t_lastlen  = max(3, ceil(16 * rand('uniform')));
      do t_i = 1 to t_lastlen;
        if t_i = 1 then lastname = byte(int(65 + 26 * rand('uniform')));
        else substr(lastname, t_i, 1) = byte(int(97 + 26 * rand('uniform')));
      end;
      
      /* Females 50.8% per census:  https://census.gov/quickfacts */
      sex = ifc(rand('uniform') >= 0.508, 'F', 'M');
      
      key = ceil(h.num_items * rand('uniform'));
      if h.find() = 0 then do;
        phone = put(areacode, 3.);
        do t_i = 4 to 10;
          substr(phone, t_i, 1) = byte(int(48 + 10 * rand('uniform')));
        end;
      end;

             /* (365.25 * 30) = 10957 or 30 years of days */
      dob   = ceil(10957 * rand('uniform')) * ifn(rand('uniform') < .5, 1, -1);
      age   = int(yrdif(dob, date(), 'actual'));
      value = max(1000000 * rand('uniform'), 30000);
      output;
    end;

    stop;  /* terminates this data step */
  run;   
%mend;

1 comment:

  1. Neat utility! Do you think that ZIP code should be character or at least formatted with Z. for a leading zero as needed?

    ReplyDelete