Pages

SyntaxHighlighter

Monday, December 7, 2020

DOW Sort

Sorting data in ascending or descending order is a very easy and straightforward process.  However, there are times when a simple sort does not suffice.  Consider a scenario of submitting a large number of processes in a SAS grid environment that has a limit on the number of active parallel processes.

In that scenario, you will want to mix tasks that take a long time with those quicker running tasks.  There is normally a key driver that can be used to identify longer running processes.  The number of rows and columns are prime candidates to sort the data.

This example uses the available SASHELP.CLASS data set and sorts it by age and height in ascending order.  The ability to read SAS data sets in random order via the POINT= option in a SET statement make this relatively easy.  The idea is to read the highest or last row with the first X rows that are lower in value.  The technique used is a variant of the DOW Loop created and popularized by SAS Gurus Don Henderson, Paul Dorfman and Ian Whitlock.

The below data set is sorted by age and height.



The following data set shows the result of reading the last row then the top 5 rows.  After that the second to the last row is output followed by rows 6 through 10



/*******************************************************************************

     Program: sort_dow.sas

      Author: Tom Bellmer

     Created: 12/07/2020 @ 1:25:37 PM

SAS Version: SAS 9.4 (TS1M6)

          OS: LIN X64

     Purpose: Places subrows under the highest value via a DOW loop

       Notes: Assumes the input data set is sorted in ascending order

              will be used to stuff low value rows below high for RSUBMITS

              to reduce expected time in each parallel session

 

                             Modifications in descending order

FL-YYYYMMDD                             Description

----------- --------------------------------------------------------------------

 

         1    1    2    2    3    3    4    4    5    5    6    6    7    7    8

....5....0....5....0....5....0....5....0....5....0....5....0....5....0....5....0

*******************************************************************************/

 

%macro sort_dow(

    dsnin   =

  , subrows = 5

  , dsnout  = sort_dow

);

 

  %if not %sysfunc(exist(&dsnin)) %then %do;

    %put %str(E)RROR: Invalid input data set: &dsnin..;

    %return;

  %end;

 

  data &dsnout;

    if 0 then set &dsnin nobs = totobs;

 

    do until(totobs = i);

      set &dsnin point = totobs;

      totobs = totobs - 1;

      output;

      if totobs = i then stop;

 

      do _n_ = 1 to &subrows;

        i + 1;

        set &dsnin point = i;

        output;

        if totobs = i then stop;

      end;

    end;

    stop;

  run;

%mend;

 

/*EOF: sort_dow.sas */

 

 

 

proc sort data = sashelp.class out = class;

  by age height;

run;

 

%sort_dow(

    dsnin   = class

  , subrows = 5

  , dsnout  = sort_dow

);

Monday, October 26, 2020

FREQ N Faster

 


I needed to capture and count the unique value of each character column in SAS data sets.  My first thought was to use PROC FREQ and pass all the character variables to the TABLE statement.

Unfortunately, I quickly ran into an out of memory issue and was unable to specify a memsize setting due to established constraints.  It is a shame to have to call PROC FREQ once for each variable but that approach did get me past the out of memory issue.

The best way to test different approaches was to create some test data.  That is the purpose of the createsampledata macro - source code can be found below.  That code created random upper case characters between ASCII values 65 and 90 (A - Z).  In order to simulate real world data better, the length of each value can fluctuate from 0 to 16 characters.  Instead of making each value unique, proc append is used to copy in existing rows to again better simulate real world data.

I decided to try a hybrid approach to push some, but not all variables to the TABLE statements.  The below code shows the results using the sample test data. Your mileage may vary based on the number of rows processed, character length and the cardinality of the data.

As you can see in the below chart, the sweet spot for this set of data is to use 10 columns in the table statement.  After that the elapsed time increases.  Try utilizing this technique to reduce the amount of time it takes to process your freqn data.



%macro createsampledata(

    rows      = 1000000

  , byvalue   = 1000

);

  %local i;

  proc datasets

    lib = work

    nolist

    kill

  ;

  quit;

 

  data results;

    attrib

      stepvalue length = 3

      duration  length = 8 format = time12.3

    ;

    call missing(of _all_);

    stop;

  run;

   

  /* Create sample data */

  data tempdata(drop = t_:);

    array dummy {*} $16 char_1 - char_100;

    call streaminit(123);  

    

    do t_0 = 1 to &rows by &byvalue;

      do t_1 = 1 to dim(dummy);

        do t_2 = 1 to int(17 * rand('uniform'));

          substr(dummy[t_1], t_2) = byte(int(65 + 26 * rand('uniform')));

        end;

      end;

      output;

    end;

  run; 

  

  /* Not everything is unique so add via byvalue */

  %do i = 1 %to &byvalue;

    proc append

      base = sampledata

      data = tempdata

    ;

    run;

  %end;

%mend;

 

%macro freqnfaster(stepvalue = 1);

  %local i j starttime columncount;

 

  proc sql noprint;

    select      name

    into        :name1 -

    from        dictionary.columns

    where           libname = "WORK"

                and memname = "SAMPLEDATA"

                and type = 'char'

    ;

  quit;

  %let columncount = &sqlobs; 

  

  %let starttime = %sysfunc(datetime());

  %do i = 1 %to &columncount %by &stepvalue;

    proc freq

      data  = work.sampledata

      order = freq

      noprint

    ;

    %do j = &i %to %sysfunc(min(%eval(&i + &stepvalue -1), &columncount));

      table &&name&j / missing out = work.freqout&j;

    %end;

    run;

  %end;

 

  proc sql;

    insert into work.results

      set

          stepvalue = &stepvalue

        , duration  = (datetime() - &starttime)

    ;

  quit;

%mend;

 

 

%createsampledata(

    rows      = 1000000

  , byvalue   = 1000

);

 

%freqnfaster(stepvalue = 1)

%freqnfaster(stepvalue = 2)

%freqnfaster(stepvalue = 3)

%freqnfaster(stepvalue = 5)

%freqnfaster(stepvalue = 8)

%freqnfaster(stepvalue = 10)

%freqnfaster(stepvalue = 15)

%freqnfaster(stepvalue = 20)

%freqnfaster(stepvalue = 25)

%freqnfaster(stepvalue = 30)

%freqnfaster(stepvalue = 50)

%freqnfaster(stepvalue = 100)

 

ods _all_ close;

ods listing gpath = "/myfolder";

ods graphics / reset = index imagename = 'freqnfast';

title 'PROC FREQ Results of 1M rows and 100 columns';

proc sgplot data = results;

  hbar stepvalue /

    response = duration

    stat = mean

    datalabel

  ;

  xaxis grid label = 'Duration';

  yaxis grid label = 'Step Value';

run;

Tuesday, August 11, 2020

find text in a file

 

Over time we tend to collect a large number of program files.  It never fails that after much time has past that you need to access some code in a file but can't recall which program the keyword was in.

I work on a Linux operating system so I can use the GREP command to handle the task but I never remember the syntax options and do not like the way the data is returned.  That is why I wrote a macro named %findtext() that handles this for you in SAS via the use of a PIPE command.

If want to work for the keyword 'bitwise' in all SAS programs in a folder and any sub-folders that GREP command will be as follows:

> grep -nRi 'updatestagingdetails' '/em_data1/prod/macro' --include=*.sas

/em_data1/prod/macro/etl.sas:524:       %updatestagingdetails(env_out=&env_out.);

/em_data1/prod/macro/updatestagingdetails.sas:2:*     Program: updatestagingdetails.sas

/em_data1/prod/macro/updatestagingdetails.sas:9:*       Usage: %updatestagingdetails(dsn=work.stagingdetails,env_out=%env())

/em_data1/prod/macro/updatestagingdetails.sas:21:%macro updatestagingdetails( dsn  = work.stagingdetails, env_out = %env() ) ;

/em_data1/prod/macro/updatestagingdetails.sas:40:/*EOF: updatestagingdetails.sas */

Notice in the above output that the values are separated by a colon (:) and it is overall hard to read in my opinion.  This is where the SAS SCAN and FIND functions come into play to create a cleaner easier to read result.

The SAS macro call will be as follows (only the first two parameters are required):

%findtext(

  path      = /em_data1/prod/macro

 , text     = updatestagingdetails 

  , ignorecase   = Y

  , exactmatch   = N

  , extension    = sas

  , outdsn       = findtextresults

  , printresults = Y

);


 

/*******************************************************************************

     Program: findtext.sas

      Author: Tom Bellmer

Responsible: Tom Bellmer

     Created: 08/10/2020 @ 3:00:45 PM

SAS Version: SAS 9.4 (TS1M6)

          OS: LIN X64

     Purpose: read the contents of files searching for the string value

       Usage: %findtext(path = /folder, string = sql);

       Notes: colon is used as output separator.  Scan() function had issues

              with third column that contained colons.

  Parameters: path = (folder path to be searched)

              text = (text to be searched inside each file)

              ignorecase = Y/N to ignore the case of the &string.

              exactmatch = N/Y to match whole &string value

              extension = sas (file extension, use * for all files)

              outdsn = findtextresults (output data set name)

              printresults = Y (proc print the results? Y/N)

 

                             Modifications in descending order

FL-YYYYMMDD                             Description

----------- --------------------------------------------------------------------

 

         1    1    2    2    3    3    4    4    5    5    6    6    7    7    8

....5....0....5....0....5....0....5....0....5....0....5....0....5....0....5....0

*******************************************************************************/

 

%macro findtext(

    path         =

  , text         =

  , ignorecase   = Y

  , exactmatch   = N

  , extension    = sas

  , outdsn       = findtextresults

  , printresults = Y

);

 

  %local options i;

  %do i = 1 %to 100;

    %local pathname&i;

  %end;

 

  %if %isblank(&path) %then %do;

    %put %str(E)RROR: Must pass in a value for path.;

    %return;

  %end;

 

  %if %isblank(&text) %then %do;

    %put %str(E)RROR: Must pass in a value for text.;

    %return;

  %end;

 

  /*   i = ignores case, n = show matched line number

     , R = recursively,  w = match whole word  */

  %let options = -nR;

  %if %upcase(&ignorecase) = Y %then %let options = &options.i;

  %if %upcase(&exactmatch) = Y %then %let options = &options.w;

 

  filename search pipe "grep &options '&text' '&path' --include='*.&extension'";

  data &outdsn(drop = fullfilename findpos);

    attrib

      Path         length = $256 label = 'Path'

      Filename     length = $128 label = 'Filename'

      Lineno       length = 8    label = 'Line No'

      Code         length = $256 label = 'Code'

      Fullfilename length = $256 label = 'Full Filename'

    ;

 

    infile search;

    input;

    fullfilename = scan(_infile_, 1':');

    filename     = scan(fullfilename, -1"/");

    path         = substr(_infile_, 1, lengthn(fullfilename) - lengthn(filename));

    lineno       = input(scan(_infile_, 2':'), 8.);

    findpos      = find(_infile_, ":");

    findpos      = find(_infile_, ":", findpos + 1);

    code         = substr(_infile_, findpos + 1);

  run;

  filename search clear;

 

  %if %getattr(dsn = &outdsn, attr = nlobs) = 0 %then %do;

    proc sql;

      insert into &outdsn

        set path     = "&path"

          , filename = "N/A"

          , lineno   = 0

          , code     = "No results were found using '&text'"

      ;

    quit;

  %end;

 

  %if %upcase(%substr(&printresults, 11)) = Y %then %do;

    proc sql noprint;

      select   distinct path

      into     :pathname1 -

      from     &outdsn;

    quit;

   

    title "Search results for: &text";

    %if &sqlobs = 1 %then %do;

      title2 "In Path: &pathname1";

    %end;

    proc print data = &outdsn;

      %if &sqlobs = 1 %then %do;

         var filename lineno code;

      %end;

    run;

    title;

  %end;

%mend;

 

/*EOF: findtext.sas */