SyntaxHighlighter

Sunday, March 29, 2015

Shapefile Import and Census Maps

Click for larger image

I was inspired to write this blog entry after reading the article Here are the fastest growing and fastest shrinking counties in the US by Andy Kiersz. Because I am using SAS 9.2 TS2M3, the maps.uscounty data set does not reflect all county FIPS codes thus requiring creation of my own version from a shapefile. The missing county FIPS values can be identified here. Starting in version 9.3, SAS started using Gfk map data sets that are up to date and do reflect the counties missing in traditional map data sets.

The first step in the SAS program (see below) is to use an URL access method to directly read in the county level data from the Census web site. After that, I used proc mapimport to convert a shapefile to a SAS map data set. Because I only wanted to show the 50 states, FIPS defined territories were removed from the imported data.

The imported data is written to separate data sets for the 48 contiguous states, Alaska and Hawaii. After that, each of those data sets are projected for proper display on a map. Alaska is reduced in size by 60% and then both Alaska and Hawaii are repositioned under the southwestern states. Finally all projected data sets are combined together and then sorted by state.

Proc gremove is used to get the state outlines and then use the annotate facility to draw thicker state outlines. Annotation is also used to write values to the center of each state.

A framed shadowed legend is created and positioned to the right of the map output. I utilized a technique learned from SAS/Graph expert, Robert Allison, to create a blank -90 degree angled title to allow space for the legend.

The last step is to create choropleth maps using proc gmap. Both the state and county output use the same color patterns and levels (10). You can see the full PDF version of the output here.

%macro populationchange ;
  %local states statescount i colors colorscount ;
  %let states = states48 alaska hawaii ;
  %let statescount = %sysfunc( countw( &states. ) ) ;
  %let colors = cb0020 df4949 ef9277 f5c0a9 f6e4dd 
                e0ebf1 b3d5e6 82bbd8 4196c7 0471b0 ;
  %let colorscount = %sysfunc( countw( &colors. ) ) ;

  filename census url 
  "http://www.census.gov/popest/data/counties/totals/2014/files/CO-EST2014-alldata.csv" ;
  data statedata countydata ;
    length state county 5  ;
    infile census dsd dlm = "," truncover firstobs = 2 ;
    input sumlev region division state county stname $ ctyname : $32. 
          pop10 estb10 pest10 pet11 pest12 pest13 pest14 ;
    pctchg = ( ( pest14 - pest13 ) / pest13 ) ;
    format pctchg percent8.2 ;
    if county = 0 then output statedata ;
    else output countydata ;
    keep state county pctchg ;
  run ;
  filename census clear ;

  proc mapimport datafile = 'c:\temp\cb\cb_2013_us_county_500k.shp' 
    out = shapefile ;
    select geoid ;
  run ;

  data alaska hawaii states48 ;
    length state county 5 ;
    set shapefile ;
    where segment < 3 and not missing( x ) ;

    state = input( substr( geoid, 1, 2 ), 2. ) ;
    if state > 56 or state in ( 7, 14, 43, 52 ) then delete ;
    county = input( substr( geoid, 3, 3 ), 3. ) ;

    if sign( x ) = 1 then x = -x ;
         if state =  2 then output alaska;
    else if state = 15 then output hawaii ;
    else                    output states48 ;
    drop geoid ;
  run ;

  %do i = 1 %to &statescount. ;
    /* Project each map data set separately. */
    proc gproject data = %scan( &states., &i. ) 
      out  = projected%scan( &states., &i. )
      eastlong deg ;
      id state county ;
    run;
  %end ;

  data projectedalaska ;
    set projectedalaska ;
    x = ( x * .4 ) - .29629586 ;
    y = ( y * .4 ) - .142574262 ;
  run ;

  data projectedhawaii ;
    set projectedhawaii ;
    x = x -.31 ;
    y = y - .120521998 ;
    if x > -.3726 ;
  run ;

  %do i = 2 %to &statescount. ;
    proc append base=projectedstates48 data=projected%scan( &states., &i. ) ;
    run ;
  %end ;

  proc sort data = projectedstates48 out = counties ;
    by state ;
  run ;

  proc gremove data = counties out = states( where = ( segment <= 8 ) ) ;
    id county ;
    by state ;
  run ;

  data countiesoutline ;
    retain xsys ysys '2' when 'a' color 'black' size 2 style 'empty' ;
    length function $8 ;

    set states ;
      by state segment ;
    if first.state or first.segment then function = 'poly' ;
    else function = 'polycont' ;
  run ;

  data mapanno2 ;
    length function $8 text $16 size 8 ;
    retain xsys ysys '2' hsys '3' when 'a' style "'Albany AMT'" ;

    merge
      statedata  
      maps.uscenter( where = ( fipstate( state ) ne 'PR' ) ) ;
        by state ;

    text = catx( " / ", fipstate(state), strip(putn(pctchg,'percent7.2'))) ;
    lagocean = lag( ocean ) ;
    size = 1.5 ;

    if ocean = 'Y' then do ;
      function = 'label' ;
      position = '6' ;
      output ;
      function = 'move' ;
      output ;
    end ;

    position = '5' ;
    if ocean = 'N' then do ;
      if lagocean = 'Y' then do ;
        function = 'draw' ;
        size = round( 1.5 / 4, .01 ) ;
      end ;
      else do ;
        function = 'label' ;
        position = '2' ;
        text = fipstate( state ) ;
        output ;
        position = '5' ;
        text = putn( pctchg, 'percent7.2' )  ;
      end ;
    end ;
    output ;
  run ;

  options nodate nonumber orientation = landscape ;
  %do i = 1 %to &colorscount. ;
    pattern&i. v = s c = cx%scan( &colors., &i. ) ;
  %end ;
  legend1 position = right across = 1 mode = share
    label = ( position = top font = "albany amt/bold" j = c 'Percent Change' ) 
    shape = bar( .12in, .12in ) cshadow = gray 
    offset = ( 2, 3 ) frame ;
 
  title1 h = 3 "2013 to 2014 Population Change";
  title5 a = -90 h = 4pct " ";
  footnote j=l "Source: US Census Bureau, 2014 Population Estimates" ;

  ods listing close ;
  ods pdf file = 'c:\temp\populationchange.pdf' notoc ;
    proc gmap data = statedata 
      map = maps.us( where = ( fipstate( state ) ne 'PR' ) ) all ;
      id state ;
      choro pctchg / anno = mapanno2 legend = legend1 levels = &colorscount. ;
    run ;
    
    proc gmap data = countydata map = counties all ;
      id state county ;
      choro pctchg / anno = countiesoutline legend = legend1  
        levels = &colorscount.  ;
    run ;
    quit ;
  ods pdf close ;
  ods listing ;
%mend ;
%populationchange

Saturday, March 14, 2015

Data Driven Report using the VVALUEX function

The SAS VVALUEX function can be used to display the contents of a variable based on another variable. VVALUEX is somewhat similar to a SAS macro variable but it will create a 200 byte character variable unless the variable has been previously defined with a length or attrib statement. This really comes into play when data driving a program to change a column's format or conditionally display an image as is done here. Be sure to click on the images to see larger versions.

The cardata data set contains raw data collected from various sources on the fastest automobile for each of the Big 3 auto makers. The below image reveals the contents of this data set.

The carmetadata table is used to define the label and format to be used along with the name of the variable to be resolved. A tilde (~) is being used as the ODS escapechar - adding the underscore (_) simulates a non breaking space which is how the 4 character indentation works. Here is what that looks like:

A data step is used to read in the carmetadata values and then look up the corresponding values in the cardata table using the vvaluex function. Each of the rows in the cardata table is read using the SET POINT= random access technique. I used point= over the more popular do until( eof ) via the set end= option so did not have to reset the end= variable for each loop. Here is what the final cars data set looks like.

You can read the below code to see how all this works and view the final rendered PDF file here.

data cardata ;
  input name : $16. hp zeroto60 mpgcity mpghighway topspeed 
        msrp weight qtrmiletime qtrmilespeed ;
  datalines ;
  Mustang 662 3.5 15 24 189 55935 3850 11.6 126
  Corvette 650 2.95 15 22 200 78995 3524 10.95 127
  Challenger 707 3.6 13 21 199 63980 4439 10.8 126
  ;
run ;

data carmetadata ;
  infile datalines truncover ;
  input id measure & $32. format : $16. var : $32. ;
  datalines ;
   1 
   2 Horsepower  comma9. hp 
   3 Curb Weight  comma9.  weight 
   4 Zero To 60  8.2 zeroto60
   5 Quarter Mile  $32.
   6 ~_~_~_~_Time  8.2 qtrmiletime
   7 ~_~_~_~_Speed  comma9. qtrmilespeed 
   8 Miles Per Gallon  $32.   
   9 ~_~_~_~_City  comma9. mpgcity   
  10 ~_~_~_~_Highway  comma9. mpghighway 
  11 Top Speed  comma9. topspeed   
  12 Retail Price  dollar9. msrp    
  ;
run ;

data cars( keep = format measure corvette challenger mustang ) ;
  attrib
    corvette   length = 8 label = 'Chevrolet Corvette Z06'
    challenger length = 8 label = 'Dodge Challenger SRT Hellcat'
    mustang    length = 8 label = 'Ford Mustang Shebly GT500' ;
    
  set carmetadata ;
  if not missing( var ) then do ;
    do i = 1 to totalobs ;
      set cardata point = i nobs = totalobs  ;
           if name='Corvette'   then corvette   = input(vvaluex(var), best.) ;
      else if name='Challenger' then challenger = input(vvaluex(var), best.) ;
      else if name='Mustang'    then mustang    = input(vvaluex(var), best.) ;
    end ;
  end ;
run ;

options nodate nonumber missing = '' ;
ods escapechar = "~" ;
ods listing close ;
ods pdf file = "%sysfunc(pathname(work))\cars.pdf" style=sasweb startpage=no ;
title "Fastest American Production Cars by Manufacturer" ;

proc report data = cars ;
  column format measure challenger corvette mustang ;
  define format     / noprint ;
  define measure    / display '' ;
  define challenger / display ;
  define corvette   / display ;
  define mustang    / display ;
  compute corvette ;
    if missing( measure ) and missing( corvette ) then call define( _col_, 
      'style', 'style=[preimage="c:\temp\corvette500.png"]') ;
    else call define( _col_, 'format', format ) ;
  endcomp ;
  compute challenger ;
    if missing( measure ) and missing( challenger ) then call define( _col_, 
      'style', 'style=[preimage="c:\temp\challenger500.png"]') ;
    else call define( _col_, 'format', format ) ;
  endcomp ;
  compute mustang ;
    if missing( measure ) and missing( mustang ) then call define( _col_, 
      'style', 'style=[preimage="c:\temp\mustang500.png"]') ;
    else call define( _col_, 'format', format ) ;
  endcomp ;
run ;

title "Performance Results" ;
proc sgplot data = cardata ;
  hbar name / 
    response = qtrmiletime 
    legendlabel = 'Quarter Mile' 
    stat        = mean 
    datalabel  
    dataskin    = gloss
    fillattrs   = graphdatadefault
  ;
  hbar name / 
    response    = zeroto60 
    legendlabel = 'Zero to 60mph' 
    stat        = mean 
    datalabel 
    barwidth    = 0.6  
    dataskin    = gloss
    fillattrs   = ( color = cx7c95ca )
  ;
  xaxis label   = 'Seconds' ;
  yaxis display = ( nolabel ) ;
run ;

ods pdf close ;
ods listing ;

Monday, March 2, 2015

SAS Update vs Hash Object

After my last post, SAS vs SQL Update, I received a comment from SAS guru, Paul Dorfman. Paul suggested that unsorted data (more realistic) would process faster using a hash object than the SAS update statement.

This time there are 20 million rows and more columns in the main table and about one million in the upd table to better illustrate differences. The results of this test show that the hash object (21.52 seconds) is about three times faster than the SAS update statement (60.17 seconds). When looking at the graph, the purple stack represents the main sort (37.45) and the red area is the upd sort (0.59).

Of course, hash objects load data into memory and do not require sorting while the update statement does. Even after removing the sort operations, the hash object was slightly faster than the update statement (21.52 vs 22.13 using real time).

The hash object's FIND method returns a value [ _iorc_ does not get added to the program data vector (PDV) ] that indicates whether the key is in the hash object. If the key is in the hash object, then the FIND method also sets the data variable to the value of the data item so that it is available for use after the method call.

For more information on hash objects, check out this excellent paper by Paul Dorfman: Data Step Hash Objects as Programming Tools.

data 
  main 
  upd( keep = id lname ) 
; 
  length 
    id       8
    fname 
    lname    $16
    address  $32
    city     $16
    state    $2
    zip      $9 
    phone    $10
  ;

  fname   = 'Alexandra' ;
  lname   = 'Smith' ; 
  address = '12345 Southwest Long Street' ;
  city    = 'Overland Park' ;
  state   = 'KS' ;
  zip     = '66213-12345' ;
  phone   = '9131112222' ;

  do id = 2e7 to 1 by -1 ;
    output main ; 
    if ranuni( 1 ) <= .05 then do ; 
      lname = 'Jones' ; 
      output upd ; 
    end ; 
  end ; 
run ; 

proc sort data = main ; 
  by id ; 
run ; 

proc sort data = upd ; 
  by id ; 
run ; 

data sasupdate  ; 
  update main upd ; 
   by id ; 
run ; 

data hashupdate ; 
  if _n_ = 1 then do ; 
    dcl hash h (dataset: "upd", hashexp: 20) ; 
    h.definekey( "id" ) ; 
    h.definedata( "lname" ) ; 
    h.definedone() ; 
  end ; 
  set main ; 
  /* _iorc_ is an automatic variable that is not added to the data set */
  _iorc_ = h.find() ; 
run ; 

/*SGPlot code */
proc sgplot data = x ;
  title 'SAS Update vs Hash Object on 1M of 20M rows' ;
  vbar method / 
    response = time 
    group    = task 
    dataskin = gloss 
    stat     = sum 
    datalabel  ;
  xaxis display = ( nolabel ) ;
  yaxis grid ;
run ;