After my last post, SAS vs SQL Update, I received a comment from SAS guru, Paul Dorfman. Paul suggested that unsorted data (more realistic) would process faster using a hash object than the SAS update statement.
This time there are 20 million rows and more columns in the main table and about one million in the upd table to better illustrate differences. The results of this test show that the hash object (21.52 seconds) is about three times faster than the SAS update statement (60.17 seconds). When looking at the graph, the purple stack represents the main sort (37.45) and the red area is the upd sort (0.59).
Of course, hash objects load data into memory and do not require sorting while the update statement does. Even after removing the sort operations, the hash object was slightly faster than the update statement (21.52 vs 22.13 using real time).
The hash object's FIND method returns a value [ _iorc_ does not get added to the program data vector (PDV) ] that indicates whether the key is in the hash object. If the key is in the hash object, then the FIND method also sets the data variable to the value of the data item so that it is available for use after the method call.
For more information on hash objects, check out this excellent paper by Paul Dorfman: Data Step Hash Objects as Programming Tools.
data main upd( keep = id lname ) ; length id 8 fname lname $16 address $32 city $16 state $2 zip $9 phone $10 ; fname = 'Alexandra' ; lname = 'Smith' ; address = '12345 Southwest Long Street' ; city = 'Overland Park' ; state = 'KS' ; zip = '66213-12345' ; phone = '9131112222' ; do id = 2e7 to 1 by -1 ; output main ; if ranuni( 1 ) <= .05 then do ; lname = 'Jones' ; output upd ; end ; end ; run ; proc sort data = main ; by id ; run ; proc sort data = upd ; by id ; run ; data sasupdate ; update main upd ; by id ; run ; data hashupdate ; if _n_ = 1 then do ; dcl hash h (dataset: "upd", hashexp: 20) ; h.definekey( "id" ) ; h.definedata( "lname" ) ; h.definedone() ; end ; set main ; /* _iorc_ is an automatic variable that is not added to the data set */ _iorc_ = h.find() ; run ; /*SGPlot code */ proc sgplot data = x ; title 'SAS Update vs Hash Object on 1M of 20M rows' ; vbar method / response = time group = task dataskin = gloss stat = sum datalabel ; xaxis display = ( nolabel ) ; yaxis grid ; run ;