Saturday, May 28, 2016

Compress Non-Printable Special Characters

Have you ever been bitten by flies that you feel but can't see? Those flies are often referred to as no-see-ums.

You can also get stung by non-printable and special characters (NPSC) in computer programming. Non-printable characters are those with an ASCII decimal value from 0 through 31, plus 127 (the DEL character), in the standard ASCII character set. Special characters are the extended (high-order) codes with decimal values from 128 through 255 inclusive.
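As a rough illustration of those ranges, the sketch below walks through a string one character at a time and reports any position whose code falls in 0-31 or 127-255. The test value and the variable names (sc, pos, code) are made up for this example, and a single-byte session encoding such as WLATIN1 is assumed.

```sas
data _null_ ;
  length sc $5 ;
  /* hypothetical test value: a tab (byte 9) and a high-order byte (200) mixed into ordinary text */
  sc = cat( 'AB', byte( 9 ), 'C', byte( 200 ) ) ;
  do pos = 1 to length( sc ) ;
    /* RANK() returns the decimal code of a single character */
    code = rank( substr( sc, pos, 1 ) ) ;
    /* control characters (0-31), DEL (127) and codes 128-255 */
    if code <= 31 or code >= 127 then
      put 'NPSC at position ' pos 'with decimal code ' code ;
  end ;
run ;
```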

Those pesky invisible characters can wreak havoc on the results of a process (see this paper for additional information). I found that the COMPRESS() function with the KW (keep writable) modifier did not work in my case, where the ASCII decimal 160 character (A0 in hex) was the first character in a variable.
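For readers who want to check this on their own system, here is a minimal sketch of that kind of attempt. It builds a hypothetical value with a leading byte 160, applies COMPRESS() with the KW modifier, and writes the code of the first character before and after to the log. The variable names are invented for illustration, a single-byte session encoding is assumed, and the outcome may well differ between encodings and locales.

```sas
data _null_ ;
  length sc kw $4 ;
  /* hypothetical test value: byte 160 followed by printable text */
  sc = cat( byte( 160 ), 'ABC' ) ;
  /* K = keep the listed characters, W = add printable (writable) characters to the list */
  kw = compress( sc, , 'kw' ) ;
  /* RANK() returns the code of the first character; 160 means the byte survived */
  first_sc = rank( sc ) ;
  first_kw = rank( kw ) ;
  put first_sc= first_kw= ;
run ;
```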

As a result, I wrote a function using PROC FCMP to strip those annoying characters out of a string. Below is the source code and a test program to remove the no-see-ums.

Be sure to assign a LENGTH to the variable that receives the result in the calling DATA step; otherwise that variable will be created with a length of 32,767 characters. The default return length for PROC FCMP is 33 characters, so I set the function's return length to the maximum width to avoid truncation.

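proc fcmp outlib = work.funcs.utilities ;
  /* Return var with control characters (0-31), DEL (127) and codes 128-255 removed */
  function compress_npsc( var $ ) $32767 ;
    length npschars $256 ;
    /* Build the list of characters to remove, one byte at a time */
    do i = 0 to 31, 127 to 255 ;
      npschars = cats( npschars, byte( i ) ) ;
    end ;
    /* With no modifier, COMPRESS() removes every character found in the list */
    return( compress( var, npschars ) ) ;
  endsub ;
run ;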

options cmplib = ( work.funcs ) ; 

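data test ;
  length sc cc $4 ;
  /* Byte 160 (hex A0) followed by ordinary text */
  sc = cat( byte( 160 ), 'ABC' ) ;
  cc = compress_npsc( sc ) ;
  /* RANK() returns the code of the first character */
  rank_sc = rank( sc ) ;
  rank_cc = rank( cc ) ;
  /* $HEX8. shows all four bytes of each character value in hex */
  hex_sc = put( sc, $hex8. ) ;
  hex_cc = put( cc, $hex8. ) ;
run ;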
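
If several variables need cleaning, one possible approach is to loop over every character variable in a data set with an array. This is only a sketch: the output data set name (clean), the array name (chars) and the index (_i) are made up for illustration, and it assumes the function and the test data set above have already been created in the same session. Because each result is assigned back to an existing variable, the original lengths are kept.

```sas
options cmplib = ( work.funcs ) ;

data clean ;
  set test ;
  /* _character_ picks up every character variable read from test */
  array chars {*} _character_ ;
  do _i = 1 to dim( chars ) ;
    /* overwrite each value with its cleaned version */
    chars{ _i } = compress_npsc( chars{ _i } ) ;
  end ;
  drop _i ;
run ;
```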