
How to work with UTF-8 encoded files in a WLATIN1 encoded SAS session?


Currently, we are using WLATIN1 (Windows 1252) encoding on our SAS session server (and there seems to be no upcoming shift to UTF-8).

I have an EXCEL file (an export from an internet database) which is encoded in UTF-8 and contains one special character, ≥, which is not supported in WLATIN1 encoding (Wiki reference). I know I can read in the EXCEL data using a different encoding like UTF-8, but the ODS will still print = instead of ≥.

I would like to match some strings containing this special character but SAS of course doesn't let me use this character natively. Is there any way to circumvent this?

MWE: Let's assume the EXCEL file contains one variable a and one observation a = 'this is a test ≥':

data encoding;
set excel;

if a ='this is a test ≥' then
put 'it works';
else 
put 'it does not work';

run;

Can this be fixed in any way? I tried '(*ESC*){unicode "2265"x}' instead of ≥ but couldn't get it to work. As always, any help/idea is very much appreciated!

EDIT: We are running SAS Release 9.4 TS1M5. Currently, the read-in of the EXCEL file is done by using PROC IMPORT:

filename temp "*.xlsx" encoding="utf-8";

proc import datafile=temp out=quality dbms=excel replace;
run;

Answers

answered 1 week ago Richard #1

If you are in a SAS server environment, you will need to set up a server whose startup settings enable Unicode support.

In a desktop environment, a session can be started with Unicode support from an icon deep in the SAS start menu. The command line is:

"C:\Program Files\SASHome\SASFoundation\9.4\sas.exe" 
           -CONFIG "C:\Program Files\SASHome\SASFoundation\9.4\nls\u8\sasv9.cfg"

The nls\u8\ config file will have some lines with encoding settings that can only be applied at the startup of the session, as well as pathing to the SAS dlls supporting a utf8 session.

…
-SET SASCFG "C:\Program Files\SASHome\SASFoundation\9.4\nls\u8"
-DBCS 
-LOCALE en_US
-ENCODING UTF-8
…
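
Because these settings only take effect at startup, it can help to confirm which encoding a running session actually uses before testing anything else. A minimal check, assuming nothing beyond the standard ENCODING system option and the SYSENCODING automatic macro variable:

```sas
/* Print the value of the ENCODING system option to the log */
proc options option=encoding;
run;

/* SYSENCODING is an automatic macro variable holding the session encoding */
%put NOTE: session encoding is &sysencoding;
```

A session started with the nls\u8 config file should report a UTF-8 encoding here, while a default session on Windows would be expected to report wlatin1.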

In a Unicode session, the log of the sample code below will distinguish between ≥ and =, and ODS will output ≥. When the same code runs in a default SBCS session, the ≥ is mapped to = even though the code editor shows ≥.

The font of the LOG window should be set to Consolas or other UTF-8 aware font.

data have;
input; a = _infile_; datalines;
this is a test ≥
run;

data want;
  set have;

  c1 = '≥';
  c2 = '=';
  put "NOTE: " (c:) (=);

  r1 = rank(c1);
  r2 = rank(c2);

  put "NOTE: " (r:) (=);

  if a = 'this is a test ≥' 
    then put "NOTE: " a 'it works';
    else put "NOTE: " a 'it does not work';
run;
proc print data=want;
run;
--------------------
NOTE: c1=≥ c2==
NOTE: r1=226 r2=61
NOTE: this is a test ≥ it works

Running the same code in a default (SBCS) session shows that ≥ has been transcoded to =:

NOTE: c1== c2==
NOTE: r1=61 r2=61
NOTE: this is a test = it works

The Enhanced Editor might be UTF-8 aware in all cases, but (I speculate) when the code is run, the submitted text is transcoded to the session encoding.
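
If the editor's transcoding is a concern, one way to sidestep it in a Unicode session is to write the character as a hex constant rather than typing it. The following is only a sketch, assuming the session encoding is UTF-8; the variable names ge, target and r are illustrative. In a WLATIN1 session the same three bytes would be read as three separate characters, so this does not help there.

```sas
data _null_;
  /* 'E289A5'x is the three-byte UTF-8 sequence for U+2265 (greater-than-or-equal) */
  ge = 'E289A5'x;

  /* Build the comparison string without typing the character in the editor */
  target = 'this is a test ' || ge;

  /* RANK looks at the first byte only: 'E2'x is 226, matching r1 in the log above */
  r = rank(ge);

  put "NOTE: " r= target=;
run;
```

The value in target could then be used in the comparison in place of the literal 'this is a test ≥'.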
