Guidewire, SAP, Oracle, UNIX, Genesys Technology Blog: MS SQL Server: bcp import unicode file

2016-12-06

MS SQL Server: bcp import unicode file

Product: MS SQL Server
Version: 2005 - 2016

Overview

MS SQL Server bcp command line utility and BULK IMPORT statement(T-SQL) do not provide any way to do debugging, as well as previewing. This becomes frustrating when you are using fmt (format file) as well as throwing additional complication by using Unicode text file.

This post going to focus in bcp command line utility, but the information is applicable to BULK IMPORT statement.

bcp input (import) will display very generic, and useless error message when there the fmt file is incorrectly configured to match the data file's content. For example:

Error #1

Starting copy...

SQLState = S1000, NativeError = 0
Error = [Microsoft][SQL Native Client]I/O error while reading BCP format file

BCP copy in failed

Error #2:

Starting copy...

SQLState = S1000, NativeError = 0

Error = [Microsoft][SQL Native Client]Unexpected EOF encountered in BCP data-file

BCP copy in failed

Error #3:

Starting copy...

SQLState = 22001, NativeError = 0

Error = [Microsoft][SQL Native Client]String data, right truncation

SQLState = 22001, NativeError = 0

Error = [Microsoft][SQL Native Client]String data, right truncation

SQLState = 22001, NativeError = 0

Error = [Microsoft][SQL Native Client]String data, right truncation

SQLState = 22001, NativeError = 0

Error = [Microsoft][SQL Native Client]String data, right truncation

SQLState = 22001, NativeError = 0

Error = [Microsoft][SQL Native Client]String data, right truncation

SQLState = S1000, NativeError = 0

Error = [Microsoft][SQL Native Client]Unexpected EOF encountered in BCP data-file

0 rows copied.

Network packet size (bytes): 4096

Clock Time (ms.) Total : 312

Error #4 (no error, but nothing imported):

Starting copy...

0 rows copied.

Network packet size (bytes): 4096

Clock Time (ms.) Total : 94

Introduction

Microsoft MSDN has a very poor documentation about bcp. The documentation about format file, and unicode are insufficient, as well as no example given for all use case, and even worse for unicode.

There is only one error file using "-e" parameter, but most of the time, it is empty, and not helpful in troubleshooting, as well as providing any guidance to indicate which entry in the fmt file is wrong or incorrectly configured.

There are several challenges when configuring the fmt file with unicode:

Unicode - there are several unicode format for the bcp data file, addition to ANSI (regular) text file
Windows, UNIX, MacOS end-of-line - MacOS uses CR, UNIX uses LF, Windows uses CR+LF, custom program might use LF+CR or mixture of all 4 combinations
Multi lines fields - fields which contains ENTER key which will confuse the end-of-line

Unicode Data File Format

In Windows Notepad, there are 3 choices for unicode, as shown below.

MS SQL Server bcp only support "Unicode UCS2-Little Endian" format using their "bcp -w" parameter, or SQLNCHAR in fmt file. This unicode encoding is called UCS2-Little Endian with BOM, which is subtype of UTF-16. bcp supports both UCS2-LE with or without BOM. Notepad UTF-8 is referring to UTF-8 Little Endian encoding with BOM.

Note: when you use -f to use fmt file, you cannot specify -w parameter in command line. When "bcp out -w" is used to export to text file, UCS2-Little Endian with BOM is used.

When the bcp data file is created in other OS, or downloaded from browser(s), then the unicode encoding could be UTF-8, UCS2-Little Endian (Unicode in Notepad), or UCS2-Big Endian (e.g. AIX).

The text file that I used in this post is called import1.csv with content as below

"Case","Status","Assginee"
"論壇101","New","無"

Please keep this in mind when reading the post, because we are going to examine the file content in hex. Please keep following hex code in mind:

Line 1, first 3 characters' hex code are 0x22, 0x43, 0x61 - "Ca
Line 2, first 3 characters' hex code are 0x22, 0x8AD6, 0x58C7 - "論壇

I will highlight the BOM as well as the Traditional Chinese characters which are critical in identifying unicode encryption and character set. This is not explain in detail in MS SQL Server MSDN document, and hard to find in the Internet community.

The file will be saved as Tranditional Chinese BIG5 character set when ANSI is used. Please note that this is not a supported format for bcp, and needs to convert to UTF16 UCS2 or UCS2-LE.

Uses following PowerShell script to display the first 128 bytes (32 bytes across), and you will be able to tell exactly what unicode encoding the data file belongs to. For unicode with BOM, you will see the first 2 bytes to be FF FE or FE FF (which is called BOM) depending on the byte order (aka endian). For unicode without BOM, you won't see these 2 bytes.

PS> Get-Content "import1.csv" -Encoding Byte -TotalCount 128 -ReadCount 32 | ForEach-Object {$output = "";foreach ( $byte in $_ ) {$output += "{0:X2} " -f $byte}; $output}

Output 1: UCS2-Little Endian with BOM - the first 2 BOM bytes is always "FF FE" before 0x22

FF FE 22 00 43 00 61 00 73 00 65 00 22 00 2C 00 22 00 53 00 74 00 61 00 74 00 75 00 73 00 22 00
2C 00 22 00 41 00 73 00 73 00 67 00 69 00 6E 00 65 00 65 00 22 00 0A 00 22 00 D6 8A C7 58 31 00
30 00 31 00 22 00 2C 00 22 00 4E 00 65 00 77 00 22 00 2C 00 22 00 21 71 22 00 0A 00

Output 2: UCS2-Big Endian with BOM - the first 2 BOM bytes is always "FE FF" before 0x22

FE FF 00 22 00 43 00 61 00 73 00 65 00 22 00 2C 00 22 00 53 00 74 00 61 00 74 00 75 00 73 00 22
00 2C 00 22 00 41 00 73 00 73 00 67 00 69 00 6E 00 65 00 65 00 22 00 0A 00 22 8A D6 58 C7 00 31
00 30 00 31 00 22 00 2C 00 22 00 4E 00 65 00 77 00 22 00 2C 00 22 71 21 00 22 00 0A

Output 3: UTF-8 Little Ending with BOM (3 BOM bytes before 0x22) - 2 Chinees characters are represented with 3 bytes each, total of 6 bytes. This is the default format created by Notepad UTF-8 file format

EF BB BF 22 43 61 73 65 22 2C 22 53 74 61 74 75 73 22 2C 22 41 73 73 67 69 6E 65 65 22 0A 22 E8
AB 96 E5 A3 87 31 30 31 22 2C 22 4E 65 77 22 2C 22 E7 84 A1 22 0A

Output 4: UTF-8 Little Ending without BOM - It begins with 0x22. 2 Chinees characters are represented with 3 bytes each, total of 6 bytes

22 43 61 73 65 22 2C 22 53 74 61 74 75 73 22 2C 22 41 73 73 67 69 6E 65 65 22 0A 22 E8 AB 96 E5
A3 87 31 30 31 22 2C 22 4E 65 77 22 2C 22 E7 84 A1 22 0A

Output 5: ANSI Tranditional Chinese BIG5 character set

22 43 61 73 65 22 2C 22 53 74 61 74 75 73 22 2C 22 41 73 73 67 69 6E 65 65 22 0A 22 BD D7 BE C2
31 30 31 22 2C 22 4E 65 77 22 2C 22 B5 4C 22

Note: ANSI and UTF-8 will have identical output for ANSI characters, and will only display additional bytes when it is above FF (hex, or 255 decimal) ANSI character code. Therefore, you need to observe non-ANSI character set

Once you determined the unicode encoding is not UCS2-Little Endian, uses Notepad to save the data file as "Unicode" format. If you are using Notepad++, UltraEdit, Sublime Text, etc, then you can convert it to UCS2-LE/UCS2-Little Endian with or without BOM. bcp will accept both type.

When UTF-16 UCS2-LE is used (with or without BOM), you need to specify 2 byte characters for field deliminator as well as SQLNCHAR in fmt file for all fields. For UTF-8 or ANSI, you only specify 1 byte for field deliminator.

Wrongly identify the unicode file type will often encounter error 1 - 3 above.

End-of-Line (EOL)

EOL is a common problem when the data file is not exported from MS SQL Server database, but receive from other source, e.g. OS, save from browser, or programs. You need to indicate the EOL sequence when importing in the fmt file.

Wrongly specify the EOL will often see error 4 above, which is

Starting copy...

0 rows copied.

Network packet size (bytes): 4096

Clock Time (ms.) Total : 94

You can use above PowerShell script, Notepad++, Sublime Editor, UltraEdit, etc to see the character sequence(s) used as EOL. Following is screenshot from Notepad++

If you use "bcp -w" to import the data file, and not using fmt file, then you need to convert it to Windows EOF, which is CRLF, instead of LF.

Following are 4 EOL format for common OS:

Windows	CRLF
UNIX/Linux	LF
MacOS	CR
Program	LFCR

If fmt file is used, then you can specify the format in any of above format.

For EOL is LFCR, I recommend to convert it to Windows or UNIX EOL, or modify the program to use one of the standard EOL. This kind of format is risky to cause some confusion by human, as well as database utilities.

Windows Notepad does not support converting the EOL, so you need to use other text editor, or PowerShell to convert the EOL to Windows. Again, if you are using fmt, then you need to specify the EOL in the fmt file, and no file conversion is needed to convert the EOL.

Multilines Field

When you have multilines field, then you need to ensure EOL is unique to be able to identify as end of record (EOR).

Scan the bcp data file to ensure the multilines field does not conflict with EOL.

In my example, my EOR is "LF, or hex code 0x22, 0x0A

If there is a conflict, then you need to modify the export program (or bcp out) with a unique character strings, such as |"LF, or hex code 0x7C, 0x22, 0x0A

Failure to do so will experience error 1 or 2 above, which complains about EOF

bcp Format File

The MS SQL Server 2012 fmt file used for this data file is as below:

9.0
4
1 SQLNCHAR 0 0 "\"\0" 0 FIRST_QUOTE Latin1_General_100_CI_AS
2 SQLNCHAR 0 20 "\"\0,\0\"\0" 1 Case Latin1_General_100_CI_AS
3 SQLNCHAR 0 50 "\"\0,\0\"\0" 2 Assignee Latin1_General_100_CI_AS
4 SQLNCHAR 0 100 "\"\0\n\0" 3 Status Latin1_General_100_CI_AS

Explanation:

Line 3: This dummy field is to remove the double quote for the column 1. For UCS2-LE, "\0" is used to indicate the 0x22, 0x00. For UTF-8 and ANSI, 0x00 is not required
Line 2: 3 data fields, pleaes 1 dummy field. Total of 4 fields
Line 3-6: Quote-Comma-Quote field separate is always padded with "\0" for UCS2-LE. For UTF-8 and ANSI, 0x00 is not required
Line 7: The quote as well as LF are used of EOR. It will remove the double quote during import
Last line must be an empty line, or bcp will complain error 1 above
2nd field must be SQLNCHAR for all columns. If UCS2-LE is used, then you need to use this type for both varchar, and nvarchar columns, else you will get error 2 above
3rd field must be 0 for UCS2-LE, else you will encoutner error 2 above
4th field length is double the size of the nvarchar, or varchar field. E.g. nvarchar(10) or varchar(10), the length will be 20

bcp command:

bcp Forum.Mgmt in D:\temp\import1.csv -f forum1.fmt -U admin1 -P [password for admin1] -S hostname\[instance name] -F 2 -e bcp_forum1.log

Note: -F 2 is to skip line 1

Tips on bcp Configuration/Troubleshooting

Uses "bcp out -w" to output to UCS2-LE csv file and compare it against your data
Create a new table with few nvarchar and varchar columns to determine the correct field deliminator and end-of-line configuration
Create a csv file with 2-5 records to test the "bcp in" import, so that you are in better control of the data format
Replace field deliminator with tab, comma, |, instead of multi-chracters deliminator (like my example is 3 characters deliminator). Focus in simple field deliminaotr configuration before move on to multi-chraacters deliminator
Remove multi-line records from the testing csv, and test with data that does not contains multi-line. This allows you to confirm that your EOL deliminator works, before troubleshoot multi-line data import
Manually remove double quotes or single-quotes which requires additional fmt configuration. Make sure you have a working fmt file, before configure more complete deliminator
Specify "bcp -b1" to force import to commit every row, so that you can determine which row is violating the format file. This is used to resolve following error, which the error log is always empty, such as below

Starting copy...

1000 rows sent to SQL Server. Total sent: 1000

SQLState = S1000, NativeError = 0

Error = [Microsoft][SQL Native Client]Unexpected EOF encountered in BCP data-filee

1876 rows copied.

Network packet size (bytes): 4096

Clock Time (ms.) Total : 811 Average : (2313.19 rows per sec.)

No comments:

Subscribe to: Post Comments (Atom)