C# - How do I ignore the UTF-8 Byte Order Marker in String comparisons?
I'm having a problem comparing strings in a unit test in C# 4.0 using Visual Studio 2010. The same test case works in Visual Studio 2008 (with C# 3.5).
Here's the relevant code snippet:
    byte[] rawData = GetData();
    string data = Encoding.UTF8.GetString(rawData);
    Assert.AreEqual("constant", data, false, CultureInfo.InvariantCulture);
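While debugging this test, the data string appears to the naked eye to contain exactly the same string as the literal. When I called data.ToCharArray(), I noticed that the first character of the string data has the value 65279 (U+FEFF), which is the Byte Order Marker; in UTF-8 it is encoded as the bytes EF BB BF. What I don't understand is why Encoding.UTF8.GetString() keeps this marker around.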
How do I get Encoding.UTF8.GetString() to not put the Byte Order Marker in the resulting string?
Update: The problem was that GetData(), which reads a file from disk, reads the data from the file using FileStream.ReadBytes(). I corrected this by using a StreamReader and converting the resulting string to bytes using Encoding.UTF8.GetBytes(), which is what it should've been doing in the first place! Thanks for the help.
Well, I assume it's because the raw binary data includes the BOM. You could always remove the BOM yourself after decoding, if you don't want it - but you should consider whether the byte array should include the BOM to start with.
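As a minimal sketch of removing the BOM after decoding: the helper below, StripBom, is a hypothetical name and not part of the framework. It assumes the only thing to drop is a single leading U+FEFF character, which is what Encoding.UTF8.GetString() produces when the bytes start with EF BB BF.

```csharp
using System;
using System.Text;

static class BomExample
{
    // Hypothetical helper: drops a single leading U+FEFF, if present.
    public static string StripBom(string text)
    {
        if (!string.IsNullOrEmpty(text) && text[0] == '\uFEFF')
        {
            return text.Substring(1);
        }
        return text;
    }

    static void Main()
    {
        // UTF-8 BOM (EF BB BF) followed by the letter 'A'.
        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x41 };
        string decoded = Encoding.UTF8.GetString(withBom);

        Console.WriteLine(decoded.Length);           // 2: U+FEFF plus 'A'
        Console.WriteLine(StripBom(decoded).Length); // 1: just 'A'
    }
}
```

With something like this in place, the assertion in the question could compare against StripBom(data) instead of data, although fixing the source of the bytes is probably the cleaner option.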
EDIT: Alternatively, you could use a StreamReader to perform the decoding. Here's an example, showing the same byte array being converted into two characters using Encoding.GetString or one character via StreamReader:
    using System;
    using System.IO;
    using System.Text;

    class Test
    {
        static void Main()
        {
            // UTF-8 BOM (EF BB BF) followed by 'A'
            byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x41 };

            // Encoding.GetString keeps the BOM as U+FEFF: length 2
            string viaEncoding = Encoding.UTF8.GetString(withBom);
            Console.WriteLine(viaEncoding.Length);

            // StreamReader consumes the BOM: length 1
            string viaStreamReader;
            using (StreamReader reader = new StreamReader
                   (new MemoryStream(withBom), Encoding.UTF8))
            {
                viaStreamReader = reader.ReadToEnd();
            }
            Console.WriteLine(viaStreamReader.Length);
        }
    }
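If the bytes have to go through Encoding.GetString anyway, a third possibility is to skip the BOM at the byte level before decoding. This is only a sketch: DecodeUtf8SkippingBom is a hypothetical helper name, and it assumes the input is UTF-8 with at most one BOM at the very start. It uses Encoding.UTF8.GetPreamble() to get the BOM bytes and the GetString(byte[], int, int) overload to decode from an offset.

```csharp
using System;
using System.Text;

static class PreambleExample
{
    // Hypothetical helper: decodes UTF-8 bytes, skipping a leading BOM if present.
    public static string DecodeUtf8SkippingBom(byte[] bytes)
    {
        byte[] preamble = Encoding.UTF8.GetPreamble(); // EF BB BF
        int offset = 0;

        if (bytes.Length >= preamble.Length)
        {
            bool hasBom = true;
            for (int i = 0; i < preamble.Length; i++)
            {
                if (bytes[i] != preamble[i])
                {
                    hasBom = false;
                    break;
                }
            }
            if (hasBom)
            {
                offset = preamble.Length;
            }
        }

        // Decode only the bytes after the BOM (or all of them if there is none).
        return Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);
    }

    static void Main()
    {
        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x41 };
        Console.WriteLine(DecodeUtf8SkippingBom(withBom).Length); // 1
    }
}
```

Compared with StreamReader, this keeps everything in memory and makes the BOM handling explicit, at the cost of a few more lines.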