This tutorial shows you how to encode and decode characters with UTF-8, UTF-16, and UTF-32 in Dart.
Character encoding defines how a character is represented as bytes. There are several encoding formats, and UTF-8 is the most commonly used. Other UTF encodings include UTF-16 and UTF-32. The main difference is how many bytes are needed to represent a character: UTF-8 uses at least one byte, UTF-16 uses at least two bytes, and UTF-32 always uses four bytes. This tutorial gives examples of how to perform encoding and decoding with those formats in Dart, including how to set the endianness and the BOM (Byte Order Mark).
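To make the size difference concrete, here is a minimal sketch that counts the bytes a string would occupy in each encoding. It only uses dart:convert and built-in String properties. The helper names utf8Length, utf16Length, and utf32Length are made up for this illustration, and the counts exclude any BOM.

```dart
import 'dart:convert';

/// Number of bytes [s] occupies in UTF-8.
int utf8Length(String s) => utf8.encode(s).length;

/// Number of bytes [s] occupies in UTF-16 (2 bytes per code unit, no BOM).
int utf16Length(String s) => s.codeUnits.length * 2;

/// Number of bytes [s] occupies in UTF-32 (4 bytes per code point, no BOM).
int utf32Length(String s) => s.runes.length * 4;

void main() {
  // 'a' (U+0061), euro sign (U+20AC), grinning face emoji (U+1F600)
  for (final s in ['a', '\u20AC', '\u{1F600}']) {
    print('UTF-8: ${utf8Length(s)}, '
        'UTF-16: ${utf16Length(s)}, '
        'UTF-32: ${utf32Length(s)}');
  }
  // UTF-8: 1, UTF-16: 2, UTF-32: 4
  // UTF-8: 3, UTF-16: 2, UTF-32: 4
  // UTF-8: 4, UTF-16: 4, UTF-32: 4
}
```

The emoji needs two UTF-16 code units (a surrogate pair), which is why UTF-16 is described as using at least two bytes rather than exactly two.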
Dependencies
Of the three encodings, Dart's built-in convert package only supports UTF-8. For UTF-16 and UTF-32, you can use the utf package. Add the following to the dependencies section of your pubspec.yaml file, then run 'Get dependencies'.
dependencies:
  utf: 0.9.0+5
Using convert Package
To use Dart's convert package, import the library first by adding the following:
import 'dart:convert';
To perform encoding, use:
List<int> utf8.encode(String input)
You only need to pass the string to be encoded.
To decode the bytes into a String, use:
utf8.decode(List<int> bytes, { bool allowMalformed = false })
If allowMalformed is set to true, invalid or unterminated octet sequences are replaced with the Unicode replacement character `U+FFFD` (�). If it's set to false and an invalid sequence exists, a FormatException is thrown.
Here is the usage example:
List<int> bytes = utf8.encode('www.woolha.com');
String value = utf8.decode(bytes);
print('bytes: $bytes');
print('value: $value');
Output:
bytes: [119, 119, 119, 46, 119, 111, 111, 108, 104, 97, 46, 99, 111, 109]
value: www.woolha.com
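The effect of allowMalformed described earlier can be sketched as follows. The helper names decodeLenient and isValidUtf8 are hypothetical and only wrap utf8.decode from dart:convert. The sample input contains the byte 0xFF, which never appears in valid UTF-8.

```dart
import 'dart:convert';

/// Decodes [bytes] leniently: invalid sequences become U+FFFD.
String decodeLenient(List<int> bytes) =>
    utf8.decode(bytes, allowMalformed: true);

/// Returns true if [bytes] can be decoded as strict UTF-8.
bool isValidUtf8(List<int> bytes) {
  try {
    utf8.decode(bytes); // allowMalformed defaults to false
    return true;
  } on FormatException {
    return false;
  }
}

void main() {
  // 'a', an invalid byte, then 'b'.
  final bytes = [0x61, 0xFF, 0x62];
  print(decodeLenient(bytes) == 'a\uFFFDb'); // true
  print(isValidUtf8(bytes)); // false
  print(isValidUtf8(utf8.encode('www.woolha.com'))); // true
}
```

Strict decoding is usually the safer default when the bytes come from an untrusted source, since silent replacement can hide data corruption.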
Using utf Package
Besides UTF-8, the utf package also supports UTF-16 and UTF-32. Below is the list of encoding and decoding functions provided by the utf package.
List encodeUtf8(String str)
String decodeUtf8(List<int> bytes,
[int offset = 0,
int length,
int replacementCodepoint = UNICODE_REPLACEMENT_CHARACTER_CODEPOINT])
List encodeUtf16(String str)
List encodeUtf16be(String str, [bool writeBOM = false])
List encodeUtf16le(String str, [bool writeBOM = false])
String decodeUtf16(
List<int> bytes,
[int offset = 0,
int length,
int replacementCodepoint = UNICODE_REPLACEMENT_CHARACTER_CODEPOINT]
)
String decodeUtf16be(
List<int> bytes,
[int offset = 0,
int length,
bool stripBom = true,
int replacementCodepoint = UNICODE_REPLACEMENT_CHARACTER_CODEPOINT]
)
String decodeUtf16le(
List<int> bytes,
[int offset = 0,
int length,
bool stripBom = true,
int replacementCodepoint = UNICODE_REPLACEMENT_CHARACTER_CODEPOINT]
)
List encodeUtf32(String str)
List encodeUtf32be(String str, [bool writeBOM = false])
List encodeUtf32le(String str, [bool writeBOM = false])
String decodeUtf32(
List<int> bytes,
[int offset = 0,
int length,
int replacementCodepoint = UNICODE_REPLACEMENT_CHARACTER_CODEPOINT]
)
String decodeUtf32be(
List<int> bytes,
[int offset = 0,
int length,
bool stripBom = true,
int replacementCodepoint = UNICODE_REPLACEMENT_CHARACTER_CODEPOINT]
)
String decodeUtf32le(
List<int> bytes,
[int offset = 0,
int length,
bool stripBom = true,
int replacementCodepoint = UNICODE_REPLACEMENT_CHARACTER_CODEPOINT]
)
The functions for the different encodings are similar. For encoding, the only required parameter is the value to encode (a String). For UTF-16 and UTF-32, you can choose between BE (Big Endian) and LE (Little Endian) by using the function with the be or le suffix respectively. The functions without a suffix use Big Endian. The suffixed functions are not available for UTF-8 because it's read byte by byte regardless of the CPU architecture. The functions with the be and le suffixes also have an optional parameter:
writeBOM: determines whether the BOM (Byte Order Mark) should be written. Defaults to false.
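To illustrate what endianness and the BOM mean at the byte level, here is a minimal sketch built only on dart:typed_data. It is not how the utf package is implemented. The function utf16Bytes and its writeBom parameter are hypothetical names chosen for this example.

```dart
import 'dart:typed_data';

/// Encodes [s] as UTF-16 bytes in the given byte order, optionally
/// prefixed with the BOM (U+FEFF). Illustrative helper only.
List<int> utf16Bytes(String s, Endian endian, {bool writeBom = false}) {
  final units = <int>[];
  if (writeBom) units.add(0xFEFF);
  units.addAll(s.codeUnits); // Dart strings are sequences of UTF-16 code units

  final data = ByteData(units.length * 2);
  for (var i = 0; i < units.length; i++) {
    // The endian argument decides which of the two bytes comes first.
    data.setUint16(i * 2, units[i], endian);
  }
  return data.buffer.asUint8List();
}

void main() {
  print(utf16Bytes('w', Endian.big)); // [0, 119]
  print(utf16Bytes('w', Endian.little)); // [119, 0]
  print(utf16Bytes('w', Endian.big, writeBom: true)); // [254, 255, 0, 119]
  print(utf16Bytes('w', Endian.little, writeBom: true)); // [255, 254, 119, 0]
}
```

The BOM is the same code point (U+FEFF) in both cases. Only its byte order differs, which is what lets a reader detect the endianness of the data that follows.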
For decoding, UTF-16 and UTF-32 also have BE and LE variants. You need to pass the bytes (List<int>) as the first argument. The optional parameters are:
offset: the offset into the list of bytes at which decoding starts.
length: limits the number of values to be decoded.
stripBom: whether to strip the leading BOM (BE and LE variants only).
replacementCodepoint: the code point used as the replacement character. Defaults to 0xFFFD.
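The role of stripBom can be sketched with a small hand-written decoder. This is an illustration under simplifying assumptions, not the utf package's implementation: the function decodeUtf16Bytes is a hypothetical name, it ignores a trailing odd byte, and it does not replace invalid surrogates.

```dart
import 'dart:typed_data';

/// Decodes UTF-16 [bytes] in the given byte order. When [stripBom] is true,
/// a leading BOM (U+FEFF) is dropped from the result. Illustrative only.
String decodeUtf16Bytes(List<int> bytes, Endian endian,
    {bool stripBom = true}) {
  final data = Uint8List.fromList(bytes).buffer.asByteData();
  final units = <int>[];
  // Read two bytes at a time; a trailing odd byte is ignored.
  for (var i = 0; i + 1 < bytes.length; i += 2) {
    units.add(data.getUint16(i, endian));
  }
  if (stripBom && units.isNotEmpty && units.first == 0xFEFF) {
    units.removeAt(0);
  }
  return String.fromCharCodes(units);
}

void main() {
  final bytes = [255, 254, 119, 0]; // LE BOM followed by 'w'
  print(decodeUtf16Bytes(bytes, Endian.little)); // w
  // Without stripping, the BOM stays in the string as an invisible character.
  print(decodeUtf16Bytes(bytes, Endian.little, stripBom: false).length); // 2
}
```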
Below are examples of using the encoding functions mentioned above on the same string, as well as the decoding functions, followed by the output.
import 'package:utf/utf.dart';

void main() {
  var text = 'woolha.com';

  var _8 = encodeUtf8(text);
  print('8: $_8');

  var _16 = encodeUtf16(text);
  var _16Le = encodeUtf16le(text);
  var _16LeBom = encodeUtf16le(text, true);
  print('16: $_16'); // Big Endian
  print('16LE: $_16Le'); // Little Endian
  print('16LE - BOM: $_16LeBom'); // Little Endian, writeBOM = true

  var _32 = encodeUtf32(text);
  var _32Le = encodeUtf32le(text);
  var _32LeBom = encodeUtf32le(text, true);
  print('32: $_32'); // Big Endian
  print('32LE: $_32Le'); // Little Endian
  print('32LE - BOM: $_32LeBom'); // Little Endian, writeBOM = true

  // Decode each result with the function matching its encoding and endianness.
  print('8 - value: ${decodeUtf8(_8)}');
  print('16 - value: ${decodeUtf16(_16)}');
  print('16LE - value: ${decodeUtf16le(_16Le)}');
  print('16LE - BOM - value: ${decodeUtf16le(_16LeBom)}');
  print('32 - value: ${decodeUtf32(_32)}');
  print('32LE - value: ${decodeUtf32le(_32Le)}');
  print('32LE - BOM - value: ${decodeUtf32le(_32LeBom)}');
}
Output:
8: [119, 111, 111, 108, 104, 97, 46, 99, 111, 109]
16: [254, 255, 0, 119, 0, 111, 0, 111, 0, 108, 0, 104, 0, 97, 0, 46, 0, 99, 0, 111, 0, 109]
16LE: [119, 0, 111, 0, 111, 0, 108, 0, 104, 0, 97, 0, 46, 0, 99, 0, 111, 0, 109, 0]
16LE - BOM: [255, 254, 119, 0, 111, 0, 111, 0, 108, 0, 104, 0, 97, 0, 46, 0, 99, 0, 111, 0, 109, 0]
32: [0, 0, 254, 255, 0, 0, 0, 119, 0, 0, 0, 111, 0, 0, 0, 111, 0, 0, 0, 108, 0, 0, 0, 104, 0, 0, 0, 97, 0, 0, 0, 46, 0, 0, 0, 99, 0, 0, 0, 111, 0, 0, 0, 109]
32LE: [119, 0, 0, 0, 111, 0, 0, 0, 111, 0, 0, 0, 108, 0, 0, 0, 104, 0, 0, 0, 97, 0, 0, 0, 46, 0, 0, 0, 99, 0, 0, 0, 111, 0, 0, 0, 109, 0, 0, 0]
32LE - BOM: [255, 254, 0, 0, 119, 0, 0, 0, 111, 0, 0, 0, 111, 0, 0, 0, 108, 0, 0, 0, 104, 0, 0, 0, 97, 0, 0, 0, 46, 0, 0, 0, 99, 0, 0, 0, 111, 0, 0, 0, 109, 0, 0, 0]
8 - value: woolha.com
16 - value: woolha.com
16LE - value: woolha.com
16LE - BOM - value: woolha.com
32 - value: woolha.com
32LE - value: woolha.com
32LE - BOM - value: woolha.com
You can see the different output bytes produced by the different encodings and endiannesses. For decoding, it's also important to use the function that matches the encoding and endianness in order to get the correct value.
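As a small illustration of why the endianness must match, the sketch below reads the same two bytes in both byte orders using dart:typed_data. The helper firstCodeUnit is a hypothetical name for this example and is not part of the utf package.

```dart
import 'dart:typed_data';

/// Reads the first two bytes of [bytes] as one 16-bit code unit.
int firstCodeUnit(List<int> bytes, Endian endian) =>
    Uint8List.fromList(bytes).buffer.asByteData().getUint16(0, endian);

void main() {
  final bytes = [119, 0]; // 'w' encoded as UTF-16 LE

  final right = firstCodeUnit(bytes, Endian.little);
  final wrong = firstCodeUnit(bytes, Endian.big);

  print('LE: $right -> ${String.fromCharCode(right)}'); // LE: 119 -> w
  print('BE: $wrong (0x${wrong.toRadixString(16)})'); // BE: 30464 (0x7700)
}
```

Read as Big Endian, the bytes form the code unit 0x7700, which is a completely different character, so decoding with the wrong variant produces garbage instead of an error.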