Unicode binary conversion (Java)

Posted May 27, 2020 (2 min read)

This post records personal learning notes, so its accuracy cannot be fully guaranteed; corrections are welcome.

Sometimes we encounter strings that contain sequences beginning with \u. These are Unicode escapes: each \uxxxx group stands for one Unicode character (strictly, one UTF-16 code unit, so a character outside the Basic Multilingual Plane needs two groups). What is the actual binary storage format of these encoded characters?
Unicode can represent most of the world's text, and in its most common encoding, UTF-8, a single character takes a variable 1 to 4 bytes. The origin and advantages of this design are not covered here; the focus is on how to convert between a \u escape string and its binary form.
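The variable 1 to 4 byte length can be checked directly. The following is a minimal sketch, not part of the original post; the class name Utf8Lengths and the helper utf8Length are made up for illustration, and the sample characters are arbitrary picks from each length class.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Number of bytes the given text occupies once encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file pure ASCII.
        System.out.println(utf8Length("A"));            // U+0041  -> 1
        System.out.println(utf8Length("\u00e9"));       // U+00E9  -> 2
        System.out.println(utf8Length("\u6d4b"));       // U+6D4B  -> 3
        System.out.println(utf8Length("\ud83d\ude00")); // U+1F600 -> 4 (surrogate pair)
    }
}
```

The last line shows why one \uxxxx group is not always one character: U+1F600 needs two groups in Java source, yet it is a single 4-byte character in UTF-8.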
In a Java source file saved as UTF-8, the characters and bytes of "测试" (Chinese for "test") are printed as follows:

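        import java.nio.charset.StandardCharsets;
        import java.util.Arrays;
        import java.util.stream.Collectors;

        String s = "测试"; // "test" in Chinese
        // Each UTF-16 code unit as hex, separated by tabs
        System.out.println(s.chars().mapToObj(Integer::toHexString).collect(Collectors.joining("\t")));
        // Name the charset explicitly instead of relying on the platform default
        byte[] bs = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(bs));
        /* Result:
        6d4b    8bd5
        [-26, -75, -117, -24, -81, -107] */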

The results show that the two characters of "测试" ("test") occupy six bytes in UTF-8. Converting the six numbers [-26, -75, -117, -24, -81, -107] to their two's-complement binary form gives the bytes actually stored:
11100110 10110101 10001011 11101000 10101111 10010101
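The signed-to-binary step can be reproduced in code. This is a small sketch under the assumption that plain JDK calls are enough; the class name ByteBits and the toBits helpers are hypothetical names chosen for this example.

```java
import java.util.StringJoiner;

class ByteBits {
    // Render one signed Java byte as its 8-bit two's-complement pattern.
    static String toBits(byte b) {
        // "& 0xFF" drops the sign extension; adding 0x100 and cutting the
        // leading '1' preserves leading zeros.
        return Integer.toBinaryString((b & 0xFF) + 0x100).substring(1);
    }

    // Render a whole byte array, one 8-bit group per byte.
    static String toBits(byte[] bytes) {
        StringJoiner joiner = new StringJoiner(" ");
        for (byte b : bytes) {
            joiner.add(toBits(b));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        byte[] bs = {-26, -75, -117, -24, -81, -107};
        // Prints: 11100110 10110101 10001011 11101000 10101111 10010101
        System.out.println(toBits(bs));
    }
}
```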
The values 6d4b 8bd5, obtained by calling Integer.toHexString on each char, are the Unicode code points of these two characters.
How are the two related?
The encyclopedia entry on UTF-8 gives the following introduction:

UTF-8 encoded byte meaning
  • For any byte B in a UTF-8 encoding, if the first bit of B is 0, B alone represents one character (an ASCII code);
  • If the first bit of B is 1 and the second bit is 0, B is a continuation byte inside a multi-byte (non-ASCII) character;
  • If the first two bits of B are 1 and the third bit is 0, B is the first byte of a character encoded in two bytes;
  • If the first three bits of B are 1 and the fourth bit is 0, B is the first byte of a character encoded in three bytes;
  • If the first four bits of B are 1 and the fifth bit is 0, B is the first byte of a character encoded in four bytes.
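
The rules in the list above map naturally onto bit masks. The sketch below is only a pattern check written for this note (the class Utf8ByteKind and its classify method are invented names); it does not reject bytes such as 0xC0 or 0xF5 that match a pattern but never occur in valid UTF-8.

```java
class Utf8ByteKind {
    // Returns 1-4 for a lead byte (the length of its sequence), 0 for a
    // continuation byte, and -1 for a byte matching none of the patterns.
    static int classify(byte b) {
        int v = b & 0xFF; // treat the byte as unsigned
        if ((v & 0b1000_0000) == 0)           return 1; // 0xxxxxxx
        if ((v & 0b1100_0000) == 0b1000_0000) return 0; // 10xxxxxx
        if ((v & 0b1110_0000) == 0b1100_0000) return 2; // 110xxxxx
        if ((v & 0b1111_0000) == 0b1110_0000) return 3; // 1110xxxx
        if ((v & 0b1111_1000) == 0b1111_0000) return 4; // 11110xxx
        return -1;
    }

    public static void main(String[] args) {
        byte[] bs = {-26, -75, -117, -24, -81, -107};
        StringBuilder sb = new StringBuilder();
        for (byte b : bs) {
            sb.append(classify(b)).append(' ');
        }
        System.out.println(sb.toString().trim()); // 3 0 0 3 0 0
    }
}
```

For the six bytes from the example, the result is two 3-byte lead bytes, each followed by two continuation bytes.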

Therefore, in the binary string obtained above, the leading bits of each 8-bit group are markers. The prefix 1110 says that the current character takes 3 bytes and that this byte is the first of them. The bytes that follow begin with 10, which marks them as continuation bytes of the same character.
Stripping the marker bits from the first three bytes and concatenating what is left gives 0110 110101 001011. The hexadecimal code point of "测", 6d4b, converted to binary is 0110 1101 0100 1011, which is the same 16 bits.
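The strip-and-merge step can be written with masks and shifts. This is an illustrative sketch only: decode3 and ThreeByteDecode are hypothetical names, and the method assumes its input really is one lead byte of the form 1110xxxx followed by two 10xxxxxx continuation bytes, with no validation.

```java
class ThreeByteDecode {
    // Assumes b0 is a 1110xxxx lead byte and b1, b2 are 10xxxxxx
    // continuation bytes; this sketch performs no validation.
    static int decode3(byte b0, byte b1, byte b2) {
        return ((b0 & 0x0F) << 12)  // low 4 bits of the lead byte
             | ((b1 & 0x3F) << 6)   // low 6 bits of the first continuation byte
             |  (b2 & 0x3F);        // low 6 bits of the second continuation byte
    }

    public static void main(String[] args) {
        int first = decode3((byte) -26, (byte) -75, (byte) -117);
        int second = decode3((byte) -24, (byte) -81, (byte) -107);
        System.out.println(Integer.toHexString(first));  // 6d4b
        System.out.println(Integer.toHexString(second)); // 8bd5
    }
}
```

The 4 + 6 + 6 payload bits add up to the 16 bits of the code point, which is why the two hex values match the output of Integer.toHexString earlier.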
The advantages of this layout are clear. The prefix scheme is extensible in principle: the original design allowed sequences of up to 6 bytes, although the current standard (RFC 3629) limits UTF-8 to 4 bytes. Because every byte announces its own role, a decoder can find character boundaries from any position in the stream, and ASCII-heavy text stays compact, which makes data easier to transfer. One-byte UTF-8 codes are identical to ASCII, so UTF-8 is a good default choice in most scenarios.
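The ASCII compatibility claim is easy to confirm. The following is a quick sketch with made-up names (AsciiCompat, sameBytes), comparing the bytes produced by the two charsets for the same text.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiCompat {
    // True when the text encodes to the same bytes under US-ASCII and UTF-8.
    static boolean sameBytes(String text) {
        return Arrays.equals(text.getBytes(StandardCharsets.US_ASCII),
                             text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(sameBytes("Test"));   // true: pure ASCII text
        System.out.println(sameBytes("\u6d4b")); // false: not representable in ASCII
    }
}
```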