Converting Unicode to string
Unicode in JavaScript
I thought it would be cool to share an idea I had a few days ago while working on something related to strings in JavaScript. It may not sound interesting, since almost everyone knows strings in JavaScript, just not in depth, and that's okay. Most of us don't need to in our day-to-day work.
What caught my attention while dealing with strings was that I accidentally put a special symbol inside one. As far as I know, all special symbols in JavaScript are ultimately represented as Unicode. Here's the definition of Unicode from codeburst.io:
Unicode represents the characters it supports via numbers called code points. The hexadecimal range of code points is 0x0 to 0x10FFFF (17 times 16 bits).
That sounds cool, but we're not going to deep dive into that.
A JavaScript code unit is 16 bits wide (0x0000 - 0xffff). We can use String.prototype.charCodeAt() to read the integer value of the code unit at a given position in a string.
/** UTF-16 -> integer */
const input = '∆';
const charCode = input.charCodeAt(0);
console.log(charCode); // 8710
The triangle symbol returns an integer value of 8710, which fits comfortably within a single 16-bit code unit. We can convert the integer back to its symbol to prove the round trip works.
/** integer -> UTF-16 */
const charCode = 8710;
const hex = charCode.toString(16); /** '2206', the hexadecimal form used in the \u2206 escape */
console.log('\u2206' === String.fromCharCode(charCode)); // Returns true. Both directions work!
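For completeness, here's a small sketch going the other way, from the hexadecimal form back to the character; parseInt() simply turns the hex string into the same integer, and the variable names are just for illustration:

/** hex -> UTF-16 (a small sketch) */
const hex = '2206'; // the hex form of 8710
const backToInt = parseInt(hex, 16); // 8710 again
console.log(String.fromCharCode(backToInt)); // ∆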
Beautiful! How about Unicode characters like emoji? Let's give it a try.
/** Unicode -> integer */
const input = '😄';
const charCode = input.charCodeAt(0); // 55357
console.log(input === String.fromCharCode(charCode)); // Returns false. What!?
The integer is still within the 16-bit range, yet converting it back doesn't give us the emoji. What's going on here?
According to the description of String.prototype.charCodeAt(), it always returns a value less than 65536. To examine the actual value of the emoji, we have to use a new method introduced in ES6: String.prototype.codePointAt().
The modified code will be as follows:
/** Unicode -> integer (with String.prototype.codePointAt()) */
const input = '😄';
const charCode = input.codePointAt(0); // 128516 or 0x1f604
You can see that the actual integer value is larger than anything a single 16-bit code unit can hold, which is exactly why charCodeAt() couldn't tell us the whole story earlier.
By taking a closer look at the composition of the emoji, you'll find an interesting fact.
console.log('😄'.split('')); // Array [ "\ud83d", "\ude04" ]
Now you can see how things work behind the scenes. Instead of a single code unit, the emoji is stored as two of them (a surrogate pair), because its code point is too large to fit into one. That makes more sense now. Besides that, String.fromCodePoint(), another useful method introduced in ES6, converts the integer back to its actual representation.
console.log(String.fromCodePoint(128516)); // 😄
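If you're curious how those two code units relate to the code point 128516, the split follows the standard UTF-16 surrogate pair formula. Here's a small sketch of the arithmetic (in practice you'd let the engine do this for you):

/** code point -> surrogate pair (a sketch of the UTF-16 formula) */
const codePoint = 128516; // 0x1f604
const offset = codePoint - 0x10000; // subtract the 16-bit range
const high = 0xd800 + (offset >> 10); // 55357, i.e. "\ud83d"
const low = 0xdc00 + (offset & 0x3ff); // 56836, i.e. "\ude04"
console.log(String.fromCharCode(high, low)); // 😄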
Wrap-up
All JavaScript code units are 16 bits wide. In the early days, commonly used Unicode characters could be represented by a single 16-bit number, but that's no longer enough for today's full range of characters. That's why String.fromCodePoint() and String.prototype.codePointAt() were introduced in ES6: they handle the complete range of Unicode, including characters that consist of more than one code unit.
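As a side note, the code-point-aware ES6 features such as the spread operator and for...of also iterate by code point rather than by code unit, so they treat an emoji as a single character. A small sketch:

/** iterating by code point vs. code unit (a small sketch) */
const face = '😄';
console.log(face.length); // 2, because length counts 16-bit code units
console.log([...face].length); // 1, because spread iterates by code point
for (const ch of face) {
  console.log(ch.codePointAt(0)); // 128516
}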
That's it for now. Have a wonderful day ahead, and I'll see you in the upcoming blog post. Peace! ✌️