Something about Go strings that you should know

What will be the output of the following code snippet?

package main

import (
    "fmt"
)

func main() {
    var s = "abcdè"
    fmt.Println(len(s))
}

The string s has 5 characters, so the output should be 5, right? Or is it? 🤔

If we try running the code here: https://play.golang.com/p/4F4ZkyWJAiQ, we can find that the output is 6. That's interesting, why is it so? Let's try to understand.

In Go, a string is in effect a read-only slice of bytes. Let's examine our string and what it contains.

package main

import (
    "fmt"
)

func main() {
    var s = "abcdè"
    fmt.Printf("% x", s) // print every byte in hex, separated by spaces
}

https://play.golang.com/p/Q0m5l6fndhy

On executing the code, we are greeted with the following output.

61 62 63 64 c3 a8

Now things are starting to unfold: the string actually holds 6 bytes, and len reports the number of bytes rather than the number of characters, which is why we got 6.
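
The same thing shows up when indexing: s[i] gives back a single byte, not a character. Here is a minimal sketch of that; the helper name lastByte is made up for illustration, and è is written as the escape \u00e8 so the bytes do not depend on how the file is saved.

```go
package main

import "fmt"

// lastByte returns the final byte of s (hypothetical helper, for illustration only).
// It assumes s is not empty.
func lastByte(s string) byte {
	return s[len(s)-1]
}

func main() {
	s := "abcd\u00e8" // the same string as "abcdè"

	fmt.Println(len(s))             // byte count: 6
	fmt.Printf("%x\n", lastByte(s)) // a8, the second byte of the encoded è
	fmt.Printf("% x\n", []byte(s))  // 61 62 63 64 c3 a8
}
```

Indexing the last position returns a8, which is only half of è, so byte indexes are not safe for picking out characters once the text leaves the ASCII range.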

But what are these seemingly random values? Let's find out.

package main

import (
    "fmt"
)

func main() {
    var s = "abcdè"
    fmt.Printf("%+q", s) // quoted string with non-ASCII characters escaped
}

The %+q verb prints a quoted string in which every non-ASCII character is escaped.

https://play.golang.com/p/IlQUDUX9DHB

"abcd\u00e8"

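Since %+q escapes non-ASCII characters, we can see that the last character is the Unicode code point U+00E8. Go source code is UTF-8, so the string literal is stored as UTF-8 bytes, and the c3 a8 that we got earlier is the two-byte UTF-8 encoding of U+00E8. The four ASCII letters take one byte each (61 62 63 64), which gives 4 + 2 = 6 bytes. So our string holds the UTF-8 encoding of each character in its bytes, and we can check every byte sequence in a shell: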
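
To see where c3 a8 comes from, here is a small sketch that encodes U+00E8 by hand and compares it with the standard library's utf8.EncodeRune. The helper encodeTwoByte is a made-up name for illustration and only handles code points in the two-byte range U+0080 to U+07FF.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeTwoByte hand-encodes a code point in the range U+0080..U+07FF.
// Hypothetical helper for illustration; it does not validate its input.
func encodeTwoByte(r rune) [2]byte {
	return [2]byte{
		0xC0 | byte(r>>6),   // 110xxxxx: marker plus the top five bits
		0x80 | byte(r&0x3F), // 10xxxxxx: marker plus the low six bits
	}
}

func main() {
	r := rune(0x00E8) // è

	manual := encodeTwoByte(r)
	fmt.Printf("% x\n", manual[:]) // c3 a8

	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)
	fmt.Printf("% x\n", buf[:n]) // c3 a8
}
```

Both lines print c3 a8: 0xE8 shifted right by six bits is 0x03, and 0xC0 | 0x03 is 0xC3, while the low six bits are 0x28, and 0x80 | 0x28 is 0xA8.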

$ printf '\x61\n'
a
$ printf '\x62\n'
b
$ printf '\x63\n'
c
$ printf '\x64\n'
d
$ printf '\xC3\xA8\n'
è

So, that was the reason behind that behaviour. Feel free to post any questions and feedback below in the comments.


Aravind

Foodie🍗 | Motorhead 🏎️ | GSW 🏀 | Software Engineer @hashnode 🖥️


Yeah, this is kind of annoying (it's the same in Rust).

I used to wonder if it was just for performance that they don't count "characters", because that's obviously what we want to know.

But it turns out that "character" is a hard concept to define, and Unicode code points certainly don't correspond to characters. Unicode covers all kinds of languages, as well as symbols and pictures, and many of these are made up of several 'parts'.

It's not a technical issue, the real world is just messy. I think it's the way other languages are going too. It's one of the bigger changes in Python 3, and even Java might change (Java actually already had Unicode, but chose UTF-16).

This is an interesting article about the topic.

Monospace also gets tricky, I've switched to proportional fonts for most coding.

Totally agree, Mark.


That’s why Go introduced the concept of a rune, which represents a single Unicode code point.

s := "abcdè"
runes := []rune(s)
fmt.Printf("%d \n", len(s))
fmt.Printf("%d \n", len(runes))

Will print

6
5

playground: play.golang.com/p/9LipbYhD3j9
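
A related sketch, in case it helps: the rune count can also be obtained without converting to []rune, using utf8.RuneCountInString, and a range loop over a string walks it rune by rune while reporting byte offsets. The helper runeOffsets is a made-up name for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeOffsets returns the byte offset at which each rune of s starts.
// Hypothetical helper for illustration only.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s { // range decodes UTF-8 and yields the start offset of each rune
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "abcd\u00e8" // the same string as "abcdè"

	fmt.Println(len(s), utf8.RuneCountInString(s)) // 6 5
	fmt.Println(runeOffsets(s))                    // [0 1 2 3 4]

	for i, r := range s {
		fmt.Printf("%d: %c (%U)\n", i, r, r)
	}
}
```

The last iteration prints 4: è (U+00E8), so the two bytes c3 a8 are handed back as one rune starting at byte offset 4.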


Definitely closer, but e.g. flags like 🇺🇸 show 2 runes and 👨‍👨‍👧‍👧 is 7, which is probably not what you want. This thing नी is 2, and I̪͉̜̼̼̣̟̣ ̰̟̥̞̹c͈͔͇̼a̙̹̼̦̲̞n̙̺̳̟ͅ ̤̗d̘̭̙̪̦o̬̲̜̺ ̲̬̝t̺̖̗̩̱h̟̟̱i̹s̹̱ for over 60.

It's much better than bytes though, and worked for the Chinese phrases I threw at it!
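
For anyone who wants to reproduce those numbers, here is a minimal sketch using only the standard library. The strings are written with escapes so that they survive copy and paste, and the helper name counts is made up for illustration. Counting what a reader perceives as one character (a grapheme cluster) is not something the standard library offers, so this only shows bytes and runes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts reports the byte length and the rune count of s.
// Hypothetical helper for illustration only.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	samples := []struct{ name, text string }{
		// US flag: two regional indicator symbols
		{"flag", "\U0001F1FA\U0001F1F8"},
		// family emoji: four people joined by three zero-width joiners
		{"family", "\U0001F468\u200d\U0001F468\u200d\U0001F467\u200d\U0001F467"},
		// Devanagari "ni": a consonant plus a vowel sign
		{"ni", "\u0928\u0940"},
	}

	for _, sample := range samples {
		b, r := counts(sample.text)
		fmt.Printf("%-6s %s  bytes=%d runes=%d\n", sample.name, sample.text, b, r)
	}
}
```

The rune counts come out as 2, 7 and 2, matching the numbers above, even though each sample is displayed as a single symbol.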
