Since I was curious about the PHP Foreign Function Interface (FFI), after the release of PHP 7.4.0 I decided to write PHP Stemmer, a library that provides a simple PHP interface to the Snowball stemming algorithms.
Some time later I thought I could build on that experience and have some more fun, so I wrote another interface to the Snowball stemming algorithms, this time with a different technology, Node.js. The result is Node Stemmer.
With Node.js there are a few options for getting something like PHP FFI, and I chose node-ffi-napi.
During the process I realized that Node.js FFI was harder for me to handle than PHP FFI, perhaps because using ref to turn Buffer instances into C pointers and vice versa assumes you know exactly what you are doing and, at the beginning, I certainly did not.
Anyway, the interesting thing I want to stress in this post is how much the string size matters when you are handling C pointers to strings.
Let’s say you have to deal with a header like the one below.
struct sb_stemmer;
typedef unsigned char sb_symbol;
const sb_symbol * sb_stemmer_stem(struct sb_stemmer * stemmer, const sb_symbol * word, int size);
Putting aside everything else (which I quoted only to give context), let’s focus on the last argument of sb_stemmer_stem(), that is size, an int that in this case represents the size in bytes of the word you want to stem.
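To make the word and size arguments concrete on the Node.js side, here is a minimal sketch that uses only the built-in Buffer class. The helper name toWordArgument is hypothetical and is not part of Node Stemmer or node-ffi-napi. It only shows that a Buffer carries both the bytes a C pointer would point to and their count.

```javascript
// Hypothetical helper: builds the (word, size) pair that sb_stemmer_stem() expects.
function toWordArgument(word) {
  // Encode the JavaScript string as UTF-8 bytes.
  const buffer = Buffer.from(word, 'utf8');
  // buffer.length is a number of bytes, which is what `size` must hold.
  return { buffer, size: buffer.length };
}

const wordArgument = toWordArgument('running');
console.log(wordArgument.size); // 7
```

For a plain ASCII word like this one, the byte count and the character count happen to be the same, which is exactly why the difference is easy to miss.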
The first 128 Unicode code points are encoded as 1 byte each in UTF-8, and this can be misleading when programming in PHP, especially if you are used to handling only Italian words: for such strings, the length in characters is often the same as the length in bytes.
So, let’s use Portuguese!
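The rule about UTF-8 widths can be checked with a short Node.js sketch. The function name utf8Widths is made up for this illustration. It reports how many bytes each code point takes once encoded, using the built-in Buffer.byteLength().

```javascript
// Number of bytes each code point occupies once encoded as UTF-8.
function utf8Widths(text) {
  // Spreading a string iterates by code point, not by UTF-16 code unit.
  return [...text].map((ch) => [ch, Buffer.byteLength(ch, 'utf8')]);
}

// U+0061 (a), U+00E7 (c with cedilla), U+20AC (euro sign), U+1F600 (an emoji)
const widths = utf8Widths('a\u00E7\u20AC\u{1F600}');
console.log(widths.map(([, bytes]) => bytes)); // [ 1, 2, 3, 4 ]
```

Only the first code point is below 128, so only that one fits in a single byte.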
<?php
$string = "atribuição";
echo strlen($string); // 12
echo iconv_strlen($string); // 10
As you can see by reading the documentation, strlen() returns the length of a string in bytes. So if you need the length in characters, you need to use iconv_strlen().
But what about JavaScript?
const string = "atribuição";
console.log(string.length); // 10
console.log(new Blob([string]).size); // 12
It sounds a little counterintuitive compared to PHP (even if it is probably more intuitive in general terms…), but if you read the length of a JavaScript string, you’ll get its length in characters (more precisely, in UTF-16 code units, which coincide with characters for a word like this one).
If you need its length in bytes, you need to rely on Blob().
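In Node.js there are also built-in alternatives to Blob, namely Buffer.byteLength() and TextEncoder. The sketch below uses them and then shows what goes wrong when the character count is passed where the byte count is expected. The readBytes function is a hypothetical stand-in for a C function that reads size bytes from a pointer. It is not the real sb_stemmer_stem().

```javascript
const word = 'atribui\u00E7\u00E3o'; // "atribuição"

// Two Node.js-native ways to get the UTF-8 byte length.
const byteLength = Buffer.byteLength(word, 'utf8');
const encodedLength = new TextEncoder().encode(word).length;
console.log(word.length, byteLength, encodedLength); // 10 12 12

// Hypothetical stand-in for a C function that reads `size` bytes from a pointer.
function readBytes(buffer, size) {
  return buffer.subarray(0, size).toString('utf8');
}

const bytes = Buffer.from(word, 'utf8');
console.log(readBytes(bytes, byteLength) === word);  // true: the whole word is read
console.log(readBytes(bytes, word.length) === word); // false: the word is cut short
```

With a size of 10 the read stops in the middle of the two-byte sequence for the second accented letter, so the native side would see a truncated and invalid word.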