Actually, you don't need to. Just tell them it expects UTF-8 encoded
strings and be done with it. All US-ASCII characters, 0 through 127
(i.e. high bit clear), are encoded exactly the same in UTF-8, and
every byte of a multi-byte (non-ASCII) UTF-8 character has the high
bit set. Therefore you just assume that any byte with the high bit
set is part of a word, and you can handle basic UTF-8. (This doesn't
work for the non-ASCII UTF-8 space characters, such as non-breaking
space, but handling those is significantly more complicated.)
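
For illustration, here is a rough sketch of that rule in C. The names
is_word_byte() and count_words() are made up for this example, and the
choice of ASCII letters, digits and underscore as word characters is
only an assumption; substitute whatever your code already treats as a
word character.

```c
#include <stddef.h>

/* Hypothetical helper: nonzero if byte c should be treated as part of a word. */
static int is_word_byte(unsigned char c)
{
	/* High bit set: lead or continuation byte of a multi-byte UTF-8 character. */
	if (c & 0x80)
		return 1;

	/* Assumed ASCII word characters: letters, digits, underscore. */
	return (c >= '0' && c <= '9') ||
	       (c >= 'A' && c <= 'Z') ||
	       (c >= 'a' && c <= 'z') ||
	       c == '_';
}

/* Count the words in a NUL-terminated string that is assumed to be UTF-8. */
size_t count_words(const char *s)
{
	size_t words = 0;
	int in_word = 0;

	for (; *s; s++) {
		int w = is_word_byte((unsigned char)*s);

		/* A word starts wherever a word byte follows a non-word byte. */
		if (w && !in_word)
			words++;
		in_word = w;
	}
	return words;
}
```

Note that a UTF-8 non-breaking space (bytes 0xC2 0xA0) has the high bit
set in both bytes, so this sketch glues the words on either side of it
together, which is exactly the limitation mentioned above.
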
Cheers,
Kyle Moffett