Hello,
As it stands, the code cannot correctly read a UTF-8 vocab from a word2vec binary model created using the C version of word2vec.
This is because each byte of a vocab entry is appended to a string buffer as if it were a complete character, which mangles any character that UTF-8 encodes in more than one byte.
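To illustrate the failure mode, here is a small standalone sketch. It is not the library's code: the class and method names are made up, and the byte-to-char cast is only assumed to resemble what the current reader does.

```java
import java.nio.charset.StandardCharsets;

public class ByteAsCharDemo {
    // Assumed shape of the problematic pattern: every byte is appended
    // as if it were one character.
    static String decodeByteAsChar(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    // Decodes the whole byte sequence as UTF-8 in one step.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u00fcber" is "uber" with a u-umlaut; the umlaut takes two bytes in UTF-8.
        String expected = "\u00fcber";
        byte[] word = expected.getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeByteAsChar(word).equals(expected)); // false
        System.out.println(decodeUtf8(word).equals(expected));       // true
    }
}
```

The byte-wise version yields five chars for a four-character word, so the vocab entry no longer matches the original token.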
A workaround like the following in the fromBinFile() method of Word2VecModel.java gets around the issue, and it should still work for single-byte characters:
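// scratch buffer for the raw bytes of one vocab entry (assumes entries fit in 1024 bytes)
byte[] buff = new byte[1024];
for (int lineno = 0; lineno < vocabSize; lineno++) {
    // read vocab: collect raw bytes up to the separating space, skipping newlines
    int bpos = 0;
    byte b = buffer.get();
    while (b != ' ') {
        if (b != '\n') {
            buff[bpos++] = b;
        }
        b = buffer.get();
    }
    // decode the whole entry at once so multi-byte sequences stay intact
    // (needs import java.nio.charset.StandardCharsets)
    vocabs.add(new String(buff, 0, bpos, StandardCharsets.UTF_8));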
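For anyone who wants to try the idea outside the library, below is a self-contained sketch of the same approach. The class name `VocabWordReader`, the method `readWord`, and the sample input are hypothetical. It swaps the fixed 1024-byte array for a `ByteArrayOutputStream` so that long entries cannot overflow, and it leaves out the vector data that follows each word in a real model file.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class VocabWordReader {
    // Reads bytes up to (and consuming) the next space, skips '\n',
    // then decodes the collected bytes as UTF-8 in one step.
    static String readWord(ByteBuffer buffer) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        byte b = buffer.get();
        while (b != ' ') {
            if (b != '\n') {
                bytes.write(b);
            }
            b = buffer.get();
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical input: two entries with accented letters, each followed by a space.
        byte[] data = "\nna\u00efve caf\u00e9 ".getBytes(StandardCharsets.UTF_8);
        ByteBuffer buffer = ByteBuffer.wrap(data);
        System.out.println(readWord(buffer));
        System.out.println(readWord(buffer));
    }
}
```

Like the snippet above, this throws `BufferUnderflowException` if the buffer ends before a space is found, so a real fix would probably want to handle truncated files explicitly.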