Accuracy rate seems to be 20% lower than the original C version #40

@hankcs

Hello, dear Medallia staff.
Thank you for your nice Java code. It is beautiful and neat, but it does not seem to be accurate.

I computed the analogy accuracy, and it is about 20 percentage points lower than with the original C version.
I trained both on text8 with the same training parameters, which are:

Java

File f = new File("text8");
if (!f.exists())
    throw new IllegalStateException("Please download and unzip the text8 example from http://mattmahoney.net/dc/text8.zip");
List<String> read = Common.readToList(f);
List<List<String>> partitioned = Lists.transform(read, new Function<String, List<String>>() {
    @Override
    public List<String> apply(String input) {
        return Arrays.asList(input.split(" "));
    }
});

Word2VecModel model = Word2VecModel.trainer()
        .setMinVocabFrequency(5)
        .useNumThreads(20)
        .setWindowSize(8)
        .type(NeuralNetworkType.CBOW)
        .setLayerSize(200)
        .useNegativeSamples(25)
        .setDownSamplingRate(1e-4)
        .setNumIterations(15)
        .setListener(new TrainingProgressListener() {
            @Override
            public void update(Stage stage, double progress) {
                System.out.println(String.format("%s is %.2f%% complete", Format.formatEnum(stage), progress * 100));
            }
        })
        .train(partitioned);

try (final OutputStream os = Files.newOutputStream(Paths.get("vectors.bin"))) {
    model.toBinFile(os);
}

C

./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 8 -binary 1 -iter 15
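To double-check that the two runs really line up, here is a small sketch that puts the settings side by side. The mapping from builder methods to C flags is my own reading of the names (an assumption), and the class and method names are only illustrative. Flags that appear on one side only (-hs, -binary, and the minimum vocabulary frequency) are left out, since the other side relies on its default there.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ParameterParity {
    /** Values from the Java builder calls above, keyed by the assumed matching C flag. */
    static Map<String, String> javaSettings() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("-cbow", "1");        // type(NeuralNetworkType.CBOW)
        m.put("-size", "200");      // setLayerSize(200)
        m.put("-window", "8");      // setWindowSize(8)
        m.put("-negative", "25");   // useNegativeSamples(25)
        m.put("-sample", "1e-4");   // setDownSamplingRate(1e-4)
        m.put("-threads", "20");    // useNumThreads(20)
        m.put("-iter", "15");       // setNumIterations(15)
        return m;
    }

    /** Values from the ./word2vec command line above. */
    static Map<String, String> cSettings() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("-cbow", "1");
        m.put("-size", "200");
        m.put("-window", "8");
        m.put("-negative", "25");
        m.put("-sample", "1e-4");
        m.put("-threads", "8");
        m.put("-iter", "15");
        return m;
    }

    /** Lists every flag present in both maps whose values differ. */
    static List<String> differences(Map<String, String> java, Map<String, String> c) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : java.entrySet()) {
            String other = c.get(e.getKey());
            if (other != null && !other.equals(e.getValue())) {
                out.add(e.getKey() + ": Java=" + e.getValue() + ", C=" + other);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints "-threads: Java=20, C=8"
        differences(javaSettings(), cSettings()).forEach(System.out::println);
    }
}
```

So the only explicit difference between the two runs is the thread count (20 for Java, 8 for C).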

I used the same evaluation program and test file for both models:

./compute-accuracy vectors.bin 30000 < questions-words.txt
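For context, compute-accuracy answers each question line "a b c d" by forming vec(b) - vec(a) + vec(c) over length-normalized vectors and checking whether the nearest word, excluding a, b and c, is d. The 30000 argument restricts the search to the first 30000 words in the vector file. Below is a toy sketch of that check as I understand it; the 3-d vectors are made up and the names are hypothetical, so this is not the actual tool.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AnalogyCheck {
    /** Scales a vector to unit length, as the tool does when loading vectors. */
    static double[] normalize(double[] v) {
        double sumSquares = 0.0;
        for (double x : v) sumSquares += x * x;
        double len = Math.sqrt(sumSquares);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] / len;
        return out;
    }

    static double dot(double[] x, double[] y) {
        double s = 0.0;
        for (int i = 0; i < x.length; i++) s += x[i] * y[i];
        return s;
    }

    /** Returns the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c. */
    static String answer(Map<String, double[]> vectors, String a, String b, String c) {
        double[] va = vectors.get(a), vb = vectors.get(b), vc = vectors.get(c);
        double[] query = new double[va.length];
        for (int i = 0; i < query.length; i++) query[i] = vb[i] - va[i] + vc[i];

        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : vectors.entrySet()) {
            String w = e.getKey();
            if (w.equals(a) || w.equals(b) || w.equals(c)) continue; // question words are skipped
            double score = dot(query, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = w;
            }
        }
        return best;
    }

    /** Made-up vectors, chosen only so that the analogy works out. */
    static Map<String, double[]> toyVectors() {
        Map<String, double[]> m = new LinkedHashMap<>();
        m.put("man", normalize(new double[] {1.0, 0.0, 0.0}));
        m.put("woman", normalize(new double[] {0.0, 1.0, 0.0}));
        m.put("king", normalize(new double[] {0.8, 0.0, 0.6}));
        m.put("queen", normalize(new double[] {0.0, 0.8, 0.6}));
        m.put("table", normalize(new double[] {0.0, 0.0, 1.0}));
        return m;
    }

    public static void main(String[] args) {
        // Question "man woman king queen": counted as correct if the answer is "queen".
        System.out.println(answer(toyVectors(), "man", "woman", "king")); // prints "queen"
    }
}
```

A question counts towards "ACCURACY TOP1" only when the single best candidate matches d, and questions containing a word outside the restricted vocabulary are skipped, which is why "Questions seen" is lower than the total.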

Your Java implementation:

capital-common-countries:
ACCURACY TOP1: 58.30 %  (295 / 506)
Total accuracy: 58.30 %   Semantic accuracy: 58.30 %   Syntactic accuracy: nan % 
capital-world:
ACCURACY TOP1: 36.78 %  (534 / 1452)
Total accuracy: 42.34 %   Semantic accuracy: 42.34 %   Syntactic accuracy: nan % 
currency:
ACCURACY TOP1: 12.69 %  (34 / 268)
Total accuracy: 38.77 %   Semantic accuracy: 38.77 %   Syntactic accuracy: nan % 
city-in-state:
ACCURACY TOP1: 25.21 %  (396 / 1571)
Total accuracy: 33.16 %   Semantic accuracy: 33.16 %   Syntactic accuracy: nan % 
family:
ACCURACY TOP1: 55.23 %  (169 / 306)
Total accuracy: 34.80 %   Semantic accuracy: 34.80 %   Syntactic accuracy: nan % 
gram1-adjective-to-adverb:
ACCURACY TOP1: 8.07 %  (61 / 756)
Total accuracy: 30.64 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 8.07 % 
gram2-opposite:
ACCURACY TOP1: 9.48 %  (29 / 306)
Total accuracy: 29.39 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 8.47 % 
gram3-comparative:
ACCURACY TOP1: 38.25 %  (482 / 1260)
Total accuracy: 31.13 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 24.63 % 
gram4-superlative:
ACCURACY TOP1: 23.91 %  (121 / 506)
Total accuracy: 30.60 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 24.50 % 
gram5-present-participle:
ACCURACY TOP1: 22.08 %  (219 / 992)
Total accuracy: 29.53 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 23.87 % 
gram6-nationality-adjective:
ACCURACY TOP1: 63.17 %  (866 / 1371)
Total accuracy: 34.50 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 34.25 % 
gram7-past-tense:
ACCURACY TOP1: 26.35 %  (351 / 1332)
Total accuracy: 33.47 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 32.64 % 
gram8-plural:
ACCURACY TOP1: 44.25 %  (439 / 992)
Total accuracy: 34.39 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 34.17 % 
gram9-plural-verbs:
ACCURACY TOP1: 18.15 %  (118 / 650)
Total accuracy: 33.53 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 32.90 % 
Questions seen / total: 12268 19544   62.77 % 

Original C implementation:

capital-common-countries:
ACCURACY TOP1: 82.81 %  (419 / 506)
Total accuracy: 82.81 %   Semantic accuracy: 82.81 %   Syntactic accuracy: nan % 
capital-world:
ACCURACY TOP1: 62.26 %  (904 / 1452)
Total accuracy: 67.57 %   Semantic accuracy: 67.57 %   Syntactic accuracy: nan % 
currency:
ACCURACY TOP1: 23.13 %  (62 / 268)
Total accuracy: 62.22 %   Semantic accuracy: 62.22 %   Syntactic accuracy: nan % 
city-in-state:
ACCURACY TOP1: 44.68 %  (702 / 1571)
Total accuracy: 54.96 %   Semantic accuracy: 54.96 %   Syntactic accuracy: nan % 
family:
ACCURACY TOP1: 75.82 %  (232 / 306)
Total accuracy: 56.52 %   Semantic accuracy: 56.52 %   Syntactic accuracy: nan % 
gram1-adjective-to-adverb:
ACCURACY TOP1: 17.20 %  (130 / 756)
Total accuracy: 50.40 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 17.20 % 
gram2-opposite:
ACCURACY TOP1: 21.90 %  (67 / 306)
Total accuracy: 48.71 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 18.55 % 
gram3-comparative:
ACCURACY TOP1: 64.60 %  (814 / 1260)
Total accuracy: 51.83 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 43.54 % 
gram4-superlative:
ACCURACY TOP1: 39.72 %  (201 / 506)
Total accuracy: 50.95 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 42.86 % 
gram5-present-participle:
ACCURACY TOP1: 39.52 %  (392 / 992)
Total accuracy: 49.51 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 41.99 % 
gram6-nationality-adjective:
ACCURACY TOP1: 87.24 %  (1196 / 1371)
Total accuracy: 55.08 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 53.94 % 
gram7-past-tense:
ACCURACY TOP1: 38.21 %  (509 / 1332)
Total accuracy: 52.96 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 50.73 % 
gram8-plural:
ACCURACY TOP1: 67.54 %  (670 / 992)
Total accuracy: 54.21 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 52.95 % 
gram9-plural-verbs:
ACCURACY TOP1: 37.38 %  (243 / 650)
Total accuracy: 53.32 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 51.71 % 
Questions seen / total: 12268 19544   62.77 %
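To make the gap concrete: the "Total accuracy" column is cumulative, so the last value in each run is the overall figure. Here is a minimal sketch that recomputes it from the per-section (correct / total) counts printed above; the class and method names are only illustrative.

```java
class AccuracyGap {
    // Per-section counts copied from the "ACCURACY TOP1" lines above, in order.
    static final int[] TOTAL =
            {506, 1452, 268, 1571, 306, 756, 306, 1260, 506, 992, 1371, 1332, 992, 650};
    static final int[] JAVA_CORRECT =
            {295, 534, 34, 396, 169, 61, 29, 482, 121, 219, 866, 351, 439, 118};
    static final int[] C_CORRECT =
            {419, 904, 62, 702, 232, 130, 67, 814, 201, 392, 1196, 509, 670, 243};

    static int sum(int[] xs) {
        int s = 0;
        for (int x : xs) s += x;
        return s;
    }

    /** Overall accuracy in percent: all correct answers over all questions seen. */
    static double overallPercent(int[] correct, int[] total) {
        return 100.0 * sum(correct) / sum(total);
    }

    public static void main(String[] args) {
        double java = overallPercent(JAVA_CORRECT, TOTAL);
        double c = overallPercent(C_CORRECT, TOTAL);
        // prints "Java: 33.53 %, C: 53.32 %, gap: 19.78 points"
        System.out.printf(java.util.Locale.ROOT,
                "Java: %.2f %%, C: %.2f %%, gap: %.2f points%n", java, c, c - java);
    }
}
```

That is 4114 versus 6541 correct answers out of the same 12268 questions, a gap of just under 20 percentage points.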

Do you have any suggestions or ideas about what might cause this? I am happy to help if needed.

Thank you.
