| target_vocabulary_type=word | ||
| ; Vocabulary size in each side. | ||
| unk_frequency=0 |
This option needs a detailed description.
I also think it currently carries multiple meanings:
- choosing the filtering strategy, either by frequency or by rank,
- specifying the unknown-word threshold for both the source and target languages.
Maybe these could be split into separate options, each with a single meaning. For example:
unk_filter_type=frequency/rank
source_unk_frequency=3 (only used when type=frequency)
target_unk_frequency=4 (ditto)
source_vocabulary_size=4100 (only used when type=rank)
target_vocabulary_size=4900 (ditto)
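To make the proposed split concrete, here is a minimal C++ sketch of how the separated options might be represented once read from the config file. All names (`UnkFilterType`, `VocabularyOptions`, `parseUnkFilterType`, `sourceThreshold`) are hypothetical and do not exist in the code under review; the sketch only illustrates that each field has one meaning and that the filter type decides which field is consulted.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical names for illustration only; not part of the reviewed code.
enum class UnkFilterType { Frequency, Rank };

// One field per meaning, mirroring the option names suggested above.
struct VocabularyOptions {
  UnkFilterType unk_filter_type;
  unsigned source_unk_frequency;    // used only when type is Frequency
  unsigned target_unk_frequency;    // ditto
  unsigned source_vocabulary_size;  // used only when type is Rank
  unsigned target_vocabulary_size;  // ditto
};

// Maps the config string to the enum and rejects any other value.
inline UnkFilterType parseUnkFilterType(const std::string & value) {
  if (value == "frequency") return UnkFilterType::Frequency;
  if (value == "rank") return UnkFilterType::Rank;
  throw std::invalid_argument(
      "unk_filter_type must be 'frequency' or 'rank': " + value);
}

// Returns the single threshold that is meaningful for the source side.
inline unsigned sourceThreshold(const VocabularyOptions & opts) {
  return opts.unk_filter_type == UnkFilterType::Frequency
      ? opts.source_unk_frequency
      : opts.source_vocabulary_size;
}
```

With this layout, an invalid `unk_filter_type` fails loudly at parse time instead of silently changing how a numeric option is interpreted.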
| target_vocabulary_type=word | ||
| source_vocabulary_size=30 | ||
| unk_frequency=0 | ||
| source_vocabulary_size=33 |
| @@ -183,11 +184,12 @@ void initializeLogger( | |||
| nmtkit::Vocabulary * createVocabulary( | |||
I think each parameter should basically have only one meaning, to prevent misuse. CharacterVocabulary and WordVocabulary could take one more parameter that chooses the unk filtering strategy (as specified in the config file), so that unk_frequency does not take on additional meanings.
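As a rough illustration of the extra-parameter idea, the sketch below selects vocabulary entries from precomputed word counts, with the strategy passed explicitly. `UnkFilter` and `selectWords` are hypothetical names, not the constructors in this PR, and the sketch ignores any IDs a real vocabulary reserves for special tokens.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical strategy selector; not part of the reviewed code.
enum class UnkFilter { Frequency, Rank };

// Returns the kept words, most frequent first (ties in alphabetical order).
//   Frequency: keeps words seen at least `threshold` times.
//   Rank:      keeps at most `threshold` of the most frequent words.
inline std::vector<std::string> selectWords(
    const std::map<std::string, unsigned> & counts,
    UnkFilter filter,
    unsigned threshold) {
  using Entry = std::pair<std::string, unsigned>;
  // std::map iterates in key order, so a stable sort by count keeps
  // alphabetical order among words with equal counts.
  std::vector<Entry> entries(counts.begin(), counts.end());
  std::stable_sort(entries.begin(), entries.end(),
      [](const Entry & a, const Entry & b) { return a.second > b.second; });

  std::vector<std::string> kept;
  for (const Entry & e : entries) {
    // Entries are sorted by descending count, so stopping early is safe.
    if (filter == UnkFilter::Frequency && e.second < threshold) break;
    if (filter == UnkFilter::Rank && kept.size() >= threshold) break;
    kept.push_back(e.first);
  }
  return kept;
}
```

Here `threshold` still means different things per strategy, but the meaning is chosen by an explicit argument rather than inferred from the value of `unk_frequency`.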
| WordVocabulary::WordVocabulary(const string & corpus_filename, unsigned size) { | ||
| NMTKIT_CHECK(size >= 3, "Size should be equal or greater than 3."); | ||
| WordVocabulary::WordVocabulary(const string & corpus_filename, unsigned unk_frequency, unsigned size) { |
Could you add some test code in src/test/word_vocabulary_test.cc for unk_frequency?
| CharacterVocabulary::CharacterVocabulary( | ||
| const string & corpus_filename, | ||
| unsigned unk_frequency, |
Could you add some test code in src/test/character_vocabulary_test.cc for unk_frequency?
Add the option unk_frequency to adjust vocabulary size.
in a configuration file,
to ignore this option.
When it is set to n, any word with a frequency under n is treated as an unknown word.
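The thresholding rule described here can be sketched in a few lines of standalone C++. This is not the PR's implementation: `countWords`, `replaceRareWords`, and the literal `<unk>` token are assumptions made for illustration, and the text is tokenized on whitespace only.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Counts whitespace-separated words read from a stream.
inline std::map<std::string, unsigned> countWords(std::istream & corpus) {
  std::map<std::string, unsigned> counts;
  std::string word;
  while (corpus >> word) ++counts[word];
  return counts;
}

// Replaces every word whose frequency in `text` is under n with "<unk>"
// (an assumed token name). With n == 0 no word is ever replaced.
inline std::string replaceRareWords(const std::string & text, unsigned n) {
  std::istringstream counting(text);
  const std::map<std::string, unsigned> counts = countWords(counting);

  std::istringstream reading(text);
  std::ostringstream out;
  std::string word;
  bool first = true;
  while (reading >> word) {
    if (!first) out << ' ';
    first = false;
    out << (counts.at(word) < n ? std::string("<unk>") : word);
  }
  return out.str();
}
```

For example, in the text `a b a c a b` the counts are a=3, b=2, c=1, so `replaceRareWords("a b a c a b", 2)` replaces only `c`.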