| Title |
Predicting the Out-of-Vocabulary Rate and the Required Vocabulary Size for Speech Processing Applications |
| Authors |
Johannes Müller, Holger Stahl, Manfred Lang |
| Abstract |
This paper describes an approach for predicting both the vocabulary size and
the resulting out-of-vocabulary rate (OOV-rate) for a hypothetical extension of
an existing text corpus. When the original corpus is split into two different
sub-corpora, the vocabulary size and the OOV-rate can be determined for that
particular split. The values are averaged over all combinations of sub-corpora,
and the averages can be approximated by analytic functions. These functions
allow easy prediction of the vocabulary size and the OOV-rate, with a relative
error below 4.6%.
Keywords: out-of-vocabulary rate, OOV-rate, vocabulary size, text corpus,
test corpus, training corpus |
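The measurement step described in the abstract can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function names `vocab_and_oov` and `average_over_splits` are hypothetical, the corpus is assumed to be a list of tokenized sentences, and random splits stand in for the paper's "all combinations of sub-corpora". The analytic approximation is not shown, because the abstract does not state its functional form.

```python
import random
from statistics import mean


def vocab_and_oov(train, test):
    """Return (vocabulary size of train, OOV-rate of test relative to train).

    Both arguments are lists of sentences; each sentence is a list of tokens.
    """
    vocab = {tok for sent in train for tok in sent}
    test_tokens = [tok for sent in test for tok in sent]
    if not test_tokens:
        raise ValueError("test sub-corpus is empty")
    # Count test tokens that never occur in the training sub-corpus.
    oov = sum(tok not in vocab for tok in test_tokens)
    return len(vocab), oov / len(test_tokens)


def average_over_splits(corpus, n_train, n_splits=100, seed=0):
    """Average vocabulary size and OOV-rate over random train/test splits.

    Assumption: random sampling of splits approximates an average over
    all combinations of sub-corpora with n_train training sentences.
    """
    if not 0 < n_train < len(corpus):
        raise ValueError("n_train must leave both sub-corpora non-empty")
    rng = random.Random(seed)
    results = []
    for _ in range(n_splits):
        shuffled = corpus[:]
        rng.shuffle(shuffled)
        results.append(vocab_and_oov(shuffled[:n_train], shuffled[n_train:]))
    sizes, rates = zip(*results)
    return mean(sizes), mean(rates)
```

Calling `average_over_splits` for increasing values of `n_train` gives the averaged points to which a curve could then be fitted and extrapolated beyond the size of the existing corpus.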
| Reference |
Proceedings ICSLP 96 (Philadelphia, USA, 1996), pp. 658-661 |
| Year |
1996 |
| Language |
English |