Looking for Better Chinese Indexes: A Corpus-based Approach to Base NP Detection and Indexing
Abstract: Previous studies have shown that the use of phrases to represent a document's
content can enhance the effectiveness of an automatic information retrieval (IR)
system. However, among the few Chinese IR systems that have adopted a phrase
indexing strategy, most lack a true automatic phrase finder. They merely
extract phrases by means of maximum matching against a pre-compiled dictionary.
On the other hand, the structures of the phrases extracted by most current phrase
extraction methods are too complicated for indexing.
This study proposes the use of the Chinese base noun phrase (baseNP) as a
complex indexing unit. A relatively effective and easy-to-implement baseNP
extraction method and a baseNP indexing method have been designed and tested.
A Chinese baseNP is defined as a combination of conceptual words. A
corpus-based approach is adopted to acquire the probabilities of words, tags and
tag sequences in constituting baseNPs. Four detection algorithms have been
designed and tested. The results show that 90.21% of the word combinations that
contain good baseNPs can be extracted using word probability
information alone. When template checking is added, the hybrid method
achieves a precision of 60.43% and a recall of 58.93%.
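
As a rough illustration of word-probability-based detection and of how precision and recall are computed, the following is a minimal sketch, not the method evaluated in this study. The function names, the 0.5 threshold, the rule that a candidate needs at least two words, and the toy corpus are all assumptions made for the example. Tag and tag-sequence probabilities and template checking are omitted.

```python
from collections import Counter

def word_probabilities(annotated_corpus):
    """Estimate P(word occurs inside a baseNP) from (word, in_basenp) pairs."""
    total, inside = Counter(), Counter()
    for sentence in annotated_corpus:
        for word, in_basenp in sentence:
            total[word] += 1
            if in_basenp:
                inside[word] += 1
    return {w: inside[w] / total[w] for w in total}

def detect_candidates(words, probs, threshold=0.5):
    """Return maximal runs of two or more words whose probability meets the threshold.

    The threshold and the two-word minimum are assumptions of this sketch.
    """
    candidates, run = [], []
    for w in list(words) + [None]:  # the None sentinel flushes the final run
        if w is not None and probs.get(w, 0.0) >= threshold:
            run.append(w)
        else:
            if len(run) >= 2:
                candidates.append(tuple(run))
            run = []
    return candidates

def precision_recall(extracted, gold):
    """Precision = correct / extracted; recall = correct / gold."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Toy annotated corpus (hypothetical data, for illustration only).
corpus = [
    [("信息", True), ("检索", True), ("系统", True), ("的", False), ("性能", False)],
    [("提高", False), ("检索", True), ("系统", True)],
]
probs = word_probabilities(corpus)
found = detect_candidates(["提高", "信息", "检索", "的", "性能"], probs)
```

Here `found` holds the single candidate `("信息", "检索")`. Passing the extracted candidates and a hand-annotated gold list to `precision_recall` gives the two figures of the kind reported above.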
Two kinds of index databases have been generated: one containing single
words only (i.e., the single word indexing method) and the other containing single
words supplemented with baseNPs (i.e., the baseNP indexing method).
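
The difference between the two index databases can be sketched with a simple inverted index. This is an assumed structure for illustration, not the actual CEIRS storage format. The function name `build_index`, the use of tuples as keys for baseNPs, and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_index(docs, basenps=None):
    """Build an inverted index mapping each term to the set of document ids.

    docs:    {doc_id: [segmented words]}
    basenps: optional {doc_id: [tuple of words forming a baseNP]};
             when given, each baseNP is indexed as an extra term.
    """
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for w in words:
            index[w].add(doc_id)          # single-word terms
        for phrase in (basenps or {}).get(doc_id, []):
            index[tuple(phrase)].add(doc_id)  # baseNP terms, keyed by word tuple
    return dict(index)

# Hypothetical segmented documents and extracted baseNPs.
docs = {1: ["信息", "检索", "系统"], 2: ["检索", "性能"]}
basenps = {1: [("信息", "检索", "系统")]}

single_word_index = build_index(docs)       # single word indexing method
basenp_index = build_index(docs, basenps)   # baseNP indexing method
```

In this sketch the baseNP index is a superset of the single-word index: every single-word posting is kept, and each baseNP adds one more term that only matches documents containing the whole phrase.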
Retrieval experiments show that the baseNP indexing method
increases retrieval precision by 23.10% on average compared with the single
word indexing method. It is concluded that the baseNP is a kind of complex index
capable of enhancing Chinese IR system performance and that the baseNP indexing
method is more effective than the single word indexing method.
The Chinese Experimental IR System (CEIRS 1.0) was developed and used
as the experimental retrieval environment. The Vector Space Model (VSM) is adopted
as the retrieval model.
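
For readers unfamiliar with the Vector Space Model, the sketch below shows its core ranking step: cosine similarity between a query vector and a document vector. It assumes raw term-frequency weights, which may differ from the weighting used in CEIRS 1.0. The function name `cosine_similarity` is chosen for this example.

```python
import math
from collections import Counter

def cosine_similarity(query_terms, doc_terms):
    """Cosine of the angle between term-frequency vectors of a query and a document."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)  # Counter returns 0 for missing terms
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

# Index terms may be single words or baseNPs; any hashable term works.
score = cosine_similarity(["检索", "系统"], ["信息", "检索", "系统"])
```

Documents are ranked by this score in descending order. Under baseNP indexing, a matching baseNP term adds to the dot product on top of the contributions of its component words.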