Looking for Better Chinese Indexes: A Corpus-based Approach to Base NP Detection and Indexing

Hongbiao CHEN  
Abstract: Previous studies have shown that using phrases to represent a document's content can enhance the effectiveness of an automatic information retrieval (IR) system. However, among the few Chinese IR systems that have adopted a phrase indexing strategy, most do not have a real automatic phrase finder. They merely extract phrases by maximum matching against a pre-compiled dictionary. On the other hand, the phrases extracted by most current phrase extraction methods are too structurally complicated for indexing. This study proposes the use of the Chinese base noun phrase (baseNP) as a complex indexing unit. A relatively effective and easy-to-implement baseNP extraction method and a baseNP indexing method have been designed and tested. A Chinese baseNP is defined as a combination of conceptual words. A corpus-based approach is adopted to acquire the probabilities of words, tags and tag sequences in constituting baseNPs. Four detection algorithms have been designed and tested. The results show that 90.21% of the word combinations that contain good baseNPs can be extracted using word probability information alone. When combined with template checking, the hybrid method achieves a precision of 60.43% and a recall of 58.93%. Two kinds of index databases have been generated: one with single words only (i.e., the single word indexing method) and the other with single words supplemented with baseNPs (i.e., the baseNP indexing method). Retrieval experiments show that the baseNP indexing method increases retrieval precision by an average of 23.10% compared with the single word indexing method. It is concluded that the baseNP is a kind of complex index capable of enhancing Chinese IR system performance, and that the baseNP indexing method is more effective than the single word indexing method. The Chinese Experimental IR System (CEIRS 1.0) was developed and used as the experimental retrieval environment. The Vector Space Model (VSM) is adopted as the retrieval model.
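
The contrast between the two indexing methods can be illustrated with a minimal sketch. It is not the CEIRS 1.0 implementation: the function names, the toy documents, and the use of raw term-frequency weights with cosine similarity are all assumptions made for illustration, since the abstract does not specify the weighting scheme. Documents are assumed to be already segmented, with their baseNPs already detected.

```python
import math
from collections import Counter

def index_terms(words, base_nps=()):
    """Hypothetical helper: single words, optionally supplemented with baseNPs."""
    return list(words) + list(base_nps)

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy data (assumed): each entry is (segmented words, detected baseNPs).
# doc_a talks about "information retrieval systems"; doc_b mentions the same
# two words, but not as one phrase ("book retrieval and information management").
query = (["信息", "检索"], ["信息检索"])
doc_a = (["信息", "检索", "系统"], ["信息检索"])
doc_b = (["图书", "检索", "信息", "管理"], ["信息管理"])

def score(query, doc, use_base_np):
    """Score a document against the query under one of the two indexing methods."""
    q_words, q_nps = query
    d_words, d_nps = doc
    q_vec = Counter(index_terms(q_words, q_nps if use_base_np else ()))
    d_vec = Counter(index_terms(d_words, d_nps if use_base_np else ()))
    return cosine(q_vec, d_vec)

# Single word indexing versus baseNP indexing.
single_a = score(query, doc_a, use_base_np=False)
single_b = score(query, doc_b, use_base_np=False)
np_a = score(query, doc_a, use_base_np=True)
np_b = score(query, doc_b, use_base_np=True)

print(f"single word: doc_a={single_a:.3f} doc_b={single_b:.3f}")
print(f"baseNP:      doc_a={np_a:.3f} doc_b={np_b:.3f}")
```

In this toy setting both methods rank doc_a above doc_b, but the gap widens once baseNPs are added as index terms, because the phrase term matches only the document in which the two words actually form a phrase. The example says nothing about the size of the effect on real data.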

