Chinese Keyword Extraction by Term Positions

WANG Jiayue  
Abstract: Keywords are the best content descriptors and are more effective than other index terms for information retrieval (IR) systems, especially now that rapidly growing information sources are bringing retrieval precision into focus. Statistics-based IR and keyword extraction (KE) systems view documents as bags of unordered words, treating all index terms as equally important without regard to their syntactic position. This paper tests the intuition that the syntactic position of Chinese nominal phrases is helpful for keyword extraction, and compares the results with those of KE based on text position, a more widely used dimension. Web pages can be treated in much the same way as normal text. Our investigation of some web search engines shows that their conceptions of relevance differ. Based on a detailed discussion of relevance, it is argued that the operability of system-oriented relevance has not been well linked to the rich achievements of user-oriented relevance studies. Topical relevance is taken to be the notion that web search engines ought to adopt, and it is the one assumed in the present research. Topic extraction based on human intuition is believed to be a promising direction worthy of effort: by extracting topic words, the subset of documents that really matches the user's information need can be clearly determined, unlike "standard" retrieval systems, which only decide which documents are possibly relevant. Provided that such human intuitions about relevance can be well described, topically relevant results can be successfully retrieved and the outcome of the IR system will be more satisfactory. We conducted a corpus-based study of the relations between (a) text position and keywordhood and (b) syntactic position and keywordhood.
Attention is focused on Base NPs, which are manually annotated in a collection of technical documents, with their text position (title, introduction/conclusion) and syntactic position (subject, verb complement, etc.) marked according to a pre-designed scheme. The statistical results of the first experiment showed a high correlation between the Base NPs' syntactic position and their potential to be keywords. Subsequent experiments confirmed the belief that text position is helpful for KE, whereas syntactic position did not appear to be, which led to the conclusion that text position is more valuable than syntactic position for KE.
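
The position-keywordhood relation described above can be illustrated with a minimal sketch, under stated assumptions. The records, the position labels and the helper `keyword_rate_by` are all hypothetical and invented for illustration; they are not the thesis corpus, its annotation scheme or its statistical method. The sketch simply computes, for each position label, the share of annotated Base NPs at that position that are keywords.

```python
from collections import defaultdict

# Hypothetical annotated Base NPs: (text_position, syntactic_position, is_keyword).
# These records are invented for illustration and are not the thesis data.
ANNOTATED_NPS = [
    ("title", "subject", True),
    ("title", "verb_complement", True),
    ("introduction", "subject", True),
    ("introduction", "verb_complement", False),
    ("body", "subject", False),
    ("body", "verb_complement", False),
    ("body", "subject", False),
    ("conclusion", "subject", True),
]


def keyword_rate_by(records, field_index):
    """Return {position label: share of Base NPs at that position that are keywords}.

    field_index selects the dimension: 0 for text position, 1 for syntactic position.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        label = record[field_index]
        totals[label] += 1
        if record[2]:  # third field is the keyword flag
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}


text_rates = keyword_rate_by(ANNOTATED_NPS, 0)
syntax_rates = keyword_rate_by(ANNOTATED_NPS, 1)

for name, rates in (("text position", text_rates), ("syntactic position", syntax_rates)):
    for label, rate in sorted(rates.items()):
        print(f"{name:18s} {label:16s} {rate:.2f}")
```

On a real annotated corpus, a dimension whose labels show widely separated keyword rates would be the more useful signal for KE; the toy numbers here carry no evidential weight.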

