Useful Features for Abbreviation Disambiguation in Biomedical Domain
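- Publisher: Korea University Graduate School
- Year of publication: 2004
- Degree: Doctorate
- Department: Graduate School, Department of Computer Science
- Identifier (other): DL:000014914321
- Bibliographic control number: 000045213479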
Abstract/Summary
The recent growth of biomedicine has led to an explosive generation of biomedical data, including a vast amount of research results. Retrieving the desired information from these data is difficult, however, because most of them are natural-language text, and natural language processing techniques are required to extract meaningful information from them. Since biomedical texts contain many technical terms, semantic analysis of the terminology is needed to understand the texts. Abbreviations are central to this terminology, since many terms are frequently written as various abbreviations. Most abbreviations in biomedicine are associated with two or more long forms, and these long forms differ in meaning. Hence, semantic analysis of biomedical texts requires identifying the correct long form of each abbreviation. Most previous work on abbreviation disambiguation was based on machine learning approaches and automatically constructed training data for learning and test data for evaluation. However, the previous work has several drawbacks: the features used for abbreviation disambiguation are naive, and their contribution to disambiguation has not been evaluated. In this thesis, we define various kinds of features and evaluate the contribution of each. Furthermore, we investigate the effect of combining the individual features. Finally, useful features are identified through several experiments. Our experiments proceed as follows. First, training data and test data containing 15 abbreviations are automatically constructed from biomedical text data (MEDLINE). Second, contexts for abbreviation disambiguation are built with the 10 features from both data sets. Finally, a machine learning t