An evaluation of feature sets and sampling techniques for de-identification of medical records

Title: An evaluation of feature sets and sampling techniques for de-identification of medical records
Publication Type: Conference Paper
Year of Publication: 2010
Authors: Gardner, J., L. Xiong, F. Wang, A. Post, J. Saltz, and T. Grandison
Conference Name: Proceedings of the 1st ACM International Health Informatics Symposium
Publisher: ACM
Keywords: Conditional random fields, De-identification, HIDE, Medical text
Abstract

De-identification of textual medical records is of critical importance in any health informatics system in order to facilitate research and sharing of medical records. In this paper, we present the Health Information DE-identification (HIDE) framework and evaluate the open-source software. We present an evaluation of various types of features used in HIDE, and introduce a window sampling technique (only the terms within a specified distance from personal health information are used to train the classifier) and evaluate its effect on both quality and efficiency. Our results show that the context features (previous and next terms) are particularly important and the sampling technique can be used to increase recall with minimal impact on precision. We obtained token-level label precision of 0.967, recall of 0.986 and F-score of 0.977 when not including true negatives. The overall HIDE system achieves token-level precision of 0.998, recall of 0.999, and F-score of 0.999 on the previous i2b2 challenge task.
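
The window sampling idea described in the abstract can be illustrated with a minimal sketch. It is not the HIDE implementation: the function name `window_sample`, the label scheme (`"O"` for non-PHI tokens), and the example sentence are all assumptions made for illustration. The sketch keeps only the token positions that lie within a given distance of a token labeled as personal health information.

```python
from typing import List, Sequence


def window_sample(labels: Sequence[str], window: int, outside: str = "O") -> List[int]:
    """Return the sorted indices of tokens within `window` positions of a PHI token.

    `labels` holds one label per token; any label other than `outside`
    is treated as personal health information (an assumption of this sketch).
    """
    n = len(labels)
    keep = [False] * n
    for i, label in enumerate(labels):
        if label != outside:
            # Mark the PHI token and its neighbours on both sides.
            lo = max(0, i - window)
            hi = min(n, i + window + 1)
            for j in range(lo, hi):
                keep[j] = True
    return [i for i, flag in enumerate(keep) if flag]


# Hypothetical example sentence with hand-assigned labels.
tokens = ["Patient", "John", "Smith", "was", "admitted", "to",
          "the", "ward", "on", "Monday", "."]
labels = ["O", "NAME", "NAME", "O", "O", "O", "O", "O", "O", "DATE", "O"]

kept = window_sample(labels, window=1)
sampled_tokens = [tokens[i] for i in kept]
```

With `window=1`, the tokens far from any labeled term ("admitted", "to", "the", "ward") are dropped from the training sample, which shrinks the training set while keeping the immediate context of each labeled term.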

URL: http://www.mathcs.emory.edu/~lxiong/research/pub/ihi2010.pdf