Lecture Notes in Computer Science, 1998, Volume 1532/1998, 431-433, DOI: 10.1007/3-540-49292-5_56

TDDA, a Data Mining Tool for Text Databases: A Case History in a Lung Cancer Text Database

Jeffrey A. Goldman, Wesley Chu, D. Stott Parker and Robert M. Goldman

Abstract

In this paper, we give a case history illustrating the real-world application of a useful technique for data mining in text databases. The technique, Term Domain Distribution Analysis (TDDA), consists of keeping track of term frequencies for specific finite domains, and announcing significant differences from standard frequency distributions over these domains as a hypothesis. In the case study presented, the domain of terms was the pair {right, left}, over which we expected a uniform distribution. In analyzing term frequencies in a thoracic lung cancer database, the TDDA technique led to the surprising discovery that primary thoracic lung cancer tumors appear in the right lung more often than the left lung, with a ratio of 3:2. Treating the text discovery as a hypothesis, we verified this relationship against the medical literature in which primary lung tumor sites were reported, using a standard χ² statistic. We subsequently developed a working theoretical model of lung cancer that may explain the discovery.
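
The idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (domain_counts, chi_square_uniform, p_value_two_terms), the tokenization rule, and the three sample report strings are all assumptions made up for illustration, and the p-value helper is valid only for a two-term domain such as {right, left} (one degree of freedom).

```python
import math
import re
from collections import Counter

def domain_counts(texts, domain):
    """Count whole-word, case-insensitive occurrences of each domain term."""
    wanted = set(domain)
    counts = Counter({term: 0 for term in domain})
    for text in texts:
        # Hypothetical tokenizer: runs of ASCII letters, lowercased.
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in wanted:
                counts[token] += 1
    return counts

def chi_square_uniform(counts):
    """Pearson chi-square statistic against a uniform expected distribution."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no domain terms were observed")
    expected = total / len(counts)
    return sum((obs - expected) ** 2 / expected for obs in counts.values())

def p_value_two_terms(chi2):
    """Upper-tail p-value for 1 degree of freedom (two-term domains only)."""
    return math.erfc(math.sqrt(chi2 / 2))

# Made-up report snippets, NOT taken from the lung cancer database.
reports = [
    "Mass in the right upper lobe.",
    "Left lower lobe nodule.",
    "Right hilar mass, right pleural effusion.",
]

counts = domain_counts(reports, ["right", "left"])
chi2 = chi_square_uniform(counts)
p = p_value_two_terms(chi2)
print(dict(counts), chi2, p)
```

With only four term occurrences the toy sample is far too small to flag anything; a real run would accumulate counts over the whole database before comparing them with the expected distribution. As a purely hypothetical illustration of the scale involved, counts of 60 and 40 (a 3:2 ratio) would give a statistic of 4.0 under the uniform expectation.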
