Cross-Language Information Retrieval (CLIR) resources, such as dictionaries and parallel corpora, are scarce for special domains.
Obtaining comparable corpora automatically for such domains could be an answer to this problem. The Web, with its vast volumes
of data, offers a natural source for this. We experimented with focused crawling as a means to acquire comparable corpora
in the genomics domain. The acquired corpora were used to statistically translate domain-specific words. Translating the same
words with a high-quality but non-genomics parallel corpus gave considerably worse results. We also evaluated
our system with standard information retrieval (IR) experiments, combining statistical translation using the Web corpora with
dictionary-based translation. The results showed improvement over pure dictionary-based translation. Therefore, mining the
Web for comparable corpora seems promising.
Keywords: Cross-language information retrieval · Focused crawling · Comparable corpora