There is growing security concern about the increasing number of databases that are accessible through the Internet. Such databases
may contain sensitive information like credit card numbers and personal medical histories. Many e-service providers are reported
to be leaking customers’ information through their websites. Attackers use SQL injection techniques to exploit poorly coded programs
that interface with backend databases. We developed an architectural framework, DIDAFIT (Detecting Intrusions in
DAtabases through FIngerprinting Transactions) [1], that can efficiently detect illegitimate database accesses. The system works by matching SQL statements against a known
set of legitimate database transaction fingerprints. In this paper, we explore the various issues that arise in the collation,
representation and summarization of this potentially huge set of legitimate transaction fingerprints. We describe an algorithm
that summarizes the raw transactional SQL queries into compact regular expressions. Incoming database transactions can then be
matched efficiently against this representation. A set of heuristics is used during the summarization process to ensure
that the level of false negatives remains low. The algorithm also accounts for incomplete logs and heuristically
identifies “high risk” transactions.
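
To make the idea of fingerprint matching concrete, the following is a minimal Python sketch, not the DIDAFIT implementation. It assumes a fingerprint can be obtained by replacing the string and numeric literals of a trusted SQL statement with regular-expression placeholders; the function names (to_fingerprint, summarize, build_matcher, is_legitimate) and the literal patterns are hypothetical and chosen only for illustration. Whitespace normalization, escaped quotes, the summarization heuristics and the handling of incomplete logs are deliberately left out.

```python
import re

# Assumption: only single-quoted strings and plain numbers count as literals.
_LITERAL = re.compile(r"'[^']*'|\b\d+(?:\.\d+)?\b")


def to_fingerprint(sql: str) -> str:
    """Turn one trusted SQL statement into a regular-expression fingerprint."""
    parts = []
    pos = 0
    for m in _LITERAL.finditer(sql):
        # Keep the fixed query text verbatim (escaped for use in a regex).
        parts.append(re.escape(sql[pos:m.start()]))
        if m.group().startswith("'"):
            parts.append(r"'[^']*'")         # any quote-free string literal
        else:
            parts.append(r"\d+(?:\.\d+)?")   # any numeric literal
        pos = m.end()
    parts.append(re.escape(sql[pos:]))
    return "".join(parts)


def summarize(trusted_statements):
    """Collapse raw statements into their set of distinct fingerprints."""
    return {to_fingerprint(s) for s in trusted_statements}


def build_matcher(trusted_statements):
    """Combine all fingerprints into a single compiled alternation."""
    patterns = sorted(summarize(trusted_statements))
    return re.compile("|".join(f"(?:{p})" for p in patterns))


def is_legitimate(matcher, sql: str) -> bool:
    """Accept a statement only if it matches some fingerprint in full."""
    return matcher.fullmatch(sql) is not None


# Hypothetical log of statements observed from the legitimate application.
trusted = [
    "SELECT * FROM users WHERE name = 'alice'",
    "SELECT * FROM users WHERE name = 'bob'",
    "SELECT total FROM orders WHERE id = 42",
]
matcher = build_matcher(trusted)
```

In this sketch the two user lookups collapse into one fingerprint, so the three logged statements are represented by two patterns. A statement such as "SELECT * FROM users WHERE name = 'x' OR '1'='1'" is rejected because the text after the first string literal has no counterpart in any fingerprint.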