Today, many scientific data sets are open to the public. For their operators, it is important to know what users are interested in. In this paper, we study the problem of extracting and analyzing patterns from the query log of a database. We focus on design errors (antipatterns), which typically lead to unnecessary SQL statements. Such antipatterns not only degrade performance but also bias any subsequent analysis of the SQL log. We propose a framework designed to discover patterns and antipatterns in arbitrary SQL query logs and to clean antipatterns. To study the usefulness of our approach and to gain insight into the existence of antipatterns in real-world systems, we examine the SQL log of the SkyServer project, which contains more than 40 million queries. Among the top 15 patterns, we found 6 antipatterns. This and other results lead to the conclusion that antipatterns may distort refactoring and any other downstream analyses.
Query-log analysis is currently a field of intensive study. An elementary distinction is between web logs and SQL logs. Studies on web-log processing tend to focus on understanding user behaviour through information-seeking activities. One line of work studies web search engine optimization by mining past queries; another proposes a context-aware query recommendation approach by mining click-through and session data.
Another promising branch of SQL log analysis is diagnosing and repairing data errors caused by erroneous updates. One such approach works with a log of update queries (UPDATE, INSERT and DELETE statements) and a set of known data errors to find and fix mistakes within a dataset.
Overall, query log analysis has different research threads, namely query recommendation, understanding user behavior and diagnosing errors.
The processing of DML queries does not address our specific problem – finding antipatterns.
Existing approaches do not help with antipatterns in an already existing query log.
The purpose of our framework is to analyze query logs. Depending on the analysis target, we intend to find query templates or patterns (series of query templates) within the log, or identify and solve antipatterns.
Original Query Log
The original query log is the only input that is required. Our approach does not need access to the database and does not place any additional load on it. We require the log to consist of SQL statements together with their execution time.
The first processing step is deleting duplicate queries. We regard duplicates as unintended errors and record the number of removed duplicates in the result statistics. Conceptually, we argue that two identical statements executed by the same user stand for the same information need only if the time difference between them is smaller than a chosen threshold.
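As a minimal sketch of this step, the following Java class drops a statement when the same user issued the identical text less than a threshold earlier, and counts the removals. The `LogEntry` layout, the class and method names, and the choice to compare against the most recent occurrence (rather than the most recent kept one) are assumptions made for illustration; they are not taken from the framework itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** One log record. The user and timestamp fields are assumed, not prescribed by the source. */
class LogEntry {
    final String user;
    final long timeMillis; // time at which the statement was executed
    final String sql;

    LogEntry(String user, long timeMillis, String sql) {
        this.user = user;
        this.timeMillis = timeMillis;
        this.sql = sql;
    }
}

class DuplicateFilter {
    /** Number of entries dropped by the most recent call to removeDuplicates. */
    int removed = 0;

    /**
     * Expects the log sorted by ascending time. An entry is treated as a duplicate
     * when the same user ran the identical statement less than thresholdMillis
     * before it.
     */
    List<LogEntry> removeDuplicates(List<LogEntry> log, long thresholdMillis) {
        Map<String, Long> lastSeen = new HashMap<>();
        List<LogEntry> kept = new ArrayList<>();
        removed = 0;
        for (LogEntry e : log) {
            String key = e.user + "\n" + e.sql; // user and statement together identify a candidate
            Long previous = lastSeen.get(key);
            if (previous != null && e.timeMillis - previous < thresholdMillis) {
                removed++; // counted for the result statistics
            } else {
                kept.add(e);
            }
            lastSeen.put(key, e.timeMillis);
        }
        return kept;
    }
}
```

With a threshold of 1000 ms, a repeat by the same user after 500 ms is dropped, whereas the same text from another user, or from the same user several seconds later, is kept.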
Parsing Statements and Parsed Query Log
In this step, we parse all queries and build a syntax tree for each of them. If parsing finds a syntax error, the statement is not considered any further. We also exclude non-SELECT statements. The parsing procedure also extracts the SELECT, FROM and WHERE subtrees and their skeleton forms for each statement.
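To illustrate what a skeleton form might look like, the sketch below replaces constants with placeholders so that statements differing only in their literals map to the same text. It is a deliberate simplification: it uses regular expressions on the whole statement rather than a syntax tree, it does not separate the SELECT, FROM and WHERE parts, and it does not handle scientific notation. The class name, method names and placeholder tokens are hypothetical.

```java
import java.util.regex.Pattern;

class Skeleton {
    // Single-quoted SQL string, with '' as the escaped quote.
    private static final Pattern STRING_LITERAL = Pattern.compile("'(?:[^']|'')*'");
    // Integer or decimal number that is not part of an identifier such as col1.
    private static final Pattern NUMBER_LITERAL =
            Pattern.compile("(?<![A-Za-z0-9_])\\d+(?:\\.\\d+)?(?![A-Za-z0-9_])");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** True when the statement starts with SELECT, ignoring case and leading blanks. */
    static boolean isSelect(String sql) {
        return sql.trim().regionMatches(true, 0, "SELECT", 0, 6);
    }

    /** Replaces literals with placeholders and collapses runs of whitespace. */
    static String of(String sql) {
        // Strings are replaced first so that digits inside them are not touched.
        String s = STRING_LITERAL.matcher(sql).replaceAll("<str>");
        s = NUMBER_LITERAL.matcher(s).replaceAll("<num>");
        return WHITESPACE.matcher(s).replaceAll(" ").trim();
    }
}
```

For a made-up statement such as `SELECT ra FROM stars WHERE id = 42 AND name = 'Vega'`, `Skeleton.of` yields `SELECT ra FROM stars WHERE id = <num> AND name = <str>`, and any statement with other constants in the same positions yields the same result.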
Query Templates and Patterns
After one cleaning step, further solvable antipatterns may remain. To check this, one needs to parse the statements again and possibly solve the remaining antipatterns.
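The repeated parse-and-solve cycle can be read as a fixpoint iteration: apply a cleaning pass until the log no longer changes. The sketch below shows only this control flow. The cleaning pass is supplied by the caller, and the `maxPasses` bound is a hypothetical safeguard against passes that never stabilise; neither is specified by the source.

```java
import java.util.List;
import java.util.function.UnaryOperator;

class IterativeCleaner {
    /**
     * Applies cleaningPass repeatedly and returns the log once a pass leaves it
     * unchanged, or after maxPasses passes, whichever comes first.
     */
    static List<String> cleanUntilStable(List<String> log,
                                         UnaryOperator<List<String>> cleaningPass,
                                         int maxPasses) {
        List<String> current = log;
        for (int pass = 0; pass < maxPasses; pass++) {
            List<String> next = cleaningPass.apply(current);
            if (next.equals(current)) {
                return current; // nothing left to solve
            }
            current = next;
        }
        return current;
    }
}
```

The pass is expected to return a new list rather than modify its argument, so that the comparison between two consecutive passes stays meaningful.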
Solving Antipatterns, Clean Query Log and Statistics
The procedure returns a cleaned query log and statistical information regarding antipatterns solved.
We propose a solution for the detection of patterns and for solving antipatterns in such a log.
Moreover, it is feasible to remove the most frequently occurring antipatterns.
Clearly, queries suggested by a recommender system must not contain antipatterns.
If the rate is now much smaller, our approach is clearly more useful than if it is not.
System : Pentium IV 2.4 GHz.
Hard Disk : 40 GB.
Floppy Drive : 1.44 MB.
Monitor : 15-inch VGA Colour.
Mouse : Logitech.
RAM : 512 MB.
Operating System : Windows XP/7.
Coding Language : Java/J2EE.
IDE : NetBeans 7.4.
Database : MySQL.
H. V. Nguyen et al., "Identifying User Interests within the Data Space – a Case Study with SkyServer," in EDBT, Brussels, 2015.