Database workload analysis and optimization visualizations

Saved in:
Bibliographic Details
Title: Database workload analysis and optimization visualizations
Patent Number: 10255,335
Publication Date: April 09, 2019
Appl. No: 15/345375
Application Filed: November 07, 2016
Abstract: Techniques are described for analyzing usage of data stored in a data storage system without accessing the stored data. In some embodiments, workload data indicative of queries executed at the data storage system on stored data is received. This workload data can include query logs generated during execution of the queries. The workload data is processed to identify data elements such as tables, columns, and views associated with the stored data as well as information regarding usage of the identified data elements. Usage can include operations performed on the data elements during execution of the queries. Based on this processing relationships between the identified data elements can be inferred and visualizations generated that convey information regarding usage of the data stored at the data storage system. Visualizations can include, among others, usage heatmap diagrams, join diagrams, column family diagrams, filter diagrams, view lineage diagrams, data flow diagrams, denormalization diagrams, and workload distribution diagrams.
Inventors: Cloudera, Inc. (Palo Alto, CA, US)
Assignees: Cloudera, Inc. (Palo Alto, CA, US)
Claim: 1. A method for causing display of a usage heatmap diagram indicative of usage of data stored in a data storage system, the method comprising: receiving workload data indicative of queries executed at the data storage system on the stored data; processing the workload data to identify a plurality of tables present in the stored data, each of the identified tables including one or more columns; processing the workload data to determine relationships between the identified tables based on operations performed during execution of the queries; and causing display of the usage heatmap diagram based on the processing, the usage heatmap diagram including: a plurality of table nodes representative of the identified plurality of tables, each of the plurality of table nodes visually linked to one or more of the other plurality of table nodes based on the determined relationships; wherein each of the plurality of table nodes is displayed as a graphical object in a color corresponding to relative usage of the represented table and includes a graphical listing of a plurality of column identifiers representative of a plurality of columns present in the represented table, each of the one or more column identifiers ordered based on relative usage of the represented columns and displayed in a color corresponding to a category of operation associated with usage of the represented columns.
Claim: 2. A method for facilitating analysis of the usage of data stored in a data storage system without accessing the stored data, the method comprising: receiving workload data indicative of queries executed at the data storage system on the stored data; processing the workload data to identify a plurality of data elements associated with the stored data; processing the workload data to determine relationships between the identified plurality of data elements based on operations performed during execution of the queries; and causing display of an interactive visualization including a usage heatmap diagram based on the processing, the usage diagram including: a plurality of nodes representative of the plurality of data elements present in the stored data, each of the plurality of nodes visually linked to one or more of the other plurality of nodes based on the determined relationships between the plurality of data elements; wherein each of the plurality of nodes is displayed as a graphical object indicative of a relative usage of a corresponding data element.
Claim: 3. The method of claim 2 , wherein the workload data includes any of a query execution log or a data definition language (DDL) statement.
Claim: 4. The method of claim 2 , wherein the workload data includes timestamped machine-generated data resulting from execution of the queries.
Claim: 5. The method of claim 2 , wherein the identified data elements include any of a table, a column, or a view.
Claim: 6. The method of claim 2 , wherein each of the plurality of data elements is a table, and wherein each of the plurality of nodes includes: an identifier associated with the table; and a listing of column identifiers representative of columns present in the table, each of the column identifiers including information regarding usage of the columns.
Claim: 7. The method of claim 6 , wherein the identifier associated with the table includes any of: a name of the table; or an indication that the table is a fact table or a dimension table.
Claim: 8. The method of claim 6 , wherein a particular node representative of a particular table is visually represented as a rectangular box and wherein column identifiers representative of columns present in the particular table are visually represented as smaller rectangular boxes enclosed in the rectangular box representative of the particular table.
Claim: 9. The method of claim 6 , wherein a particular node representative of a particular table is visually represented as a rectangular graphical object of a particular color, wherein the particular color is indicative of any of: level of usage of columns present in the particular table; or a category of operation associated with usage of columns present in the particular table.
Claim: 10. The method of claim 6 , wherein a particular column identifier representative of a particular column present in a particular table is visually represented as a rectangular graphical object of a particular color, wherein the particular color is indicative of any of: level of usage of the particular column; or a category of operation associated with usage of the particular column.
Claim: 11. The method of claim 6 , wherein for a particular node representative of a particular table, the listed column identifiers representative of columns present in the particular table are ordered based on relative usage.
Claim: 12. The method of claim 7 , wherein a particular column identifier representative of a particular column includes any of: a name of the particular column; an indication of a category of operation associated with usage of the particular column; or statistical information regarding usage of the particular column.
Claim: 13. The method of claim 6 , wherein the information regarding usage of the columns includes an indication of a category of operation associated with usage of the columns, the category of usage selected from the following categories: “select,” “filter,” “group by,” or “order by.”
Claim: 14. The method of claim 2 , further comprising: causing display of an option to select from a plurality of categories of operation associated with usage of the stored data; receiving a user selection of a particular category of operation via the option; and dynamically updating display of the usage diagram in response to the selected particular category of operation.
Claim: 15. The method of claim 6 , further comprising: causing display of statistical information regarding usage of a particular column in response to detecting a user interaction with a particular column identifier in the usage heatmap diagram that is representative of the particular column.
Claim: 16. The method of claim 2 , wherein the interactive visualization further includes a join diagram, the join diagram including: a plurality of table nodes representative of a plurality of tables present in the stored data, each of the plurality of table nodes visually linked to one or more of the other plurality of table nodes based on inferred join relationships; wherein a particular inferred join relationship between a particular two or more of the plurality of table nodes is indicative of a level of query operations joining the particular two or more tables represented by the two or more table nodes as observed in the workload data.
Claim: 17. The method of claim 16 , further comprising: dynamically updating the join diagram to highlight the particular two or more nodes associated with a particular inferred join relationship in response to a user interaction with any of the particular two or more nodes.
Claim: 18. The method of claim 16 , further comprising: dynamically updating the join diagram to cause display of join key and/or a category of join operation associated with a particular inferred join relationship in response to a user interaction with any of the particular two or more nodes.
Claim: 19. The method of claim 2 , wherein the interactive visualization further includes a column family diagram, the column family diagram including: a graphical object representative of a table present in the stored data, the graphical object including a listing of column identifiers representative of columns present in the table and related to other columns by inferred aggregation relationships; wherein a particular inferred aggregation relationship between a particular two or more columns is indicative of a level of query operations aggregating the values of the particular two or more columns as observed in the workload data.
Claim: 20. The method of claim 19 , wherein the query operations aggregating values of the particular two or more columns include a “group by” query operation.
Claim: 21. The method of claim 2 , wherein the interactive visualization further includes a filter diagram, the filter diagram including: a graphical object representative of a data structure associated with the stored data; and a line bisecting the graphical object, the lines indicative of a filter operation applied to the data structure as observed in the workload data.
Claim: 22. The method of claim 21 , wherein the line bisecting the graphical object includes a value identifier indicative of a value associated with the filter operation.
Claim: 23. The method of claim 2 , wherein the interactive visualization further includes a view lineage diagram, the view lineage diagram including: a view node representative of a view upon which a query was executed at the data storage system; and plurality of table nodes representative of tables upon which the view depends, each of the plurality of table nodes visually linked to the view node.
Claim: 24. The method of claim 2 , wherein the interactive visualization further includes a data flow diagram, the data flow diagram including: a plurality of table nodes representative of a plurality of tables present in the stored data; a query node representative of one or more queries executed on the plurality of tables, the query node visually linked to the plurality of table nodes; and a transient table node representative of one or more transient tables generated as part of a processing pipeline during execution of one or more queries, the transient table nodes visually linked to the query node.
Claim: 25. The method of claim 24 , wherein the data flow diagram further includes: a second query node representative of one or more queries executed on the one or more transient tables, the second query node visually linked to the transient table node.
Claim: 26. The method of claim 2 , wherein the interactive visualization further includes a denormalization diagram, the denormalization diagram including: a plurality of table nodes representative of a plurality of tables present in the stored data, each of the plurality of table nodes visually linked to one or more of the other plurality of table nodes based on inferred join relationships; wherein a particular inferred join relationship between a particular subset of the plurality of table nodes is indicative of a level of query operations joining the particular subset of tables represented by the two or more table nodes as observed in the workload data; and a denormalization recommendation to combine the particular subset of table nodes into fewer tables using an alternative data storage framework.
Claim: 27. The method of claim 26 , wherein denormalization recommendation includes an option to generate a code template for use in migrating stored data associated with the particular subset of table nodes into the alternative data storage framework.
Claim: 28. The method of claim 26 , wherein the alternative data storage framework is based on any of Hadoop, Impala, or Hive.
Claim: 29. The method of claim 2 , wherein the interactive visualization further includes a workload distribution diagram, the workload distribution diagram including: a visual representation of a measure of the queries indicated in the workload data split according to one or more of the following categories: originating user; originating application; originating report; query complexity; or compatibility with a target alternative data storage framework.
Claim: 30. The method of claim 2 , the interactive visualization includes a plurality of insights displayed via an interactive accordion box interface element, wherein each interface element is collapsible and expandable in response to user interaction, and wherein one or more of the interface elements display any of: the usage diagram; a join diagram; a column family diagram; a filter diagram; a view lineage diagram; a data flow diagram; a denormalization diagram; or a workload distribution diagram.
Claim: 31. A system for facilitating analysis of the usage of data stored in a data storage system without accessing the stored data, the system comprising: a processing unit; and a memory coupled to the processing unit, the memory having instructions stored thereon, which when executed by the processing unit, cause the system to: receive workload data indicative of queries executed at the data storage system on the stored data; process the workload data to identify data elements associated with the stored data; process the workload data to determine relationships between the identified plurality of data elements based on operations performed during execution of the queries; and cause display of an interactive visualization including a usage diagram based on the processing, the usage diagram including: a plurality of nodes representative of the plurality of data elements present in the stored data, each of the plurality of nodes visually linked to one or more of the other plurality of nodes based on the determined relationships between the plurality of data elements; wherein each of the plurality of nodes is displayed as a graphical object indicative of a relative usage of a corresponding data element.
Patent References Cited: 6957225 October 2005 Zait
2001/0044795 November 2001 Cohen
2003/0037034 February 2003 Daniels
2005/0055357 March 2005 Campbell
2011/0196857 August 2011 Chen
2013/0080416 March 2013 Meade
2013/0159333 June 2013 Assam
2015/0178366 June 2015 Farahbod
2016/0224616 August 2016 Beacom
Primary Examiner: Morrison, Jay A
Attorney, Agent or Firm: Perkins Coie LLP
Accession Number: edspgr.10255335
Database: USPTO Patent Grants
Description
Abstract:Techniques are described for analyzing usage of data stored in a data storage system without accessing the stored data. In some embodiments, workload data indicative of queries executed at the data storage system on stored data is received. This workload data can include query logs generated during execution of the queries. The workload data is processed to identify data elements such as tables, columns, and views associated with the stored data as well as information regarding usage of the identified data elements. Usage can include operations performed on the data elements during execution of the queries. Based on this processing relationships between the identified data elements can be inferred and visualizations generated that convey information regarding usage of the data stored at the data storage system. Visualizations can include, among others, usage heatmap diagrams, join diagrams, column family diagrams, filter diagrams, view lineage diagrams, data flow diagrams, denormalization diagrams, and workload distribution diagrams.