Generalized production rules—N-gram feature extraction from abstract syntax trees (AST) for code vectorization
| Title: | Generalized production rules—N-gram feature extraction from abstract syntax trees (AST) for code vectorization |
|---|---|
| Patent Number: | 12,026,631 |
| Issue Date: | July 02, 2024 |
| Appl. No: | 17/131944 |
| Application Filed: | December 23, 2020 |
| Abstract: | Herein is resource-constrained feature enrichment for analysis of parse trees such as suspicious database queries. In an embodiment, a computer receives a parse tree that contains many tree nodes. Each tree node is associated with a respective production rule that was used to generate the tree node. Extracted from the parse tree are many sequences of production rules having respective sequence lengths that satisfy a length constraint that accepts at least one fixed length that is greater than two. Each extracted sequence of production rules consists of respective production rules of a sequence of tree nodes in a respective directed tree path of the parse tree having a path length that satisfies that same length constraint. Based on the extracted sequences of production rules, a machine learning model generates an inference. In a bag of rules data structure, the extracted sequences of production rules are aggregated by distinct sequence and duplicates are counted. |
| Inventors: | Oracle International Corporation (Redwood Shores, CA, US) |
| Assignees: | Oracle International Corporation (Redwood Shores, CA, US) |
| Claim: | 1. A method comprising: receiving a parse tree that contains a plurality of tree nodes, wherein each tree node of the plurality of tree nodes is associated with a respective production rule that generated the tree node; extracting an extracted plurality of sequences of production rules having respective sequence lengths that satisfy a predefined length constraint that only accepts at least one fixed length that is greater than two, wherein each sequence of production rules of the extracted plurality of sequences of production rules consists of respective production rules of a sequence of tree nodes in a respective directed tree path of the parse tree having a path length that satisfies same said predefined length constraint that only accepts said at least one fixed length that is greater than two; and inferencing from the parse tree, by a machine learning (ML) model, based on the extracted plurality of sequences of production rules having respective sequence lengths that satisfy said predefined length constraint that only accepts said at least one fixed length that is greater than two. |
| Claim: | 2. The method of claim 1 wherein said extracting said extracted plurality of sequences of production rules comprises extracting said extracted plurality of sequences of production rules having respective sequence lengths that satisfy said predefined length constraint that only accepts said at least one fixed length. |
| Claim: | 3. The method of claim 1 wherein: the method further comprises counting a respective frequency of each distinct sequence of production rules in said extracted plurality of sequences of production rules; said ML model inferencing is further based on said frequencies of said distinct sequences of production rules in said extracted plurality of sequences of production rules. |
| Claim: | 4. The method of claim 3 wherein: the method further comprises: identifying a vocabulary plurality of distinct sequences of production rules having respective sequence lengths that satisfy said predefined length constraint in a plurality of known parse trees, generating a feature vector that contains a respective feature for each distinct sequence of production rules in said vocabulary plurality of distinct sequences of production rules, and respectively populating said features in said feature vector with said frequencies of said distinct sequences of production rules in said extracted plurality of sequences of production rules; said ML model inferencing is further based on said feature vector that contains said frequencies of said distinct sequences of production rules in said extracted plurality of sequences of production rules. |
| Claim: | 5. The method of claim 4 further comprising, based on said plurality of known parse trees, performing one selected from a group consisting of: supervised training the ML model and unsupervised training the ML model. |
| Claim: | 6. The method of claim 1 wherein said ML model inferencing based on the extracted plurality of sequences of production rules comprises inferencing based on the plurality of sequences, having said sequence lengths that satisfy said predefined length constraint that only accepts said at least one fixed length that is greater than two, of production rules of said parse tree of one selected from a group consisting of: a logic statement, a document object model (DOM), a scripting language script, and natural language. |
| Claim: | 7. The method of claim 6 wherein said ML model inferencing based on the extracted plurality of sequences of production rules comprises inferencing based on the plurality of sequences, having said sequence lengths that satisfy said predefined length constraint that only accepts said at least one fixed length that is greater than two, of production rules of said parse tree of said logic statement selected from a group consisting of: a database query, a structured query language (SQL) statement, a scripting language statement, and a statement of a general-purpose programming language. |
| Claim: | 8. The method of claim 1 wherein said extracting said extracted plurality of sequences of production rules of sequences of tree nodes in directed tree paths excludes tree paths that contain tree leaves. |
| Claim: | 9. The method of claim 1 wherein: said extracted plurality of sequences of production rules of sequences of tree nodes in directed tree paths contains a first sequence of production rules of a first sequence of tree nodes and a second sequence of production rules of a second sequence of tree nodes; said first sequence of tree nodes contains said second sequence of tree nodes. |
| Claim: | 10. The method of claim 1 wherein said ML model inferencing based on the extracted plurality of sequences of production rules comprises said ML model detecting that said parse tree is anomalous. |
| Claim: | 11. One or more non-transitory computer-readable media storing instructions that, when executed by one or more processors, cause: receiving a parse tree that contains a plurality of tree nodes, wherein each tree node of the plurality of tree nodes is associated with a respective production rule that generated the tree node; extracting an extracted plurality of sequences of production rules having respective sequence lengths that satisfy a predefined length constraint that only accepts at least one fixed length that is greater than two, wherein each sequence of production rules of the extracted plurality of sequences of production rules consists of respective production rules of a sequence of tree nodes in a respective directed tree path of the parse tree having a path length that satisfies same said predefined length constraint that only accepts said at least one fixed length that is greater than two; and inferencing from the parse tree, by a machine learning (ML) model, based on the extracted plurality of sequences of production rules having respective sequence lengths that satisfy said predefined length constraint that only accepts said at least one fixed length that is greater than two. |
| Claim: | 12. The one or more non-transitory computer-readable media of claim 11 wherein said extracting said extracted plurality of sequences of production rules comprises extracting said extracted plurality of sequences of production rules having respective sequence lengths that satisfy said predefined length constraint that only accepts said at least one fixed length. |
| Claim: | 13. The one or more non-transitory computer-readable media of claim 11 wherein: the instructions further cause counting a respective frequency of each distinct sequence of production rules in said extracted plurality of sequences of production rules; said ML model inferencing is further based on said frequencies of said distinct sequences of production rules in said extracted plurality of sequences of production rules. |
| Claim: | 14. The one or more non-transitory computer-readable media of claim 13 wherein: the instructions further cause: identifying a vocabulary plurality of distinct sequences of production rules having respective sequence lengths that satisfy said predefined length constraint in a plurality of known parse trees, generating a feature vector that contains a respective feature for each distinct sequence of production rules in said vocabulary plurality of distinct sequences of production rules, and respectively populating said features in said feature vector with said frequencies of said distinct sequences of production rules in said extracted plurality of sequences of production rules; said ML model inferencing is further based on said feature vector that contains said frequencies of said distinct sequences of production rules in said extracted plurality of sequences of production rules. |
| Claim: | 15. The one or more non-transitory computer-readable media of claim 14 wherein the instructions further cause, based on said plurality of known parse trees, performing one selected from a group consisting of: supervised training the ML model and unsupervised training the ML model. |
| Claim: | 16. The one or more non-transitory computer-readable media of claim 11 wherein said ML model inferencing based on the extracted plurality of sequences of production rules comprises inferencing based on the plurality of sequences, having said sequence lengths that satisfy said predefined length constraint that only accepts said at least one fixed length that is greater than two, of production rules of said parse tree of one selected from a group consisting of: a logic statement, a document object model (DOM), a scripting language script, and natural language. |
| Claim: | 17. The one or more non-transitory computer-readable media of claim 16 wherein said ML model inferencing based on the extracted plurality of sequences of production rules comprises inferencing based on the plurality of sequences, having said sequence lengths that satisfy said predefined length constraint that only accepts said at least one fixed length that is greater than two, of production rules of said parse tree of said logic statement selected from a group consisting of: a database query, a structured query language (SQL) statement, a scripting language statement, and a statement of a general-purpose programming language. |
| Claim: | 18. The one or more non-transitory computer-readable media of claim 11 wherein said extracting said extracted plurality of sequences of production rules of sequences of tree nodes in directed tree paths excludes tree paths that contain tree leaves. |
| Claim: | 19. The one or more non-transitory computer-readable media of claim 11 wherein: said extracted plurality of sequences of production rules of sequences of tree nodes in directed tree paths contains a first sequence of production rules of a first sequence of tree nodes and a second sequence of production rules of a second sequence of tree nodes; said first sequence of tree nodes contains said second sequence of tree nodes. |
| Claim: | 20. The one or more non-transitory computer-readable media of claim 11 wherein said ML model inferencing based on the extracted plurality of sequences of production rules comprises said ML model detecting that said parse tree is anomalous. |
| Patent References Cited: | 20140101117, April 2014, Uzzaman; 20180060211, March 2018, Allen; 20180096144, April 2018, Pan; 20190018614, January 2019, Balko; 20190188212, June 2019, Miller; 20200045049, February 2020, Apostolopoulos; 20200076840, March 2020, Peinador; 2128798, December 2009 |
| Other References: | Peinador, U.S. Appl. No. 16/122,398, filed Sep. 5, 2018, Notice of Allowance, dated Apr. 29, 2021; Peinador, U.S. Appl. No. 16/122,398, Final Office Action, dated Jan. 28, 2021; He et al., "A Reusable SQL Injection Detection Method for Java Web Applications", KSII Transactions on Internet and Information Systems, vol. 14, no. 6, Jun. 2020, 15 pages; Alon et al., "A General Path-Based Representation for Predicting Program Properties", Apr. 22, 2018, 16 pages; Zhang et al., "A Novel Neural Source Code Representation based on Abstract Syntax Tree", 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), 2019, 12 pages; Sharma et al., "A Survey on Machine Learning Techniques for Source Code Analysis", 2021, 59 pages; Rabinovich et al., "Abstract Syntax Networks for Code Generation and Semantic Parsing", 2017, 11 pages; Ndichu et al., "A machine learning approach to detection of JavaScript-based attacks using AST features and paragraph vectors", Applied Soft Computing 84, 2019, 11 pages; Follenfant et al., "SQL Abstract Syntax Trees Vocabulary", available: https://ns.inria.fr/ast/sql/index.html, Jan. 2014, 26 pages; Radford et al., "Improving language understanding by generative pre-training", 2018; Radford et al., "Language models are unsupervised multitask learners", OpenAI Blog, 1(8):9, 2019; Kanade et al., "Learning and Evaluating Contextual Embedding of Source Code", Proceedings of the 37th International Conference on Machine Learning, Jul. 13-18, 2020, vol. 119 of Proceedings of Machine Learning Research, pp. 5110-5121; Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Jun. 2-7, 2019, pp. 4171-4186; Alon et al., "code2vec: learning distributed representations of code", Proc. ACM Program. Lang., 3(POPL):40:1-40:29, 2019; Alon et al., "code2seq: Generating sequences from structured representations of code", 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA, May 6-9, 2019, OpenReview.net; Peinador, U.S. Appl. No. 16/122,398, filed Sep. 5, 2018, Office Action, dated Oct. 28, 2020; Jimenez et al., "On the Impact of Tokenizer and Parameters on N-Gram Based Code Analysis", 2018, 12 pages; An et al., "Variational Autoencoder based Anomaly Detection Using Reconstruction Probability", 2015-2 Special Lecture on IE, Dec. 27, 2015, 18 pages; Apel et al., "Learning SQL for Database Intrusion Detection using Context-sensitive Modelling", 2009, 33 pages; Bengio et al., "Learning Deep Architectures for AI", 2009, 71 pages; Bockermann et al., "Learning SQL for Database Intrusion Detection Using Context-Sensitive Modelling", DIMVA 2009, LNCS 5587, 10 pages; Cai et al., "An Abstract Syntax Tree Encoding Method for Cross-Project Defect Prediction", IEEE, Nov. 15, 2019, 10 pages; Ding et al., "PCA-Based Network Traffic Anomaly Detection", Tsinghua Science and Technology, vol. 21, no. 5, Oct. 2016, 10 pages; Garcia-Duran et al., "Learning Graph Representations with Embedding Propagation", 31st Conference on Neural Information Processing Systems (NIPS 2017), 12 pages; Gibert et al., "Graph Embedding in Vector Spaces", GbR 2011 mini-tutorial, 2011, 66 pages; Grover et al., "node2vec: Scalable Feature Learning for Networks", KDD '16, Aug. 13-17, 2016, San Francisco, CA, USA, 10 pages; Hamilton et al., "Inductive Representation Learning on Large Graphs", 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 11 pages; Huang et al., "Online System Problem Detection by Mining Patterns of Console Logs", Dec. 2009, 11 pages; Zhang et al., "Automated IT System Failure Prediction: A Deep Learning Approach", 2016, 11 pages; Liu et al., "Isolation Forest", 2008, 10 pages; Maglaras et al., "A real time OCSVM Intrusion Detection module with low overhead for SCADA systems", International Journal of Advanced Research in Artificial Intelligence, vol. 3, no. 10, 2014, 9 pages; Mou et al., "Building Program Vector Representations for Deep Learning", Sep. 11, 2014, 11 pages; Perozzi et al., "DeepWalk: Online Learning of Social Representations", 2014, 10 pages; Scholkopf et al., "Estimating the Support of a High-Dimensional Distribution", Nov. 27, 1999, 28 pages; Wei et al., "Graph embedding based feature selection", Neurocomputing 93, May 17, 2012, 11 pages; Xu et al., "Detecting Large-Scale System Problems by Mining Console Logs", SOSP '09, Oct. 11-14, 2009, 15 pages; Yamaguchi et al., "Generalized Vulnerability Extrapolation using Abstract Syntax Trees", ACSAC '12, Dec. 3-7, 2012, Orlando, Florida, USA, 10 pages; Yamanishi et al., "Dynamic Syslog Mining for Network Failure Monitoring", KDD '05, Aug. 21-24, 2005, Chicago, Illinois, USA, 10 pages; Yen et al., "Beehive: Large-Scale Log Analysis for Detecting Suspicious Activity in Enterprise Networks", ACSAC '13, Dec. 9-13, 2013, New Orleans, Louisiana, USA, 10 pages; Zhang et al., "Network Anomaly Detection Using One Class Support Vector Machine", Proceedings of the International MultiConference of Engineers and Computer Scientists 2008, vol. I, Mar. 19, 2008, 5 pages; Hamilton et al., "Representation Learning on Graphs: Methods and Applications", IEEE, 2017, 24 pages; Zhang et al., "A Novel Neural Source Code Representation Based on Abstract Syntax Tree", 2019 IEEE/ACM 41st ICSE, doi: 10.1109/ICSE.2019.00086, pp. 783-794, May 25, 2019, 12 pages; Zeng et al., "Fast Code Clone Detection Based on Weighted Recursive Autoencoders", IEEE Access, vol. 7, pp. 125062-125078, doi: 10.1109/ACCESS.2019.2938825, Sep. 2, 2019, 17 pages; White et al., "Deep Learning Code Fragments for Code Clone Detection", 31st IEEE/ACM ASE 2016, dx.doi.org/10.1145/2970276.2970326, pp. 87-98, Aug. 25, 2016, 12 pages; Parr, "The Definitive ANTLR 4 Reference", The Pragmatic Bookshelf, copyright 2012 The Pragmatic Programmers, LLC, book version Jan. 2013, 322 pages; Jiang et al., "Deckard: Scalable and Accurate Tree-based Detection of Code Clones", 29th ICSE '07, doi: 10.1109/ICSE.2007.30, pp. 96-105, May 24, 2007, 10 pages; Gao et al., "TECCD: A Tree Embedding Approach for Code Clone Detection", 2019 IEEE ICSME, doi: 10.1109/ICSME.2019.00025, pp. 145-156, 2019, 12 pages; Fang et al., "Functional code clone detection with syntax and semantics fusion learning", Proceedings of the 29th ACM SIGSOFT ISSTA '20, pp. 516-527, doi: 10.1145/3395363.3397362, Jul. 18, 2020, 12 pages (all cited by applicant) |
| Primary Examiner: | Kang, Insun |
| Attorney, Agent or Firm: | Hickman Becker Bingham Ledesma LLP Miller, Brian |
| Accession Number: | edspgr.12026631 |
| Database: | USPTO Patent Grants |
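The abstract and claim 1 describe extracting, from a parse tree, every sequence of production rules along a directed tree path whose length satisfies a fixed-length constraint greater than two, then aggregating duplicates in a "bag of rules". The following is a minimal illustrative sketch of that idea, not the patented implementation: the `Node` class, the rule names, and the toy tree are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical minimal tree node: each node records the production rule
# that generated it (rule names below are illustrative only).
@dataclass
class Node:
    rule: str
    children: list = field(default_factory=list)

def rule_ngrams(root, n):
    """Yield the sequence of production rules of every directed
    (root-to-descendant) tree path consisting of exactly n nodes."""
    def walk(node, path):
        path = path + [node.rule]
        if len(path) >= n:
            yield tuple(path[-n:])  # the length-n path ending at this node
        for child in node.children:
            yield from walk(child, path)
    yield from walk(root, [])

# Toy parse tree loosely resembling "SELECT col FROM tbl".
tree = Node("query", [
    Node("select_clause", [Node("column_ref", [Node("identifier")])]),
    Node("from_clause", [Node("table_ref", [Node("identifier")])]),
])

# Bag of rules: aggregate by distinct sequence and count duplicates.
bag = Counter(rule_ngrams(tree, 3))
```

With a fixed length of 3, the toy tree yields four distinct rule trigrams, one per directed path of three nodes; repeated structure in a larger tree would raise the corresponding counts.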
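Claims 4 and 14 describe building a vocabulary of distinct rule sequences from known parse trees and populating a feature vector with each sequence's frequency. A hedged sketch of that vectorization, with made-up rule sequences standing in for real grammar productions:

```python
from collections import Counter

def build_vocabulary(known_bags):
    """Union of distinct rule sequences across known parse trees, in a
    stable sorted order, so each sequence maps to one vector position."""
    vocab = sorted({seq for bag in known_bags for seq in bag})
    return {seq: i for i, seq in enumerate(vocab)}

def vectorize(bag, vocab):
    """Fixed-length feature vector: one feature per vocabulary sequence,
    populated with that sequence's frequency in this tree's bag."""
    vec = [0] * len(vocab)
    for seq, count in bag.items():
        if seq in vocab:  # sequences outside the vocabulary are dropped
            vec[vocab[seq]] = count
    return vec

# Hypothetical bags of rule trigrams from two known parse trees.
known = [
    Counter({("query", "select", "column"): 2, ("query", "from", "table"): 1}),
    Counter({("query", "from", "table"): 1, ("query", "where", "predicate"): 3}),
]
vocab = build_vocabulary(known)
vec = vectorize(known[0], vocab)  # dense frequency vector for the first tree
```

Every tree, known or new, is thus mapped to a vector of the same length, which is what a downstream ML model consumes for training or inferencing.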
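Claims 10 and 20 cover the ML model detecting that a parse tree is anomalous. As a simple frequency-based stand-in for the patented model (not the claimed method; the scoring rule and rule sequences here are invented for illustration), one can score a tree by how rare its rule sequences were in the training corpus:

```python
from collections import Counter

def anomaly_score(bag, training_counts, total):
    """Mean rarity of the tree's rule sequences; a sequence never seen
    during training contributes the maximum rarity of 1.0."""
    if not bag:
        return 0.0
    rarities = [1.0 - training_counts.get(seq, 0) / total
                for seq in bag.elements()]
    return sum(rarities) / len(rarities)

# Hypothetical aggregated trigram counts from known (benign) parse trees.
training_counts = Counter({("q", "sel", "col"): 9, ("q", "from", "tab"): 1})
total = sum(training_counts.values())  # 10 training observations

normal = Counter({("q", "sel", "col"): 3})                        # common shape
suspicious = Counter({("q", "union", "sel"): 2, ("q", "from", "tab"): 1})

score_normal = anomaly_score(normal, training_counts, total)
score_suspicious = anomaly_score(suspicious, training_counts, total)
```

A tree dominated by never-before-seen rule sequences, such as an injected `UNION` subquery in an otherwise simple workload, scores close to 1.0 and can be flagged, while trees built from familiar sequences score near 0.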