GENERALIZED PRODUCTION RULES - N-GRAM FEATURE EXTRACTION FROM ABSTRACT SYNTAX TREES (AST) FOR CODE VECTORIZATION

Detailed bibliography
Title:	GENERALIZED PRODUCTION RULES - N-GRAM FEATURE EXTRACTION FROM ABSTRACT SYNTAX TREES (AST) FOR CODE VECTORIZATION
Document Number:	20220198294
Publication Date:	June 23, 2022
Appl. No:	17/131944
Application Filed:	December 23, 2020
Abstract:	Herein is resource-constrained feature enrichment for analysis of parse trees such as suspicious database queries. In an embodiment, a computer receives a parse tree that contains many tree nodes. Each tree node is associated with a respective production rule that was used to generate the tree node. Extracted from the parse tree are many sequences of production rules having respective sequence lengths that satisfy a length constraint that accepts at least one fixed length that is greater than two. Each extracted sequence of production rules consists of respective production rules of a sequence of tree nodes in a respective directed tree path of the parse tree having a path length that satisfies that same length constraint. Based on the extracted sequences of production rules, a machine learning model generates an inference. In a bag of rules data structure, the extracted sequences of production rules are aggregated by distinct sequence and duplicates are counted.
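The extraction and counting steps described in the abstract can be illustrated with a minimal sketch. It is not the applicant's implementation: the `Node` class, the function names `rule_ngrams` and `bag_of_rules`, and the toy rule names (`query`, `select_list`, `expr`, `column`) are all hypothetical. The sketch assumes each tree node stores the name of the production rule that generated it, and that a "directed tree path" runs from an ancestor down to a descendant.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical parse-tree node labelled with its production rule."""
    rule: str
    children: list[Node] = field(default_factory=list)


def rule_ngrams(root: Node, lengths=(3,)):
    """Yield rule sequences along downward paths whose length is in `lengths`."""
    allowed = set(lengths)
    longest = max(allowed)

    def walk(node, ancestors):
        # Keep only as many trailing rules as the longest accepted length needs.
        path = (ancestors + (node.rule,))[-longest:]
        # Every downward path ends at exactly one node, so emitting the
        # suffixes that end here visits each qualifying path once.
        for n in allowed:
            if len(path) >= n:
                yield path[-n:]
        for child in node.children:
            yield from walk(child, path)

    yield from walk(root, ())


def bag_of_rules(root: Node, lengths=(3,)) -> Counter:
    """Aggregate extracted sequences by distinct sequence, counting duplicates."""
    return Counter(rule_ngrams(root, lengths))


# Toy tree: query -> select_list -> two expr nodes, each with one column child.
tree = Node("query", [
    Node("select_list", [
        Node("expr", [Node("column")]),
        Node("expr", [Node("column")]),
    ]),
])

# Length constraint accepting two fixed lengths, both greater than two.
bag = bag_of_rules(tree, lengths=(3, 4))
for sequence, count in sorted(bag.items()):
    print(count, " > ".join(sequence))
```

Passing a single length such as `lengths=(3,)` gives plain trigrams, while several lengths produce overlapping sequences in which a longer sequence contains a shorter one.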
Claim:	1. A method comprising: receiving a parse tree that contains a plurality of tree nodes, wherein each tree node of the plurality of tree nodes is associated with a respective production rule that generated the tree node; extracting an extracted plurality of sequences of production rules having respective sequence lengths that satisfy a length constraint that accepts at least one fixed length that is greater than two, wherein each sequence of production rules of the plurality of sequences of production rules consists of respective production rules of a sequence of tree nodes in a respective directed tree path of the parse tree having a path length that satisfies same said length constraint that accepts said at least one fixed length that is greater than two; inferencing, by a machine learning (ML) model, based on the plurality of sequences of production rules having respective sequence lengths that satisfy said length constraint that accepts at least one fixed length that is greater than two.
Claim:	2. The method of claim 1 wherein said extracting said plurality of sequences of production rules comprises extracting said plurality of sequences of production rules having respective sequence lengths that satisfy said length constraint that accepts a plurality of fixed lengths.
Claim:	3. The method of claim 1 wherein: the method further comprises counting a respective frequency of each distinct sequence of production rules in said extracted plurality of sequences of production rules; said ML model inferencing is further based on said frequencies of said distinct sequences of production rules in said extracted plurality of sequences of production rules.
Claim:	4. The method of claim 3 wherein: the method further comprises: identifying a vocabulary plurality of distinct sequences of production rules having respective sequence lengths that satisfy said length constraint in a plurality of known parse trees, generating a feature vector that contains a respective feature for each distinct sequence of production rules in said vocabulary plurality of distinct sequences of production rules, and respectively populating said features of said feature vector with said frequencies of said distinct sequences of production rules in said extracted plurality of sequences of production rules; said ML model inferencing is further based on said feature vector that contains said frequencies of said distinct sequences of production rules in said extracted plurality of sequences of production rules.
Claim:	5. The method of claim 4 further comprising, based on said plurality of known parse trees, performing one selected from the group consisting of: supervised training the ML model and unsupervised training the ML model.
Claim:	6. The method of claim 1 wherein said ML model inferencing based on the plurality of sequences of production rules comprises inferencing based on the plurality of sequences, having said sequence lengths that satisfy said length constraint that accepts said at least one fixed length that is greater than two, of production rules of said parse tree of one selected from the group consisting of: a logic statement, a document object model (DOM), a scripting language script, and natural language.
Claim:	7. The method of claim 6 wherein said ML model inferencing based on the plurality of sequences of production rules comprises inferencing based on the plurality of sequences, having said sequence lengths that satisfy said length constraint that accepts said at least one fixed length that is greater than two, of production rules of said parse tree of said logic statement selected from the group consisting of: a database query, a structured query language (SQL) statement, a scripting language statement, and a statement of a general-purpose programming language.
Claim:	8. The method of claim 1 wherein said extracting said extracted plurality of sequences of production rules of sequences of tree nodes in directed tree paths excludes tree paths that contain tree leaves.
Claim:	9. The method of claim 1 wherein: said extracted plurality of sequences of production rules of sequences of tree nodes in directed tree paths contains a first sequence of production rules of a first sequence of tree nodes and a second sequence of production rules of a second sequence of tree nodes; said first sequence of tree nodes contains said second sequence of tree nodes.
Claim:	10. The method of claim 1 wherein said ML model inferencing based on the plurality of sequences of production rules comprises said ML model detecting that said parse tree is anomalous.
Claim:	11. One or more non-transitory computer-readable media storing instructions that, when executed by one or more processors, cause: receiving a parse tree that contains a plurality of tree nodes, wherein each tree node of the plurality of tree nodes is associated with a respective production rule that generated the tree node; extracting an extracted plurality of sequences of production rules having respective sequence lengths that satisfy a length constraint that accepts at least one fixed length that is greater than two, wherein each sequence of production rules of the plurality of sequences of production rules consists of respective production rules of a sequence of tree nodes in a respective directed tree path of the parse tree having a path length that satisfies same said length constraint that accepts said at least one fixed length that is greater than two; inferencing, by a machine learning (ML) model, based on the plurality of sequences of production rules having respective sequence lengths that satisfy said length constraint that accepts at least one fixed length that is greater than two.
Claim:	12. The one or more non-transitory computer-readable media of claim 11 wherein said extracting said plurality of sequences of production rules comprises extracting said plurality of sequences of production rules having respective sequence lengths that satisfy said length constraint that accepts a plurality of fixed lengths.
Claim:	13. The one or more non-transitory computer-readable media of claim 11 wherein: the instructions further cause counting a respective frequency of each distinct sequence of production rules in said extracted plurality of sequences of production rules; said ML model inferencing is further based on said frequencies of said distinct sequences of production rules in said extracted plurality of sequences of production rules.
Claim:	14. The one or more non-transitory computer-readable media of claim 13 wherein: the instructions further cause: identifying a vocabulary plurality of distinct sequences of production rules having respective sequence lengths that satisfy said length constraint in a plurality of known parse trees, generating a feature vector that contains a respective feature for each distinct sequence of production rules in said vocabulary plurality of distinct sequences of production rules, and respectively populating said features of said feature vector with said frequencies of said distinct sequences of production rules in said extracted plurality of sequences of production rules; said ML model inferencing is further based on said feature vector that contains said frequencies of said distinct sequences of production rules in said extracted plurality of sequences of production rules.
Claim:	15. The one or more non-transitory computer-readable media of claim 14 wherein the instructions further cause based on said plurality of known parse trees, performing one selected from the group consisting of: supervised training the ML model and unsupervised training the ML model.
Claim:	16. The one or more non-transitory computer-readable media of claim 11 wherein said ML model inferencing based on the plurality of sequences of production rules comprises inferencing based on the plurality of sequences, having said sequence lengths that satisfy said length constraint that accepts said at least one fixed length that is greater than two, of production rules of said parse tree of one selected from the group consisting of: a logic statement, a document object model (DOM), a scripting language script, and natural language.
Claim:	17. The one or more non-transitory computer-readable media of claim 16 wherein said ML model inferencing based on the plurality of sequences of production rules comprises inferencing based on the plurality of sequences, having said sequence lengths that satisfy said length constraint that accepts said at least one fixed length that is greater than two, of production rules of said parse tree of said logic statement selected from the group consisting of: a database query, a structured query language (SQL) statement, a scripting language statement, and a statement of a general-purpose programming language.
Claim:	18. The one or more non-transitory computer-readable media of claim 11 wherein said extracting said extracted plurality of sequences of production rules of sequences of tree nodes in directed tree paths excludes tree paths that contain tree leaves.
Claim:	19. The one or more non-transitory computer-readable media of claim 11 wherein: said extracted plurality of sequences of production rules of sequences of tree nodes in directed tree paths contains a first sequence of production rules of a first sequence of tree nodes and a second sequence of production rules of a second sequence of tree nodes; said first sequence of tree nodes contains said second sequence of tree nodes.
Claim:	20. The one or more non-transitory computer-readable media of claim 11 wherein said ML model inferencing based on the plurality of sequences of production rules comprises said ML model detecting that said parse tree is anomalous.
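Claims 4 and 14 describe building a vocabulary of distinct rule sequences from known parse trees and populating a feature vector with the frequencies observed in a new parse tree. A minimal sketch of that step follows. It assumes each parse tree has already been reduced to a mapping from rule sequence (a tuple of rule names) to count; the function names and the toy rule names are hypothetical. Sorting the vocabulary and dropping sequences that are absent from it are choices made for this illustration, not details stated in the claims.

```python
from collections import Counter


def build_vocabulary(known_bags):
    """Return every distinct rule sequence seen in the known parse trees.

    Sorting fixes the position of each feature (an assumption of this sketch).
    """
    return sorted(set().union(*known_bags))


def to_feature_vector(bag, vocabulary):
    """Populate one feature per vocabulary entry with that sequence's frequency.

    Sequences in `bag` that are not in the vocabulary are ignored here.
    """
    return [bag.get(sequence, 0) for sequence in vocabulary]


# Bags of rule sequences for two known parse trees (toy data).
known_bags = [
    Counter({
        ("query", "select_list", "expr"): 2,
        ("select_list", "expr", "column"): 2,
    }),
    Counter({("query", "where", "predicate"): 1}),
]
vocabulary = build_vocabulary(known_bags)

# Bag for a new parse tree; one of its sequences was never seen before.
new_bag = Counter({
    ("query", "where", "predicate"): 3,
    ("query", "union", "query"): 1,
})
vector = to_feature_vector(new_bag, vocabulary)
print(vocabulary)
print(vector)
```

A vector built this way has one fixed-position feature per vocabulary entry, so vectors from different parse trees are comparable and could be passed to a model trained on the known parse trees.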
Current International Class:	06; 06; 06; 06; 06
Accession Number:	edspap.20220198294
Database:	USPTO Patent Applications
