Handbook of statistical analysis and data mining applications

A comprehensive professional reference book that guides business analysts, scientists, engineers and researchers, both academic and industrial, through all stages of data analysis, model building and implementation. The handbook helps users discern technical and business problems, understand the str...

Bibliographic details
Main authors:	Nisbet, Robert; Miner, Gary; Yale, Ken; Elder, John; Peterson, Andy
Format:	eBook; Book
Language:	English
Published:	London: Academic Press, 2018 (Elsevier Science & Technology)
Edition:	2
Subjects:	Data mining; Data mining > Statistical methods; Statistical methods; Statistics
ISBN:	0124166326, 9780124166325

Contents:

Front Cover -- Handbook of Statistical Analysis and Data Mining Applications -- Copyright -- Contents -- List of Tutorials on the Elsevier Companion Web Page -- Foreword 1 for 1st Edition -- Foreword 2 for 1st Edition -- Preface -- Overall Organization of This Book -- Introduction -- Patterns of Action -- Human Intuition -- Putting It All Together -- Frontispiece -- Biographies of the Primary Authors of This Book -- Part I: History of Phases of Data Analysis, Basic Theory, and the Data Mining Process -- Chapter 1: The Background for Data Mining Practice -- Preamble -- Data Mining or Predictive Analytics? -- A Short History of Statistics and Predictive Analytics -- Modern Statistics: A Duality? -- The Contributions of Sir Ronald Fisher -- Two Views of Reality -- Aristotle -- Plato -- The Rise of Modern Statistical Analysis: The Second Generation -- Data, Data Everywhere… -- Machine Learning Methods: The Third Generation -- Statistical Learning Theory: The Fourth Generation -- Reinforced and Deep Learning -- Current Trends of Development in Predictive Analytics -- Postscript -- References -- Chapter 2: Theoretical Considerations for Data Mining -- Preamble -- The Scientific Method -- What Is Data Mining? -- A Theoretical Framework for the Data Mining Process -- Microeconomic Approach -- Inductive Database Approach -- Strengths of the Data Mining Process -- Customer-Centric Versus Account-Centric: A New Way to Look at Your Data -- The Physical Data Mart -- The Virtual Data Mart -- Householded Databases -- The Data Paradigm Shift -- Creation of the Car -- Major Activities of Data Mining -- Major Challenges of Data Mining -- General Examples of Data Mining Applications -- Major Issues in Data Mining -- General Requirements for Success in a Data Mining Project -- Example of a Data Mining Project: Classify a Bat's Species by Its Sound -- Approach
The Importance of Domain Knowledge -- Postscript -- Why Did Data Mining Arise? -- Caveats With Data Mining Solutions -- References -- Further Reading -- Chapter 3: The Data Mining and Predictive Analytic Process -- Preamble -- The Science of Data Mining/Predictive Analytics -- The Approach to Understanding and Problem Solving -- CRISP-DM -- Business Understanding (Mostly Art) -- Define the Business Objectives of the Data Mining Model -- Assess the Business Environment for Data Mining -- Formulate the Analytical Goals and Objectives of the Project -- Data Understanding (Mostly Science) -- Data Acquisition -- Data Integration -- Data Description -- Data Quality Assessment -- Data Preparation (A Mixture of Art and Science) -- Modeling (A Mixture of Art and Science) -- Steps in the Modeling Phase of CRISP-DM -- Deployment (Mostly Art) -- Closing the Information Loop* (Art) -- SEMMA (used by SAS): -- DMAIC (a Six Sigma approach designed primarily for industrial applications): -- The Art of Data Mining -- Artistic Steps in Data Mining -- Deriving New Variables -- Selecting Predictor Variables -- Postscript -- References -- Chapter 4: Data Understanding and Preparation -- Preamble -- Activities of Data Understanding and Preparation -- Issues That Should Be Resolved -- Basic Issues That Must Be Resolved in Data Understanding (See Fig. 3.1 in Chapter 3) -- Basic Issues That Must Be Resolved in Data Preparation -- Data Understanding -- Data Acquisition -- Query-Based Data Extracts -- High-Level Query Languages -- Low-Level and ODBC Database Connections -- Data Extraction -- Data Description -- Data Assessment -- Data Profiling -- Data Cleansing -- Validating Codes Against Lists of Acceptable Values -- Deleting Particularly "Dirty" Records -- Data Transformation -- Numerical Variables -- Categorical Variables -- Accuracy vs Precision -- Data Imputation
Assumption of Missing Completely at Random (MCAR) -- Assumption of Missing at Random (MAR) -- Techniques for Imputing Data -- Maximum Likelihood Imputation -- Multiple Imputation -- Data Filtering and Smoothing -- Removal of Outliers -- Time-Series Filtering -- Low-Pass Filter -- High-Pass Filter -- Band-Pass Filter -- Data Abstractions -- Other Data Abstractions -- Data Reduction -- Reduction of Dimensionality -- Correlation Coefficients -- CHAID (Chi-Square Automatic Interaction Detection) -- Principal Components Analysis (PCA) -- Gini Index -- Graphical Methods -- Data Sampling -- Data Discretization -- Data Derivation -- Assignment or Derivation of the Target Variable -- Derivation of New Predictor Variables -- Attribute-Oriented Induction of Generalization Variables -- Data Conditioning -- Standardization -- Data Set Balancing -- Under-Sampling -- Over-Sampling -- Weights -- Prior Probabilities -- Segmentation -- Postscript -- References -- Further Reading -- Chapter 5: Feature Selection -- Preamble -- Variables as Features -- Types of Feature Selection -- Feature Ranking Methods -- Gini Index -- Bivariate Methods -- Multivariate Methods -- Stepwise Linear Regression -- Partial Least Squares Regression -- Sensitivity Analysis -- Complex Methods -- Multiple Adaptive Regression Splines (MARS) -- Subset Selection Methods -- Why Use Feature Selection? -- Interactive Menus Interface -- DMRecipe Automated Modeling Interface -- Postscript -- References -- Chapter 6: Accessory Tools for Doing Data Mining -- Preamble -- Data Access Tools -- Structured Query Language (SQL) Tools -- Extract, Transform, and Load (ETL) Capabilities -- Data Exploration Tools -- Basic Descriptive Statistics -- Measures of Location -- Measures of Dispersion -- Range -- Measures of Position -- Measures of Shape -- Robust Measures of Location -- Frequency Tables
Combining Groups (Classes) for Predictive Data Mining -- Slicing/Dicing and Drilling Down into Data Sets/Results Spreadsheets -- Modeling Management Tools -- Data Miner Workspace Templates -- Modeling Analysis Tools -- Feature Selection -- Importance Plots of Variables -- In-place Data Processing (IDP) -- Example: The IDP Facility of STATISTICA Data Miner -- How to Use the SQL -- Rapid Deployment of Predictive Models -- Model Monitors -- Postscript -- Further Reading -- Part II: The Algorithms and Methods in Data Mining and Predictive Analytics and Some Domain Areas -- Chapter 7: Basic Algorithms for Data Mining: A Brief Overview -- Preamble -- Introduction -- STATISTICA Data Miner Recipe (DMRecipe) -- Basic Data Mining Algorithms -- Association Rules -- Neural Networks -- How Does Backpropagation Work? -- Training a Neural Net -- Additional Types of Neural Networks -- Radial Basis Function (RBF) Networks -- Advantages of RBFs -- Disadvantages of RBFs -- Automated Neural Nets -- Generalized Additive Models (GAM) -- Outputs of GAMs -- Interpreting Results of GAMs -- Classification and Regression Trees (CART) -- What Is a Decision Tree? -- Recursive Partitioning -- Pruning Trees -- General CHAID Models -- Generalized EM and k-Means Cluster Analysis-An Overview -- k-Means Clustering -- EM Cluster Analysis -- V-Fold Cross-Validation as Applied to Clustering -- Postscript -- References -- Further Reading -- Chapter 8: Advanced Algorithms for Data Mining -- Preamble -- Introduction -- Advanced Data Mining Algorithms -- Interactive Trees -- Manually Building the Tree -- The Tree Browser -- Advantages of I-Trees -- Building Trees Interactively -- Combining Techniques -- Multivariate Adaptive Regression Splines (MARSplines) -- Basis Functions -- The MARSplines Model -- Categorical Predictors -- Multiple Dependent (Outcome) Variables
MARSplines and Classification Problems -- Model Selection and Pruning -- MARSplines as a Predictor (Feature) Selection Method -- Applications -- The MARSplines Algorithm -- Statistical Learning Theory: Support Vector Machines -- Kernel Functions -- Sequence, Association, and Link Analyses -- Association Rules -- Sequence Analysis -- Link Analysis -- Association Rule Details -- Sequence Analysis Applications -- Link Analysis-Employing Visualization -- Independent Components Analysis (ICA) -- STATISTICA Fast Independent Component Analysis (FICA) -- Kohonen Networks -- Characteristics of a Kohonen Network -- Quality Control Data Mining and Root Cause Analysis -- Postscript -- References -- Further Reading -- Chapter 9: Classification -- Preamble -- What Is Classification? -- Initial Operations in Classification -- Major Issues With Classification -- What Is the Nature of the Data Set to be Classified? -- How Accurate Does the Classification Have to Be? -- How Understandable Do the Classes Have to Be? -- What Is the Relative Importance of Model Accuracy vs Generality? -- Assumptions of Classification Procedures -- Numerical Variables Operate Best -- No Missing Values -- Variables Are Independent in Their Effects on the Target Variable -- Analyzing Imbalanced Data Sets With Machine Learning Programs -- Phases in the Operation of Classification Algorithms -- Advantages and Disadvantages of Common Classification Algorithms -- Decision Trees -- Random Forests -- Rule Induction -- CHAID -- Nearest-Neighbor Classifiers -- Logistic Regression -- Neural Networks -- Naive Bayesian Classifiers -- Which Algorithm Is Best for Classification? -- Automated Analytics-Is it the Wave of the Future? -- Postscript -- References -- Further Reading -- Chapter 10: Numerical Prediction -- Preamble -- Linear Response Analysis and the Assumptions of the Parametric Model
Parametric Statistical Analysis