Hands-On Data Preprocessing in Python: Learn how to effectively prepare data for successful data analytics

Get your raw data cleaned up and ready for processing to design better data analytic solutions. Key Features: Develop the skills to perform data cleaning, data integration, data reduction, and data transformation. Make the most of your raw data with powerful data transformation and massaging techniques. P...

Bibliographic Details
Main Author:	Jafari, Roy
Format:	E-book
Language:	English
Published:	Birmingham: Packt Publishing, 2022
Edition:	1
Subjects:	COM018000; COMPUTERS / Data Processing; COMPUTERS / Data Visualization; COMPUTERS / Database Management / Data Warehousing; Computing and Processing; Electronic data processing; Python (Computer program language)

Contents:

Table of Contents: Review of the Core Modules of NumPy and Pandas -- Review of Another Core Module - Matplotlib -- Data - What Is It Really? -- Databases -- Data Visualization -- Prediction -- Classification -- Clustering Analysis -- Data Cleaning Level I - Cleaning Up the Table -- Data Cleaning Level II - Unpacking, Restructuring, and Reformulating the Table -- Data Cleaning Level III - Missing Values, Outliers, and Errors -- Data Fusion and Data Integration -- Data Reduction -- Data Transformation and Massaging -- Case Study 1 - Mental Health in Tech -- Case Study 2 - Predicting COVID-19 Hospitalizations -- Case Study 3 - United States Counties Clustering Analysis -- Summary, Practice Case Studies, and Conclusions
Cover -- Copyright -- Contributors -- Table of Contents -- Preface
Part 1: Technical Needs
Chapter 1: Review of the Core Modules of NumPy and Pandas -- Technical requirements -- Overview of the Jupyter Notebook -- Are we analyzing data via computer programming? -- Overview of the basic functions of NumPy -- The np.arange() function -- The np.zeros() and np.ones() functions -- The np.linspace() function -- Overview of Pandas -- Pandas data access -- Boolean masking for filtering a DataFrame -- Pandas functions for exploring a DataFrame -- Pandas applying a function -- The Pandas groupby function -- Pandas multi-level indexing -- Pandas pivot and melt functions -- Summary -- Exercises
Chapter 2: Review of Another Core Module - Matplotlib -- Technical requirements -- Drawing the main plots in Matplotlib -- Summarizing numerical attributes using histograms or boxplots -- Observing trends in the data using a line plot -- Relating two numerical attributes using a scatterplot -- Modifying the visuals -- Adding a title to visuals and labels to the axis -- Adding legends -- Modifying ticks -- Modifying markers -- Subplots -- Resizing visuals and saving them -- Resizing -- Saving -- Example of Matplotlib assisting data preprocessing -- Summary -- Exercises
Chapter 3: Data - What Is It Really? -- Technical requirements -- What is data? -- Why this definition? -- DIKW pyramid -- Data preprocessing for data analytics versus data preprocessing for machine learning -- The most universal data structure - a table -- Data objects -- Data attributes -- Types of data values -- Analytics standpoint -- Programming standpoint -- Information versus pattern -- Understanding everyday use of the word "information" -- Statistical use of the word "information" -- Statistical meaning of the word "pattern" -- Summary -- Exercises -- References
Chapter 4: Databases -- Technical requirements -- What is a database? -- Understanding the difference between a database and a dataset -- Types of databases -- The differentiating elements of databases -- Relational databases (SQL databases) -- Unstructured databases (NoSQL databases) -- A practical example that requires a combination of both structured and unstructured databases -- Distributed databases -- Blockchain -- Connecting to, and pulling data from, databases -- Direct connection -- Web page connection -- API connection -- Request connection -- Publicly shared -- Summary -- Exercises
Part 2: Analytic Goals
Chapter 5: Data Visualization -- Technical requirements -- Summarizing a population -- Example of summarizing numerical attributes -- Example of summarizing categorical attributes -- Comparing populations -- Example of comparing populations using boxplots -- Example of comparing populations using histograms -- Example of comparing populations using bar charts -- Investigating the relationship between two attributes -- Visualizing the relationship between two numerical attributes -- Visualizing the relationship between two categorical attributes -- Visualizing the relationship between a numerical attribute and a categorical attribute -- Adding visual dimensions -- Example of a five-dimensional scatter plot -- Showing and comparing trends -- Example of visualizing and comparing trends -- Summary -- Exercise
Chapter 6: Prediction -- Technical requirements -- Predictive models -- Forecasting -- Regression analysis -- Linear regression -- Example of applying linear regression to perform regression analysis -- MLP -- How does MLP work? -- Example of applying MLP to perform regression analysis -- Summary -- Exercises
Chapter 7: Classification -- Technical requirements -- Classification models -- Example of designing a classification model -- Classification algorithms -- KNN -- Example of using KNN for classification -- Decision Trees -- Example of using Decision Trees for classification -- Summary -- Exercises
Chapter 8: Clustering Analysis -- Technical requirements -- Clustering model -- Clustering example using a two-dimensional dataset -- Clustering example using a three-dimensional dataset -- K-Means algorithm -- Using K-Means to cluster a two-dimensional dataset -- Using K-Means to cluster a dataset with more than two dimensions -- Centroid analysis -- Summary -- Exercises
Part 3: The Preprocessing
Chapter 9: Data Cleaning Level I - Cleaning Up the Table -- Technical requirements -- The levels, tools, and purposes of data cleaning - a roadmap to chapters 9, 10, and 11 -- Purpose of data analytics -- Tools for data analytics -- Levels of data cleaning -- Mapping the purposes and tools of analytics to the levels of data cleaning -- Data cleaning level I - cleaning up the table -- Example 1 - unwise data collection -- Example 2 - reindexing (multi-level indexing) -- Example 3 - intuitive but long column titles -- Summary -- Exercises
Chapter 10: Data Cleaning Level II - Unpacking, Restructuring, and Reformulating the Table -- Technical requirements -- Example 1 - unpacking columns and reformulating the table -- Unpacking FileName -- Unpacking Content -- Reformulating a new table for visualization -- The last step - drawing the visualization -- Example 2 - restructuring the table -- Example 3 - level I and II data cleaning -- Level I cleaning -- Level II cleaning -- Doing the analytics - using linear regression to create a predictive model -- Summary -- Exercises
Chapter 11: Data Cleaning Level III - Missing Values, Outliers, and Errors -- Technical requirements -- Missing values -- Detecting missing values -- Example of detecting missing values -- Causes of missing values -- Types of missing values -- Diagnosis of missing values -- Dealing with missing values -- Outliers -- Detecting outliers -- Dealing with outliers -- Errors -- Types of errors -- Dealing with errors -- Detecting systematic errors -- Summary -- Exercises
Chapter 12: Data Fusion and Data Integration -- Technical requirements -- What are data fusion and data integration? -- Data fusion versus data integration -- Directions of data integration -- Frequent challenges regarding data fusion and integration -- Challenge 1 - entity identification -- Challenge 2 - unwise data collection -- Challenge 3 - index mismatched formatting -- Challenge 4 - aggregation mismatch -- Challenge 5 - duplicate data objects -- Challenge 6 - data redundancy -- Example 1 (challenges 3 and 4) -- Example 2 (challenges 2 and 3) -- Example 3 (challenges 1, 3, 5, and 6) -- Checking for duplicate data objects -- Designing the structure for the result of data integration -- Filling songIntegrate_df from billboard_df -- Filling songIntegrate_df from songAttribute_df -- Filling songIntegrate_df from artist_df -- Checking for data redundancy -- The analysis -- Example summary -- Summary -- Exercise
Chapter 13: Data Reduction -- Technical requirements -- The distinction between data reduction and data redundancy -- The objectives of data reduction -- Types of data reduction -- Performing numerosity data reduction -- Random sampling -- Stratified sampling -- Random over/undersampling -- Performing dimensionality data reduction -- Linear regression as a dimension reduction method -- Using a decision tree as a dimension reduction method -- Using random forest as a dimension reduction method -- Brute-force computational dimension reduction -- PCA -- Functional data analysis -- Summary -- Exercises
Chapter 14: Data Transformation and Massaging -- Technical requirements -- The whys of data transformation and massaging -- Data transformation versus data massaging -- Normalization and standardization -- Binary coding, ranking transformation, and discretization -- Example one - binary coding of nominal attribute -- Example two - binary coding or ranking transformation of ordinal attributes -- Example three - discretization of numerical attributes -- Understanding the types of discretization -- Discretization - the number of cut-off points -- A summary - from numbers to categories and back -- Attribute construction -- Example - construct one transformed attribute from two attributes -- Feature extraction -- Example - extract three attributes from one attribute -- Example - Morphological feature extraction -- Feature extraction examples from the previous chapters -- Log transformation -- Implementation - doing it yourself -- Implementation - the working module doing it for you -- Smoothing, aggregation, and binning -- Smoothing -- Aggregation -- Binning -- Summary -- Exercise
Part 4: Case Studies
Chapter 15: Case Study 1 - Mental Health in Tech -- Technical requirements -- Introducing the case study -- The audience of the results of analytics -- Introduction to the source of the data -- Integrating the data sources -- Cleaning the data -- Detecting and dealing with outliers and errors -- Detecting and dealing with missing values -- Analyzing the data -- Analysis question one - is there a significant difference between the mental health of employees across the attribute of gender? -- Analysis question two - is there a significant difference between the mental health of employees across the Age attribute? -- Analysis question three - do more supportive companies have mentally healthier employees? -- Analysis question four - does the attitude of individuals toward mental health influence their mental health and their seeking of treatments?