Data Ingestion with Python Cookbook: A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process

Deploy, orchestrate, and monitor your data ingestion pipeline efficiently to prevent data loss and quality issues.

Key Features:
  • Harness best practices to create a Python and PySpark data ingestion pipeline
  • Seamlessly automate and orchestrate your data pipelines using Apache Airflow
  • Build a monitoring frame...

Detailed description

Bibliographic Details
Main Author: Esppenchutz, Gláucia
Format: eBook
Language: English
Published: Birmingham: Packt Publishing, 2023
Edition: 1
Subjects:
ISBN: 9781837633098, 1837633096, 183763260X, 9781837632602
Online access: Full text
Contents:
  • Table of Contents: Introduction to Data Ingestion -- Principals of Data Access – Accessing your Data -- Data Discovery – Understanding Our Data Before Ingesting It -- Reading CSV and JSON Files and Solving Problems -- Ingesting Data from Structured and Unstructured Databases -- Using PySpark with Defined and Non-Defined Schemas -- Ingesting Analytical Data -- Designing Monitored Data Workflows -- Putting Everything Together with Airflow -- Logging and Monitoring Your Data Ingest in Airflow -- Automating Your Data Ingestion Pipelines -- Using Data Observability for Debugging, Error Handling, and Preventing Downtime
  • There's more… -- Further reading -- Chapter 3: Data Discovery - Understanding Our Data before Ingesting It -- Technical requirements -- Documenting the data discovery process -- Getting ready -- How to do it… -- How it works… -- Configuring OpenMetadata -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Connecting OpenMetadata to our database -- Getting ready -- How to do it… -- How it works… -- Further reading -- Other tools -- Chapter 4: Reading CSV and JSON Files and Solving Problems -- Technical requirements -- Reading a CSV file -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Reading a JSON file -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Creating a SparkSession for PySpark -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Using PySpark to read CSV files -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Using PySpark to read JSON files -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Further reading -- Chapter 5: Ingesting Data from Structured and Unstructured Databases -- Technical requirements -- Configuring a JDBC connection -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Ingesting data from a JDBC database using SQL -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Connecting to a NoSQL database (MongoDB) -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Creating our NoSQL table in MongoDB -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Ingesting data from MongoDB using PySpark -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Further reading
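Chapter 4 above covers reading CSV and JSON files and solving common problems. Purely as an illustration (the book's recipes use PySpark alongside plain Python; this sketch is not taken from the book, and the helper names `read_csv_rows` and `read_json_records` are invented here), a minimal standard-library example of tolerating one typical problem, rows with missing fields:

```python
import csv
import io
import json

def read_csv_rows(text):
    """Parse CSV text into a list of dicts, skipping rows with missing fields."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for row in reader:
        # A common ingestion problem: short rows yield None values -- skip them.
        if None in row.values():
            continue
        rows.append(row)
    return rows

def read_json_records(text):
    """Parse a JSON array of records, tolerating a single object as well."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]

csv_text = "id,name\n1,alice\n2\n3,carol\n"
print(read_csv_rows(csv_text))  # the row "2" is dropped: it has no name field
print(read_json_records('[{"id": 1}, {"id": 2}]'))
```

PySpark's `spark.read.csv` offers similar tolerance through its mode options, which the chapter's recipes explore in depth.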
  • There's more… -- See also -- Using notification operators -- Getting ready -- How to do it… -- How it works… -- There's more… -- Using SQL operators for data quality -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Further reading -- Chapter 11: Automating Your Data Ingestion Pipelines -- Technical requirements -- Installing and running Airflow -- Scheduling daily ingestions -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Scheduling historical data ingestion -- Getting ready -- How to do it… -- How it works… -- There's more… -- Scheduling data replication -- Getting ready -- How to do it… -- How it works… -- There's more… -- Setting up the schedule_interval parameter -- Getting ready -- How to do it… -- How it works… -- See also -- Solving scheduling errors -- Getting ready -- How to do it… -- How it works… -- There's more… -- Further reading -- Chapter 12: Using Data Observability for Debugging, Error Handling, and Preventing Downtime -- Technical requirements -- Docker images -- Setting up StatsD for monitoring -- Getting ready -- How to do it… -- How it works… -- See also -- Setting up Prometheus for storing metrics -- Getting ready -- How to do it… -- How it works… -- There's more… -- Setting up Grafana for monitoring -- Getting ready -- How to do it… -- How it works… -- There's more… -- Creating an observability dashboard -- Getting ready -- How to do it… -- How it works… -- There's more… -- Setting custom alerts or notifications -- Getting ready -- How to do it… -- How it works… -- Further reading -- Index -- Other Books You May Enjoy
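Chapter 12 above sets up StatsD, Prometheus, and Grafana for observability. As a hedged sketch (not the book's code; the helpers `statsd_packet` and `send_metric` are hypothetical names introduced here), the plain-text StatsD line protocol that tools like Airflow emit is simple enough to show directly; the host and port below assume StatsD's conventional UDP port 8125:

```python
import socket

def statsd_packet(name, value, metric_type, sample_rate=None):
    """Format a metric in the plain-text StatsD line protocol: name:value|type."""
    packet = f"{name}:{value}|{metric_type}"
    if sample_rate is not None:
        packet += f"|@{sample_rate}"  # optional sampling suffix
    return packet

def send_metric(packet, host="localhost", port=8125):
    """StatsD listens on UDP; sending is fire-and-forget."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(packet.encode("ascii"), (host, port))
    sock.close()

# A counter incremented once per ingested file, and a gauge for file size:
print(statsd_packet("ingest.files_processed", 1, "c"))      # ingest.files_processed:1|c
print(statsd_packet("ingest.file_size_bytes", 10240, "g"))  # ingest.file_size_bytes:10240|g
```

Prometheus then scrapes these metrics (via an exporter) and Grafana charts them, which is the pipeline the chapter assembles.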
  • How it works… -- There's more… -- See also -- Creating standardized logs -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Monitoring our data ingest file size -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Logging based on data -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Retrieving SparkSession metrics -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Further reading -- Chapter 9: Putting Everything Together with Airflow -- Technical requirements -- Installing Airflow -- Configuring Airflow -- Getting ready -- How to do it… -- How it works… -- See also -- Creating DAGs -- Getting ready -- How to do it… -- How it works… -- There is more… -- See also -- Creating custom operators -- Getting ready -- How to do it… -- How it works… -- There is more… -- See also -- Configuring sensors -- Getting ready -- How to do it… -- How it works… -- See also -- Creating connectors in Airflow -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Creating parallel ingest tasks -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Defining ingest-dependent DAGs -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Further reading -- Chapter 10: Logging and Monitoring Your Data Ingest in Airflow -- Technical requirements -- Installing and running Airflow -- Creating basic logs in Airflow -- Getting ready -- How to do it… -- How it works… -- See also -- Storing log files in a remote location -- Getting ready -- How to do it… -- How it works… -- See also -- Configuring logs in airflow.cfg -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Designing advanced monitoring -- Getting ready -- How to do it… -- How it works
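Chapter 8's recipe on creating standardized logs can be hinted at with Python's standard `logging` module. This JSON-formatter sketch is an illustration under assumptions, not the book's code; the names `JsonFormatter` and `make_logger` are invented here:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit every record as one JSON object so log collectors can parse it."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

def make_logger(name):
    """Build a logger whose records all share the same machine-readable shape."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

logger = make_logger("ingest")
logger.info("read %d rows from %s", 1000, "events.csv")
```

Standardizing the log shape up front is what makes the later monitoring and data-based logging recipes composable.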
  • Cover -- Title Page -- Copyright and Credits -- Dedications -- Contributors -- Table of Contents -- Preface -- Part 1: Fundamentals of Data Ingestion -- Chapter 1: Introduction to Data Ingestion -- Technical requirements -- Setting up Python and its environment -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Installing PySpark -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Configuring Docker for MongoDB -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Configuring Docker for Airflow -- Getting ready -- How to do it… -- How it works… -- See also -- Creating schemas -- Getting ready -- How to do it… -- How it works… -- See also -- Applying data governance in ingestion -- Getting ready -- How to do it… -- How it works… -- See also -- Implementing data replication -- Getting ready -- How to do it… -- How it works… -- There's more… -- Further reading -- Chapter 2: Principals of Data Access - Accessing Your Data -- Technical requirements -- Implementing governance in a data access workflow -- Getting ready -- How to do it… -- How it works… -- See also -- Accessing databases and data warehouses -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Accessing SSH File Transfer Protocol (SFTP) files -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Retrieving data using API authentication -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Managing encrypted files -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Accessing data from AWS using S3 -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Accessing data from GCP using Cloud Storage -- Getting ready -- How to do it… -- How it works
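Chapter 1's "Creating schemas" recipe works with PySpark schemas; purely as an illustrative stand-in (not the book's approach), the idea of validating records against a declared schema can be sketched with a plain dict. The helper `validate_record` and the schema shape below are invented for this sketch:

```python
def validate_record(record, schema):
    """Return a list of problems; an empty list means the record matches.

    schema maps field name -> (expected type, required flag).
    """
    problems = []
    for field, (ftype, required) in schema.items():
        if field not in record:
            if required:
                problems.append(f"missing required field: {field}")
            continue
        if not isinstance(record[field], ftype):
            problems.append(f"{field}: expected {ftype.__name__}, "
                            f"got {type(record[field]).__name__}")
    return problems

schema = {"user_id": (int, True), "email": (str, True), "age": (int, False)}
print(validate_record({"user_id": 1, "email": "a@b.com"}, schema))  # []
print(validate_record({"user_id": "1"}, schema))
```

In PySpark the same intent is expressed with `StructType`/`StructField` definitions, which the recipe covers.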
  • Chapter 6: Using PySpark with Defined and Non-Defined Schemas -- Technical requirements -- Applying schemas to data ingestion -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Importing structured data using a well-defined schema -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Importing unstructured data without a schema -- Getting ready… -- How to do it… -- How it works… -- Ingesting unstructured data with a well-defined schema and format -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Inserting formatted SparkSession logs to facilitate your work -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Further reading -- Chapter 7: Ingesting Analytical Data -- Technical requirements -- Ingesting Parquet files -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Ingesting Avro files -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Applying schemas to analytical data -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Filtering data and handling common issues -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Ingesting partitioned data -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Applying reverse ETL -- Getting ready -- How to do it… -- How it works… -- There's more… -- See also -- Selecting analytical data for reverse ETL -- Getting ready -- How to do it… -- How it works… -- See also -- Further reading -- Part 2: Structuring the Ingestion Pipeline -- Chapter 8: Designing Monitored Data Workflows -- Technical requirements -- Inserting logs -- Getting ready -- How to do it… -- How it works… -- See also -- Using log-level types -- Getting ready -- How to do it
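Chapter 7's "Ingesting partitioned data" recipe relies on the Hive-style `key=value` directory layout that analytical tools such as PySpark write and read. As a small sketch (the helper `partition_values` is hypothetical, not taken from the book), recovering partition columns from such a path looks like this:

```python
def partition_values(path):
    """Extract Hive-style key=value partition pairs from a file path.

    e.g. "sales/year=2023/month=05/part-0001.parquet"
         -> {"year": "2023", "month": "05"}
    """
    values = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values

print(partition_values("sales/year=2023/month=05/part-0001.parquet"))
# {'year': '2023', 'month': '05'}
```

Engines use exactly these path-encoded values for partition pruning, which is why partitioned ingestion can skip most of a dataset.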