Data ingestion with Python cookbook: a practical guide helping you ingest, monitor, and identify errors in the data ingestion process



Bibliographic Details
Main Author: Esppenchutz, Glaucia
Format: eBook
Language: English
Published: Packt Publishing, 09.06.2023
Description
Summary: Deploy your own data ingestion pipeline, and orchestrate and monitor it efficiently to prevent data loss and quality issues.

Key Features
- Implement best practices to create a data ingestion pipeline using Python and PySpark
- Automate and orchestrate your data pipelines using Apache Airflow
- Build a monitoring framework while applying the concept of data observability to your pipelines

Book Description
Data Ingestion with Python Cookbook offers a practical way to design and apply data ingestion pipelines, providing real-world examples built with the most reputable open-source tools on the market and shedding light on common questions and obstacles. You will be introduced to designing and working with or without data schemas, and to creating monitored pipelines with Airflow and data observability principles, while following best practices. The book then addresses the challenges of reading different data sources and data formats. You will go on to gain a broad understanding of best practices for logging errors, how to identify and solve them, data orchestration, monitoring, and where to store logs for later consultation. By the end of the book, you will have a complete automated setup for ingesting and monitoring a pipeline, making it easier to plug into the subsequent steps of the ETL process.

What you will learn
- Apply data observability using monitoring tools
- Automate your data ingestion pipeline
- Read analytical and partitioned data, with or without a schema
- Debug and prevent data loss using efficient monitoring and logging
- Apply data access policies using a data governance framework
- Create a data orchestration framework to improve data quality

Who This Book Is For
This book is for data engineers and data enthusiasts who want a better understanding of the process of ingesting data using the most popular tools in the open-source community. For more advanced learners, it covers the theoretical pillars of data governance and gives practical examples of real scenarios data engineers face on a daily basis.

Table of Contents
Introduction to Data Ingestion
Data Access Principals - Accessing Your Data
Data Discovery - Understanding Our Data before Ingesting
Reading CSV and JSON Files and Solving Problems
Ingesting Data from Structured and Unstructured Databases
Using PySpark with Defined and Non-Defined Schemas
Ingesting Analytical Data
Designing Monitored Data Workflows
Putting Everything Together with Airflow
Logging and Monitoring Your Data Ingest in Airflow
Automating Your Data Pipelines
Using Data Observability for Debugging, Error Handling, and Preventing Downtime
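
To illustrate the kind of recipe the book covers, here is a minimal sketch of one of its core topics: reading a CSV file with PySpark using an explicitly defined schema rather than schema inference. This example is not taken from the book; the file path, column names, and types are hypothetical.

# Minimal PySpark sketch: ingest a CSV file with an explicit schema.
# The path "/data/raw/customers.csv" and the columns are hypothetical examples.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

spark = SparkSession.builder.appName("csv_ingestion_example").getOrCreate()

# Declaring the schema up front avoids a costly inferSchema pass and
# surfaces malformed rows early instead of silently reading them as strings.
schema = StructType([
    StructField("customer_id", IntegerType(), nullable=False),
    StructField("customer_name", StringType(), nullable=True),
    StructField("signup_date", DateType(), nullable=True),
])

df = (
    spark.read
    .schema(schema)
    .option("header", "true")
    .option("mode", "DROPMALFORMED")  # drop rows that do not fit the schema
    .csv("/data/raw/customers.csv")
)

df.show(5)

In an orchestrated setup of the kind the book describes, a read like this would typically run as one task inside an Apache Airflow DAG, with logging and data observability checks around it.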