Big Data Pipeline: An Overview of Ingestion and Preparation Tools

Mohammad K. Yaghi, Mohammad Haji, Mohammad Thaher, Ibrahim Kassem

Abstract


As the digital landscape evolves, the exponential growth of data across domains such as healthcare, smart cities, and the Internet of Things (IoT) necessitates advanced tools for efficient data ingestion and preparation. Big Data ingestion involves collecting and transferring data from diverse sources into centralized systems, while preparation ensures that data is cleaned, transformed, and made ready for analysis. This paper presents a comprehensive review of recent research and technologies in Big Data ingestion and preparation, emphasizing the importance of selecting appropriate tools based on project-specific requirements such as data volume, format, and latency. Tools including Apache Kafka, NiFi, Flume, Sqoop, and Spark are critically analyzed for their roles in batch and stream ingestion, real-time processing, and data transformation. The study further explores architectural frameworks, performance metrics, and challenges such as unstructured data handling, real-time governance, and integration complexity. The paper concludes with emerging trends and research directions, contributing to a better understanding of scalable and adaptive Big Data pipelines in modern data-intensive environments.
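The ingestion-then-preparation flow described above can be sketched in plain Python. This is a minimal illustration, not code from the paper: in practice, tools such as Kafka or NiFi would perform the ingestion stage and Spark the transformation stage, and the sensor-event schema used here is hypothetical.

```python
import json

# Hypothetical raw events from heterogeneous sources, standing in for
# what an ingestion tool (e.g. Kafka, NiFi, Flume) would deliver.
raw_events = [
    '{"sensor_id": "s1", "temp_c": "21.5"}',
    '{"sensor_id": "s2", "temp_c": null}',  # incomplete reading
    'not valid json',                       # malformed record
    '{"sensor_id": "s3", "temp_c": "19.0"}',
]

def ingest(lines):
    """Ingestion stage: parse each record, skipping malformed input."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            # A production pipeline would route this to a dead-letter queue.
            continue

def prepare(events):
    """Preparation stage: drop incomplete records and normalize types."""
    for event in events:
        if event.get("temp_c") is None:
            continue
        yield {"sensor_id": event["sensor_id"],
               "temp_c": float(event["temp_c"])}

clean = list(prepare(ingest(raw_events)))
print(clean)  # only the two valid, normalized records survive
```

The two generator stages mirror the pipeline architecture the review discusses: ingestion tolerates dirty input from diverse sources, while preparation enforces the schema and types that downstream analysis expects.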

DOI: 10.24897/acn.64.68.aasrj720253




This work is licensed under a Creative Commons Attribution 3.0 License.

American Academic & Scholarly Research Journal

Copyright © American Academic & Scholarly Research Journal 2023