https://youtu.be/p0TdBqIt3fg?si=n9hl3KGaYCpm0Cm_

 

Summary

 

HDFS (Hadoop Distributed File System): Data storage

- Stores different formats of data on various machines

- 2 major components: NameNode (master), DataNode (slave)

- Splits the data into multiple blocks (128MB by default)
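The block-splitting arithmetic can be sketched in a few lines of plain Python (a toy illustration only; the 128 MB figure is the Hadoop 2+ default and is configurable, and real HDFS also replicates each block across DataNodes):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB (configurable)

def split_into_blocks(file_size_bytes: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the sizes of the blocks a file of the given size is split into."""
    if file_size_bytes <= 0:
        return []
    full_blocks = file_size_bytes // block_size
    remainder = file_size_bytes % block_size
    return [block_size] * full_blocks + ([remainder] if remainder else [])

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))  # 3
```

Note that the last block is allowed to be smaller than the block size, so small files do not waste a full 128 MB.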

 

YARN (Yet Another Resource Negotiator): Cluster resource management

- Handles the cluster of nodes

- Allocates CPU, memory, and other resources to different applications

- 2 major components: ResourceManager (master), NodeManager (slave)
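A toy sketch of the kind of bookkeeping the ResourceManager does when placing containers on nodes (node names and container sizes here are made up; real YARN schedulers also weigh vcores, queues, data locality, and fairness):

```python
def allocate(nodes: dict[str, int], requests: list[int]) -> dict[str, list[int]]:
    """Greedily place container memory requests (in MB) on nodes with enough free memory.

    Illustration only: real YARN scheduling (capacity/fair schedulers) is far
    more sophisticated than this greedy best-fit.
    """
    free = dict(nodes)                       # node -> free memory (MB)
    placement = {name: [] for name in nodes}
    for mem in requests:
        # Pick the node with the most free memory that can fit the request.
        candidates = [n for n, f in free.items() if f >= mem]
        if not candidates:
            continue                         # request would wait (not modeled here)
        target = max(candidates, key=lambda n: free[n])
        free[target] -= mem
        placement[target].append(mem)
    return placement

# Hypothetical 2-node cluster with 4 GB and 2 GB of free memory.
print(allocate({"node1": 4096, "node2": 2048}, [1024, 1024, 2048]))
```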

 

MapReduce: Data Processing

- MapReduce processes large volumes of data in parallel across the distributed cluster
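The classic word-count example sketches the map → shuffle → reduce flow in plain Python (single-process, just to show the programming model; a real Hadoop job would run the map and reduce functions across many machines):

```python
from collections import defaultdict
from itertools import chain

def mapper(line: str):
    """Map phase: emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle phase: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce phase: sum the counts for one word."""
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = chain.from_iterable(mapper(line) for line in lines)
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(counts["the"])  # 3
```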

 

Sqoop, Flume: Data collection and ingestion

- Sqoop is used to transfer data between Hadoop and external datastores such as relational databases and enterprise data warehouses

- Flume is a distributed service for collecting, aggregating, and moving large amounts of log data

 

 

Pig, Hive: Scripting and SQL

- Pig is used to analyze data in Hadoop. It provides a high-level data processing language (Pig Latin) to perform numerous operations on the data

- Hive facilitates reading, writing, and managing large datasets residing in distributed storage using an SQL-like language (HiveQL, the Hive Query Language)

    - 2 major components: Hive command line, JDBC/ODBC driver

    - Provides User Defined Functions (UDF) for data mining, document indexing, log processing, etc.

 

Spark: Real-time data analysis

- Spark is an open-source distributed computing engine for processing and analyzing huge volumes of real-time data (written in Scala)

- Can run up to 100x faster than MapReduce for in-memory workloads

- Provides in-memory computation of data

- Used to process and analyze real-time streaming data such as stock market and banking data
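The in-memory, chained-transformation style can be illustrated with a toy class (note: `MiniRDD` is invented for this sketch and is not Spark's actual RDD API, though the method names mirror PySpark's):

```python
from functools import reduce

class MiniRDD:
    """Toy in-memory dataset with chainable transformations, loosely
    mimicking the style of Spark's RDD API (invented for illustration)."""

    def __init__(self, data):
        self.data = list(data)  # held in memory, not re-read from disk each step

    def map(self, f):
        return MiniRDD(f(x) for x in self.data)

    def filter(self, pred):
        return MiniRDD(x for x in self.data if pred(x))

    def reduce(self, f):
        return reduce(f, self.data)

# Hypothetical stock ticks in cents; sum the gains above a 100.00 threshold.
ticks = MiniRDD([10150, 9900, 10320, 9870])
total = ticks.filter(lambda p: p > 10000).map(lambda p: p - 10000).reduce(lambda a, b: a + b)
print(total)  # 470
```

Keeping intermediate results in memory between transformations, rather than writing to disk after each step as MapReduce does, is the source of Spark's speedup.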

 

Mahout: Machine Learning

- Alternative: PySpark (Spark's MLlib)

 

Ambari: An open-source Apache tool for provisioning, managing, and monitoring Hadoop clusters, including tracking running applications and their statuses

 

Kafka, Storm: Streaming

- Kafka is a distributed streaming platform to store and process streams of records (written in Scala, Java)

    - Builds real-time streaming data pipelines that reliably get data between applications

    - Builds real-time streaming applications that transform or react to streams of data

    - Kafka uses a publish-subscribe messaging system for transferring data from one application to another

- Storm is a processing engine that processes real-time streaming data at very high speed (written in Clojure)

    - Can process over a million records per second per node

    - It can be integrated with Hadoop to achieve higher throughput
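A minimal sketch of Kafka's core idea: an append-only log that decouples producers from consumers, with each consumer tracking its own offset (a toy, single-process stand-in; real Kafka partitions, replicates, and persists these logs across brokers):

```python
class MiniTopic:
    """Toy append-only log: producers append records, and each consumer
    reads independently from its own committed offset (illustration only)."""

    def __init__(self):
        self.log = []       # the ordered record log
        self.offsets = {}   # consumer name -> next offset to read

    def produce(self, record):
        self.log.append(record)

    def consume(self, consumer: str):
        """Return all records this consumer has not yet seen."""
        start = self.offsets.get(consumer, 0)
        records = self.log[start:]
        self.offsets[consumer] = len(self.log)  # commit the new offset
        return records

topic = MiniTopic()
topic.produce({"symbol": "ACME", "price": 101})
topic.produce({"symbol": "ACME", "price": 102})

print(topic.consume("dashboard"))  # both records
topic.produce({"symbol": "ACME", "price": 103})
print(topic.consume("dashboard"))  # only the new record
print(topic.consume("alerts"))     # a new consumer sees all three
```

Because the log is retained rather than deleted on delivery, any number of consumers can read the same stream at their own pace, which is what makes Kafka suitable for fan-out data pipelines.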

 

Ranger, Knox: Security

- Ranger provides centralized security administration, fine-grained authorization, and auditing across Hadoop components

- Knox is a gateway that provides a single, secured access point (perimeter security) for REST/HTTP interactions with Hadoop clusters

 

Oozie: Workflow system

- Workflow scheduler system to manage Hadoop jobs

 
