https://youtu.be/p0TdBqIt3fg?si=n9hl3KGaYCpm0Cm_

 

Summary

 

HDFS (Hadoop Distributed File System): Data storage

- Stores different formats of data on various machines

- 2 major components: NameNode (master), DataNode (slave)

- Splits the data into multiple blocks (128MB by default)
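The block-splitting arithmetic can be sketched in a few lines of plain Python (a toy illustration only; the 128 MB figure is the Hadoop 2+ default and is configurable, and real HDFS also replicates each block across DataNodes):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB (configurable)

def split_into_blocks(file_size_bytes: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the sizes of the blocks a file of the given size is split into."""
    if file_size_bytes <= 0:
        return []
    full_blocks = file_size_bytes // block_size
    remainder = file_size_bytes % block_size
    return [block_size] * full_blocks + ([remainder] if remainder else [])

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))  # 3
```

Note that the last block is allowed to be smaller than the block size, so small files do not waste a full 128 MB.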

 

YARN (Yet Another Resource Negotiator): Cluster resource management

- Handles the cluster of nodes

- Allocates CPU, memory, and other resources to different applications

- 2 major components: ResourceManager (master), NodeManager (slave)
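A toy sketch of the kind of bookkeeping the ResourceManager does when placing containers on nodes (node names and container sizes here are made up; real YARN schedulers also weigh vcores, queues, data locality, and fairness):

```python
def allocate(nodes: dict[str, int], requests: list[int]) -> dict[str, list[int]]:
    """Greedily place container memory requests (in MB) on nodes with enough free memory.

    Illustration only: real YARN scheduling (capacity/fair schedulers) is far
    more sophisticated than this greedy best-fit.
    """
    free = dict(nodes)                       # node -> free memory (MB)
    placement = {name: [] for name in nodes}
    for mem in requests:
        # Pick the node with the most free memory that can fit the request.
        candidates = [n for n, f in free.items() if f >= mem]
        if not candidates:
            continue                         # request would wait (not modeled here)
        target = max(candidates, key=lambda n: free[n])
        free[target] -= mem
        placement[target].append(mem)
    return placement

# Hypothetical 2-node cluster with 4 GB and 2 GB of free memory.
print(allocate({"node1": 4096, "node2": 2048}, [1024, 1024, 2048]))
```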

 

MapReduce: Data Processing

- MapReduce processes large volumes of data in parallel across the distributed cluster
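The classic word-count example sketches the map → shuffle → reduce flow in plain Python (single-process, just to show the programming model; a real Hadoop job would run the map and reduce functions across many machines):

```python
from collections import defaultdict
from itertools import chain

def mapper(line: str):
    """Map phase: emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle phase: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce phase: sum the counts for one word."""
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = chain.from_iterable(mapper(line) for line in lines)
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(counts["the"])  # 3
```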

 

Sqoop, Flume: Data collection and ingestion

- Sqoop is used to transfer data between Hadoop and external datastores such as relational databases and enterprise data warehouses

- Flume is a distributed service for collecting, aggregating, and moving large amounts of log data

 

 

Pig, Hive: Scripting and SQL

- Pig is used to analyze data in Hadoop. It provides a high-level data processing language (Pig Latin) to perform numerous operations on the data

- Hive facilitates reading, writing, and managing large datasets residing in distributed storage using an SQL-like language (HiveQL, the Hive Query Language)

    - 2 major components: Hive command line, JDBC/ODBC driver

    - Provides User Defined Functions (UDF) for data mining, document indexing, log processing, etc.

 

Spark: Real-time data analysis

- Spark is an open-source distributed computing engine for processing and analyzing huge volumes of real-time data (written in Scala)

- Can run up to 100x faster than MapReduce for in-memory workloads

- Provides in-memory computation of data

- Used to process and analyze real-time streaming data such as stock market and banking data
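The in-memory, chained-transformation style can be illustrated with a toy class (note: `MiniRDD` is invented for this sketch and is not Spark's actual RDD API, though the method names mirror PySpark's):

```python
from functools import reduce

class MiniRDD:
    """Toy in-memory dataset with chainable transformations, loosely
    mimicking the style of Spark's RDD API (invented for illustration)."""

    def __init__(self, data):
        self.data = list(data)  # held in memory, not re-read from disk each step

    def map(self, f):
        return MiniRDD(f(x) for x in self.data)

    def filter(self, pred):
        return MiniRDD(x for x in self.data if pred(x))

    def reduce(self, f):
        return reduce(f, self.data)

# Hypothetical stock ticks in cents; sum the gains above a 100.00 threshold.
ticks = MiniRDD([10150, 9900, 10320, 9870])
total = ticks.filter(lambda p: p > 10000).map(lambda p: p - 10000).reduce(lambda a, b: a + b)
print(total)  # 470
```

Keeping intermediate results in memory between transformations, rather than writing to disk after each step as MapReduce does, is the source of Spark's speedup.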

 

Mahout: Machine Learning

- Alternative: PySpark (Spark's MLlib)

 

Ambari: An open-source Apache tool for provisioning, managing, and monitoring Hadoop clusters, including tracking running applications and their statuses

 

Kafka, Storm: Streaming

- Kafka is a distributed streaming platform to store and process streams of records (written in Scala, Java)

    - Builds real-time streaming data pipelines that reliably get data between applications

    - Builds real-time streaming applications that transform or react to streams of data

    - Kafka uses a publish-subscribe messaging system for transferring data from one application to another

- Storm is a processing engine that processes real-time streaming data at very high speed (written in Clojure)

    - Can process over a million records per second per node

    - It can be integrated with Hadoop to achieve higher throughput
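A minimal sketch of Kafka's core idea: an append-only log that decouples producers from consumers, with each consumer tracking its own offset (a toy, single-process stand-in; real Kafka partitions, replicates, and persists these logs across brokers):

```python
class MiniTopic:
    """Toy append-only log: producers append records, and each consumer
    reads independently from its own committed offset (illustration only)."""

    def __init__(self):
        self.log = []       # the ordered record log
        self.offsets = {}   # consumer name -> next offset to read

    def produce(self, record):
        self.log.append(record)

    def consume(self, consumer: str):
        """Return all records this consumer has not yet seen."""
        start = self.offsets.get(consumer, 0)
        records = self.log[start:]
        self.offsets[consumer] = len(self.log)  # commit the new offset
        return records

topic = MiniTopic()
topic.produce({"symbol": "ACME", "price": 101})
topic.produce({"symbol": "ACME", "price": 102})

print(topic.consume("dashboard"))  # both records
topic.produce({"symbol": "ACME", "price": 103})
print(topic.consume("dashboard"))  # only the new record
print(topic.consume("alerts"))     # a new consumer sees all three
```

Because the log is retained rather than deleted on delivery, any number of consumers can read the same stream at their own pace, which is what makes Kafka suitable for fan-out data pipelines.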

 

Ranger, Knox: Security

- Ranger provides centralized security administration, fine-grained authorization, and auditing across Hadoop components

- Knox is a gateway that provides a single, secured access point (perimeter security) for REST/HTTP interactions with Hadoop clusters

 

Oozie: Workflow system

- Workflow scheduler system to manage Hadoop jobs

 
