YARN Overview

  • YARN = Yet Another Resource Negotiator
  • Is not only MapReduce 2.0, but also…
  • A framework to develop and/or execute distributed processing applications
  • Examples: MapReduce, Spark, Apache Hama, Apache Giraph
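
To make the "resource negotiator" idea concrete, here is a toy sketch in which several frameworks request containers from one shared memory pool. The class `ToyNegotiator` and its methods are hypothetical names invented for illustration. They are not part of the YARN API, and real YARN also tracks CPU, nodes and queues.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of a resource negotiator; NOT the real YARN API.
class ToyNegotiator {
    private int freeMemoryMb;
    private final List<String> granted = new ArrayList<>();

    ToyNegotiator(int totalMemoryMb) {
        this.freeMemoryMb = totalMemoryMb;
    }

    // Grants a container only if enough memory is still free.
    boolean request(String application, int memoryMb) {
        if (memoryMb <= 0 || memoryMb > freeMemoryMb) {
            return false;
        }
        freeMemoryMb -= memoryMb;
        granted.add(application + ":" + memoryMb);
        return true;
    }

    int freeMemoryMb() {
        return freeMemoryMb;
    }

    List<String> granted() {
        return granted;
    }
}
```

With a 4096 MB pool, a MapReduce job asking for 2048 MB and a Spark application asking for 1024 MB are both granted, while a further 2048 MB request is refused because only 1024 MB remains.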

YARN

Providing H.A. in YARN

  • Recovery
  • Failover
  • Stateless

Failure Model #1: Recovery

Failure Model #2: Failover

Failure Model #3: Stateless
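
A minimal sketch of the recovery model, assuming that "recovery" means a restarted resource manager reloads application state from a persistent store. `StateStore` and `ToyResourceManager` are hypothetical names, and the in-memory map merely stands in for durable storage such as HDFS or ZooKeeper.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a durable store (for example HDFS or ZooKeeper).
class StateStore {
    private final Map<String, String> saved = new HashMap<>();

    void save(String appId, String state) {
        saved.put(appId, state);
    }

    Map<String, String> loadAll() {
        return new HashMap<>(saved);
    }
}

// Toy resource manager: every state change is written through to the store.
class ToyResourceManager {
    private final StateStore store;
    private final Map<String, String> apps;

    ToyResourceManager(StateStore store) {
        this.store = store;
        // Recovery step: rebuild in-memory state from whatever was persisted.
        this.apps = store.loadAll();
    }

    void submit(String appId) {
        update(appId, "SUBMITTED");
    }

    void markRunning(String appId) {
        update(appId, "RUNNING");
    }

    String stateOf(String appId) {
        return apps.get(appId);
    }

    private void update(String appId, String state) {
        store.save(appId, state); // persist first
        apps.put(appId, state);   // then update the in-memory view
    }
}
```

Creating a second `ToyResourceManager` over the same `StateStore` after the first one "crashes" yields the state the first one persisted. A failover design would keep such a second instance on standby rather than starting it after the crash.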

What storage to use?

  • Apache proposed

    Hadoop Distributed File System (HDFS)

    Fault-tolerant, large datasets, streaming access to data and more
  • ZooKeeper

    Highly reliable distributed coordination

    Wait-free, FIFO client ordering, linearizable writes and more

Apache Spark In-Memory Data Processing

  • Why Spark
  • Introduction
  • Basics
  • Hands-on
  • Installation
  • Examples

Why Spark?

  • Most machine learning algorithms are iterative: each iteration can improve the result
  • With a disk-based approach, each iteration's output is written to disk, which makes the process slow
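
The iterative pattern can be sketched in plain Java, without Spark, as a gradient descent that estimates the mean of a dataset. `IterativeMean` is a hypothetical example class. The point is only that every pass reuses the same in-memory list, whereas a disk-based engine would write and re-read intermediate results between passes.

```java
import java.util.List;

// Hypothetical example: gradient descent that converges to the mean of the data.
class IterativeMean {
    static double estimate(List<Double> data, int iterations, double learningRate) {
        double m = 0.0; // initial guess
        for (int i = 0; i < iterations; i++) {
            // Each iteration scans the same in-memory dataset again.
            double gradient = 0.0;
            for (double x : data) {
                gradient += (m - x);
            }
            gradient /= data.size();
            // Step against the gradient; the estimate improves every pass.
            m -= learningRate * gradient;
        }
        return m;
    }
}
```

With a learning rate of 0.5 the distance to the true mean halves on every iteration, so the cost of re-reading the input on each pass would dominate the run time if the data were not kept in memory.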

About Apache Spark

  • Initially started at UC Berkeley in 2009
  • Fast, general-purpose cluster computing system
  • Claims up to 10x (on disk) to 100x (in memory) faster execution than Hadoop MapReduce
  • Especially popular for running iterative machine learning algorithms
  • Provides high-level APIs in Java, Scala, and Python
  • Integrates with Hadoop and its ecosystem and can read existing Hadoop data
  • http://spark.apache.org/

Spark Stack

  • Spark SQL

    For SQL and structured data processing
  • MLlib

    Machine Learning Algorithms
  • GraphX

    Graph Processing
  • Spark Streaming

    Stream processing of live data streams

Execution Flow

Spark Terminology

  • Application Jar
  • Driver Program
  • Cluster Manager
  • Deploy Mode

Cluster Deployment

  • Standalone Deploy Mode
  • Amazon EC2
  • Apache Mesos
  • Hadoop YARN
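
Each cluster manager is selected through the master URL passed to `setMaster` (or to `spark-submit --master`). The helper below is a hypothetical convenience class, not part of Spark, that only builds those strings. It assumes an EC2-hosted cluster runs the standalone manager and therefore uses a `spark://` URL. Releases from the 1.x line also accepted `yarn-client` and `yarn-cluster` for YARN.

```java
// Hypothetical helper that builds Spark master URL strings; not part of Spark itself.
class MasterUrls {
    // Run locally with the given number of worker threads.
    static String local(int threads) {
        return "local[" + threads + "]";
    }

    // Standalone deploy mode; the standalone master listens on port 7077 by default.
    static String standalone(String host, int port) {
        return "spark://" + host + ":" + port;
    }

    // Apache Mesos; the Mesos master listens on port 5050 by default.
    static String mesos(String host, int port) {
        return "mesos://" + host + ":" + port;
    }

    // Hadoop YARN; the cluster location comes from the Hadoop configuration.
    static String yarn() {
        return "yarn";
    }
}
```

For example, `new SparkConf().setMaster(MasterUrls.standalone("master-host", 7077))` would target a standalone cluster whose master runs on the placeholder host `master-host`.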

Monitoring

Monitoring – Stages


Standalone (Java)

import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
  public static void main(String[] args) {
    // Should be some file on your system
    String logFile = "YOUR_SPARK_HOME/README.md";
    SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // cache() keeps the lines in memory so that both counts reuse them
    JavaRDD<String> logData = sc.textFile(logFile).cache();

    // Count the lines that contain the letter "a"
    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    // Count the lines that contain the letter "b"
    long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
    sc.stop();
  }
}
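
To check what SimpleApp should print for a small input without starting Spark, the same filter-and-count logic can be sketched with `java.util.stream`. `LineCounter` is a hypothetical helper for illustration. It runs in a single JVM and says nothing about how Spark distributes the work.

```java
import java.util.List;

// Hypothetical single-JVM equivalent of filter(...).count() on an RDD of lines.
class LineCounter {
    // Counts the lines that contain the given substring.
    static long countContaining(List<String> lines, String needle) {
        return lines.stream()
                    .filter(s -> s.contains(needle))
                    .count();
    }
}
```

For the lines "alpha", "beta", "gamma", "delta" and "xyz", four lines contain "a" and one contains "b".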

Day 2: Data Ingestion

  • Sqoop: data migration from relational databases (DBMS/RDBMS)
  • Flume: real-time data ingestion (Twitter, LinkedIn, Facebook)
Copyright ©2015 Hub4Tech.com, All Rights Reserved. Hub4Tech™ is registered trademark of Hub4tech Portal Services Pvt. Ltd.