SlideShare a Scribd company logo
1 of 29
Download to read offline
The Zoo Expands	

Labrador 💛 Elephant,Thanks to Hamster
Milind Bhandarkar	

Chief Scientist, Pivotal Software, Inc.
About Me
• http://www.linkedin.com/in/milindb	

• Founding member of Hadoop team atYahoo! [2005-2010]	

• Contributor to Apache Hadoop since v0.1	

• Built and led Grid SolutionsTeam atYahoo! [2007-2010]	

• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)	

• Center for Development of Advanced Computing (C-DAC), National Center
for Supercomputing Applications (NCSA), Center for Simulation of Advanced
Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by
QLogic),Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)
Hamster
• Hadoop and MPI on the
same cluster	

• Runtime for OpenMPI
applications onYARN	

• Available on Pivotal HD
Why MPI ?
•Hadoop Dataflow paradigms (MapReduce,
TeZ etc) not suitable for iterative
applications	

•Message Passing Interface (MPI)	

•Mature standard	

•Used extensively in HPC	

•Huge ecosystem
MPI in Science & Engg
Earth Atmosphere
Chemistry
Biology
Math Nuclear
MPI in Industry
Mechanical ar
Finance/bankOil Exploration Cryptography
Spacecraft
OpenMPI
•Mature Open Source implementation of MPI
3.0 Standard (mpi-forum.org)	

•New BSD license	

•30+ contributing organizations from
academia, research and industry	

•http://open-mpi.org
OpenMPI Architecture
Pluggable
Hamster Design
•YARN as Resource Manager	

•Hamster Application Manager	

•Manages MPI jobs	

•(tries to) Implement Gang-Scheduling	

•Leverages OMPI/ORTE strengths	

•Wire-up,Task monitoring, Fast Interconnect
Hamster Architecture
Resource Manager
Scheduler
AMService
Node Manager Node Manager Node Manager
…
Proc/Container
Framework Daemon NSMPI Scheduler HNP
MPI AM
Proc/Container
…RM-AM
AM-NM
RM-NodeManager
Client
Client-RM
Aux Srvcs
Proc/Container
Framework Daemon NS
Proc/Container
…
Aux Srvcs
RM-
NodeManager
Hamster AppMaster
• Master daemon for MPI ( similar to JobTracker in
MapReduce)	

• Implements and participates in theYARN-RM App
lifecycle protocol	

• Maintains heartbeat with RM to ensure liveness	

• MPI Scheduler - Negotiates resource allocation with
YARN-RM	

• Head Node Process (HNP) - manages job execution
Hamster Node Service
•User-level daemon per MPI job	

•Manages task execution	

•Coarse-grained container management	

•Bootstrapped byYARN-NM	

•Implemented asYARN Auxiliary Service
Why GraphLab on
Hadoop ?
•Graph Analytics & Machine Learning only
one stage in E2E data pipeline	

•ETL/Preprocessing	

•Building Graphs from fact & dimension
tables	

•Publishing analytics results, post-processing
GraphLab 2.2
•Communication patterns based on Data	

•SeveralToolkits (Graph Analytics + ML
Algorithms) available	

•Graph-Programming API	

•Uses MPI for communication
Pivotal HD
HDFS
HBase Pig, Hive,
Mahout
Map
Reduce
Sqoop Flume
Resource
Management
& Workflow
Yarn
Zookeeper
Apache Pivotal
Command
Center
Configure,
Deploy, Monitor,
Manage
Spring XD
Pivotal HD
Enterprise
Spring
Xtension
Framework
Catalog
Services
Query
Optimizer
Dynamic Pipelining
ANSI SQL + Analytics
HAWQ – Advanced
Database Services
Distributed
In-memory
Store
Query
Transactions
Ingestion
Processing
Hadoop Driver –
Parallel with Compaction
ANSI SQL + In-Memory
GemFire XD – Real-Time
Database Services
MADlib Algorithms
Oozie
Virtual
Extensions
Graphlab,
Open MPI
Performance
Test Environment
•Pivotal Analytics Workbench Cluster	

•Pivotal HD 1.1 (Apache Hadoop 2.0.5)	

•Hamster - 1.0, OpenMPI-1.7.2	

•515 nodes	

•2x6-core Westmere, 48GB RAM, 12x2TB
SATA, Mellanox FDR Infiniband
Null Job
•Measures overhead of launching MPI jobs	

•Tests scalability of resource allocation,
launching and wire-up	

•Sub-linear scalability (slightly worse than
O(logN)	

•Overhead of launching 15000 processes = 1
minute
Total RuntimeTime(Sec.)
5
18.75
32.5
46.25
60
Process number
0 4000 8000 12000 16000
E2E time
AllocationTimeTime(Sec.)
1
2.25
3.5
4.75
6
Number of Processes
0 4000 8000 12000 16000
Allocation Time
LaunchTimeTime(Sec.)
0
7.5
15
22.5
30
Number of processes
0 4000 8000 12000 16000
Launch Time
Comparison with
OpenMPI
•HPL (HP Linpack forTop-500)	

•Number of processes 50—1000	

•Hamster 1% slower than OpenMPI
HPL - Hamster vs
OpenMPI
Time(Sec.)
0
30
60
90
120
1000 500 200 50
GraphLab ALS
•Wikipedia dataset	

•4.3 M terms, 3.3M documents, 513M
occurrences	

•17 Processes	

•5 Iterations
GraphLab ALS
Time(Sec.)
0
335
670
1005
1340
Hamster OpenMPI
GraphLab PageRank
•Twitter Dataset	

•4.1 M nodes, 1.4 B edges	

•Data Size : 26GB	

•NP = 17	

•50 iterations: 297 seconds	

•100 iterations: 339 seconds
Questions?

More Related Content

What's hot

Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop EcosystemLior Sidi
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemCloudera, Inc.
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
 
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigHivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigDataWorks Summit/Hadoop Summit
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big DataDataWorks Summit
 
New Directions for Mahout
New Directions for MahoutNew Directions for Mahout
New Directions for MahoutTed Dunning
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Uwe Printz
 
Apache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to ApacheApache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to ApachePivotalOpenSourceHub
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksDataWorks Summit
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsHortonworks
 
Summary machine learning and model deployment
Summary machine learning and model deploymentSummary machine learning and model deployment
Summary machine learning and model deploymentNovita Sari
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferretAndrii Gakhov
 
Dealing with an Upside Down Internet
Dealing with an Upside Down InternetDealing with an Upside Down Internet
Dealing with an Upside Down InternetMapR Technologies
 

What's hot (20)

Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigHivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
February 2014 HUG : Hive On Tez
February 2014 HUG : Hive On TezFebruary 2014 HUG : Hive On Tez
February 2014 HUG : Hive On Tez
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Hadoop to spark_v2
Hadoop to spark_v2Hadoop to spark_v2
Hadoop to spark_v2
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big Data
 
New Directions for Mahout
New Directions for MahoutNew Directions for Mahout
New Directions for Mahout
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
 
What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS
 
Hadoop to spark-v2
Hadoop to spark-v2Hadoop to spark-v2
Hadoop to spark-v2
 
Apache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to ApacheApache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to Apache
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
Summary machine learning and model deployment
Summary machine learning and model deploymentSummary machine learning and model deployment
Summary machine learning and model deployment
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Dealing with an Upside Down Internet
Dealing with an Upside Down InternetDealing with an Upside Down Internet
Dealing with an Upside Down Internet
 
Apache drill
Apache drillApache drill
Apache drill
 

Viewers also liked

Zoo Information Management System (ZIMS) for Anna Zoological Park, Chennai, I...
Zoo Information Management System (ZIMS) for Anna Zoological Park, Chennai, I...Zoo Information Management System (ZIMS) for Anna Zoological Park, Chennai, I...
Zoo Information Management System (ZIMS) for Anna Zoological Park, Chennai, I...University College of Jaffna
 
My zoo project final
My zoo project finalMy zoo project final
My zoo project finalIrish3
 
Hadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant birdHadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant birdKevin Weil
 
Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)Kevin Weil
 
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)Kevin Weil
 
Big Data at Twitter, Chirp 2010
Big Data at Twitter, Chirp 2010Big Data at Twitter, Chirp 2010
Big Data at Twitter, Chirp 2010Kevin Weil
 
Hadoop and pig at twitter (oscon 2010)
Hadoop and pig at twitter (oscon 2010)Hadoop and pig at twitter (oscon 2010)
Hadoop and pig at twitter (oscon 2010)Kevin Weil
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Milind Bhandarkar
 
System architecture infosheet
System architecture infosheetSystem architecture infosheet
System architecture infosheetjeanrummy
 
Zoo management system
Zoo management systemZoo management system
Zoo management systemKanika Pal
 
Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGAdam Kawa
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Kevin Weil
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Kevin Weil
 
Hadoop MapReduce joins
Hadoop MapReduce joinsHadoop MapReduce joins
Hadoop MapReduce joinsShalish VJ
 
Architecture design in software engineering
Architecture design in software engineeringArchitecture design in software engineering
Architecture design in software engineeringPreeti Mishra
 

Viewers also liked (20)

Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Zoo Information Management System (ZIMS) for Anna Zoological Park, Chennai, I...
Zoo Information Management System (ZIMS) for Anna Zoological Park, Chennai, I...Zoo Information Management System (ZIMS) for Anna Zoological Park, Chennai, I...
Zoo Information Management System (ZIMS) for Anna Zoological Park, Chennai, I...
 
Tech analysis
Tech analysisTech analysis
Tech analysis
 
My zoo project final
My zoo project finalMy zoo project final
My zoo project final
 
Hadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant birdHadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant bird
 
Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)
 
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
 
Big Data at Twitter, Chirp 2010
Big Data at Twitter, Chirp 2010Big Data at Twitter, Chirp 2010
Big Data at Twitter, Chirp 2010
 
Hadoop and pig at twitter (oscon 2010)
Hadoop and pig at twitter (oscon 2010)Hadoop and pig at twitter (oscon 2010)
Hadoop and pig at twitter (oscon 2010)
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
 
System architecture infosheet
System architecture infosheetSystem architecture infosheet
System architecture infosheet
 
Zoo management system
Zoo management systemZoo management system
Zoo management system
 
Zoo Project PowerPoint
Zoo Project PowerPointZoo Project PowerPoint
Zoo Project PowerPoint
 
Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUG
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)
 
Zoo presentation....
Zoo presentation....Zoo presentation....
Zoo presentation....
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
Hadoop MapReduce joins
Hadoop MapReduce joinsHadoop MapReduce joins
Hadoop MapReduce joins
 
Architecture design in software engineering
Architecture design in software engineeringArchitecture design in software engineering
Architecture design in software engineering
 

Similar to The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster

Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchHortonworks
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft PlatformJesus Rodriguez
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksData Con LA
 
Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?rhatr
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemInSemble
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSatish Mohan
 
Elephant in the cloud
Elephant in the cloudElephant in the cloud
Elephant in the cloudrhatr
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impalamarkgrover
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem pptsunera pathan
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 
Experimentation Platform on Hadoop
Experimentation Platform on HadoopExperimentation Platform on Hadoop
Experimentation Platform on HadoopDataWorks Summit
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopTony Ng
 
M. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptxM. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptxDr.Florence Dayana
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentationMapR Technologies
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014cdmaxime
 

Similar to The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster (20)

Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and Search
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of Hortonworks
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
 
Hadoop Summit 2010 Keynote
Hadoop Summit 2010 KeynoteHadoop Summit 2010 Keynote
Hadoop Summit 2010 Keynote
 
Cloudera Hadoop Distribution
Cloudera Hadoop DistributionCloudera Hadoop Distribution
Cloudera Hadoop Distribution
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform Concept
 
Elephant in the cloud
Elephant in the cloudElephant in the cloud
Elephant in the cloud
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
Experimentation Platform on Hadoop
Experimentation Platform on HadoopExperimentation Platform on Hadoop
Experimentation Platform on Hadoop
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on Hadoop
 
M. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptxM. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptx
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentation
 
Empower Data-Driven Organizations with HPE and Hadoop
Empower Data-Driven Organizations with HPE and HadoopEmpower Data-Driven Organizations with HPE and Hadoop
Empower Data-Driven Organizations with HPE and Hadoop
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 

Recently uploaded

English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 

Recently uploaded (20)

English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 

The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster

  • 1. The Zoo Expands Labrador 💛 Elephant,Thanks to Hamster Milind Bhandarkar Chief Scientist, Pivotal Software, Inc.
  • 2. About Me • http://www.linkedin.com/in/milindb • Founding member of Hadoop team atYahoo! [2005-2010] • Contributor to Apache Hadoop since v0.1 • Built and led Grid SolutionsTeam atYahoo! [2007-2010] • Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu) • Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic),Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)
  • 3. Hamster • Hadoop and MPI on the same cluster • Runtime for OpenMPI applications onYARN • Available on Pivotal HD
  • 4. Why MPI ? •Hadoop Dataflow paradigms (MapReduce, TeZ etc) not suitable for iterative applications •Message Passing Interface (MPI) •Mature standard •Used extensively in HPC •Huge ecosystem
  • 5. MPI in Science & Engg Earth Atmosphere Chemistry Biology Math Nuclear
  • 6. MPI in Industry Mechanical ar Finance/bankOil Exploration Cryptography Spacecraft
  • 7. OpenMPI •Mature Open Source implementation of MPI 3.0 Standard (mpi-forum.org) •New BSD license •30+ contributing organizations from academia, research and industry •http://open-mpi.org
  • 10. Hamster Design •YARN as Resource Manager •Hamster Application Manager •Manages MPI jobs •(tries to) Implement Gang-Scheduling •Leverages OMPI/ORTE strengths •Wire-up,Task monitoring, Fast Interconnect
  • 11. Hamster Architecture Resource Manager Scheduler AMService Node Manager Node Manager Node Manager … Proc/Container Framework Daemon NSMPI Scheduler HNP MPI AM Proc/Container …RM-AM AM-NM RM-NodeManager Client Client-RM Aux Srvcs Proc/Container Framework Daemon NS Proc/Container … Aux Srvcs RM- NodeManager
  • 12. Hamster AppMaster • Master daemon for MPI ( similar to JobTracker in MapReduce) • Implements and participates in theYARN-RM App lifecycle protocol • Maintains heartbeat with RM to ensure liveness • MPI Scheduler - Negotiates resource allocation with YARN-RM • Head Node Process (HNP) - manages job execution
  • 13. Hamster Node Service •User-level daemon per MPI job •Manages task execution •Coarse-grained container management •Bootstrapped byYARN-NM •Implemented asYARN Auxiliary Service
  • 14.
  • 15. Why GraphLab on Hadoop ? •Graph Analytics & Machine Learning only one stage in E2E data pipeline •ETL/Preprocessing •Building Graphs from fact & dimension tables •Publishing analytics results, post-processing
  • 16. GraphLab 2.2 •Communication patterns based on Data •SeveralToolkits (Graph Analytics + ML Algorithms) available •Graph-Programming API •Uses MPI for communication
  • 17. Pivotal HD HDFS HBase Pig, Hive, Mahout Map Reduce Sqoop Flume Resource Management & Workflow Yarn Zookeeper Apache Pivotal Command Center Configure, Deploy, Monitor, Manage Spring XD Pivotal HD Enterprise Spring Xtension Framework Catalog Services Query Optimizer Dynamic Pipelining ANSI SQL + Analytics HAWQ – Advanced Database Services Distributed In-memory Store Query Transactions Ingestion Processing Hadoop Driver – Parallel with Compaction ANSI SQL + In-Memory GemFire XD – Real-Time Database Services MADlib Algorithms Oozie Virtual Extensions Graphlab, Open MPI
  • 19. Test Environment •Pivotal Analytics Workbench Cluster •Pivotal HD 1.1 (Apache Hadoop 2.0.5) •Hamster - 1.0, OpenMPI-1.7.2 •515 nodes •2x6-core Westmere, 48GB RAM, 12x2TB SATA, Mellanox FDR Infiniband
  • 20. Null Job •Measures overhead of launching MPI jobs •Tests scalability of resource allocation, launching and wire-up •Sub-linear scalability (slightly worse than O(logN) •Overhead of launching 15000 processes = 1 minute
  • 24. Comparison with OpenMPI •HPL (HP Linpack forTop-500) •Number of processes 50—1000 •Hamster 1% slower than OpenMPI
  • 25. HPL - Hamster vs OpenMPI Time(Sec.) 0 30 60 90 120 1000 500 200 50
  • 26. GraphLab ALS •Wikipedia dataset •4.3 M terms, 3.3M documents, 513M occurrences •17 Processes •5 Iterations
  • 28. GraphLab PageRank •Twitter Dataset •4.1 M nodes, 1.4 B edges •Data Size : 26GB •NP = 17 •50 iterations: 297 seconds •100 iterations: 339 seconds