The document summarizes Milind Bhandarkar's work developing Hamster, a system for running MPI applications on Hadoop YARN. Some key points:
- Hamster allows MPI applications to run alongside Hadoop dataflow jobs on the same cluster managed by YARN. It implements an MPI runtime on top of YARN.
- Hamster's design leverages OpenMPI's strengths while allowing it to integrate with YARN. It includes an application master, node service, and scheduler component.
- Performance tests show Hamster has low overhead and scales well for large MPI jobs. It introduces only a small performance penalty compared to running MPI natively with OpenMPI.
- Example results are shown
The Zoo Expands: Labrador Loves Elephant, Thanks to Hamster
1. The Zoo Expands
Labrador 💛 Elephant, Thanks to Hamster
Milind Bhandarkar
Chief Scientist, Pivotal Software, Inc.
2. About Me
• http://www.linkedin.com/in/milindb
• Founding member of Hadoop team at Yahoo! [2005-2010]
• Contributor to Apache Hadoop since v0.1
• Built and led Grid Solutions Team at Yahoo! [2007-2010]
• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)
• Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)
3. Hamster
• Hadoop and MPI on the same cluster
• Runtime for OpenMPI applications on YARN
• Available on Pivotal HD
4. Why MPI?
• Hadoop dataflow paradigms (MapReduce, Tez, etc.) are not suitable for iterative applications
• Message Passing Interface (MPI)
• Mature standard
• Used extensively in HPC
• Huge ecosystem
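The point about iterative applications can be illustrated with a small simulation. The sketch below is plain Python and does not use MPI or Hadoop; it only mimics the pattern in which each rank keeps its data partition in memory across iterations and combines partial results with an allreduce-style sum. The names `allreduce_sum` and `iterative_mean`, the step size, and the sample data are hypothetical and are not taken from the slides.

```python
from typing import List


def allreduce_sum(partials: List[float]) -> List[float]:
    """Simulate an allreduce: every rank receives the sum of all partial values."""
    total = sum(partials)
    return [total for _ in partials]


def iterative_mean(partitions: List[List[float]],
                   iterations: int = 20,
                   step: float = 0.5) -> float:
    """Estimate the global mean by gradient descent over in-memory partitions.

    Each inner list stands for the data held by one simulated rank.
    """
    n = sum(len(part) for part in partitions)
    if n == 0:
        raise ValueError("at least one data point is required")

    estimate = 0.0
    for _ in range(iterations):
        # Each rank computes its local gradient contribution from data that
        # stays resident in memory between iterations.
        partials = [sum(estimate - x for x in part) for part in partitions]
        # All ranks agree on the global gradient through one collective step.
        gradient = allreduce_sum(partials)[0] / n
        estimate -= step * gradient
    return estimate


if __name__ == "__main__":
    data = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
    print(iterative_mean(data))
```

In a dataflow framework, each pass of this loop would typically be a separate job that rereads its input, whereas a message-passing program keeps the partitions in place and exchanges only the small partial results.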
5. MPI in Science & Engineering
Figure labels: Earth, Atmosphere, Chemistry, Biology, Math, Nuclear
7. OpenMPI
• Mature open source implementation of the MPI 3.0 standard (mpi-forum.org)
• New BSD license
• 30+ contributing organizations from academia, research and industry
• http://open-mpi.org
12. Hamster AppMaster
• Master daemon for MPI (similar to JobTracker in MapReduce)
• Implements and participates in the YARN-RM App lifecycle protocol
• Maintains heartbeat with RM to ensure liveness
• MPI Scheduler - Negotiates resource allocation with YARN-RM
• Head Node Process (HNP) - manages job execution
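The AppMaster responsibilities listed above can be sketched as a toy lifecycle loop. This is a plain-Python simulation under stated assumptions, not Hamster's code and not the YARN API: `FakeResourceManager`, `run_app_master`, the container naming, and the policy of waiting for every container before launching any rank are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeResourceManager:
    """Hypothetical stand-in for the YARN ResourceManager (not the real API)."""
    grant_per_heartbeat: int = 2
    registered: bool = False
    next_id: int = 0

    def register(self) -> None:
        self.registered = True

    def heartbeat(self, outstanding: int) -> List[str]:
        """Report liveness and receive up to grant_per_heartbeat new containers."""
        if not self.registered:
            raise RuntimeError("AppMaster must register before heartbeating")
        granted = max(0, min(outstanding, self.grant_per_heartbeat))
        ids = [f"container_{self.next_id + i}" for i in range(granted)]
        self.next_id += granted
        return ids

    def unregister(self) -> None:
        self.registered = False


def run_app_master(rm: FakeResourceManager, num_ranks: int,
                   max_heartbeats: int = 100) -> List[str]:
    """Register, gather containers over heartbeats, launch, then unregister."""
    events = []
    rm.register()
    events.append("registered")

    containers: List[str] = []
    for _ in range(max_heartbeats):
        if len(containers) >= num_ranks:
            break
        # The scheduler role: ask only for what is still missing.
        containers.extend(rm.heartbeat(num_ranks - len(containers)))
        events.append(f"heartbeat: {len(containers)}/{num_ranks} containers")
    else:
        if len(containers) < num_ranks:
            rm.unregister()
            raise TimeoutError("not enough containers were granted")

    # Assumption: all ranks start together once every container is allocated.
    events.append(f"launch {num_ranks} MPI ranks")
    rm.unregister()
    events.append("unregistered")
    return events


if __name__ == "__main__":
    for line in run_app_master(FakeResourceManager(), num_ranks=5):
        print(line)
```

The loop keeps the two roles from the slide separate: the heartbeat call doubles as the liveness signal, and the launch step, which the slide assigns to the Head Node Process, runs only after allocation completes.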
13. Hamster Node Service
• User-level daemon per MPI job
• Manages task execution
• Coarse-grained container management
• Bootstrapped by YARN-NM
• Implemented as YARN Auxiliary Service
14.
15. Why GraphLab on Hadoop?
• Graph Analytics & Machine Learning only one stage in E2E data pipeline
• ETL/Preprocessing
• Building Graphs from fact & dimension tables
• Publishing analytics results, post-processing
16. GraphLab 2.2
• Communication patterns based on data
• Several Toolkits (Graph Analytics + ML Algorithms) available
• Graph-Programming API
• Uses MPI for communication
17. Pivotal HD
Architecture diagram (labels only; boxes are marked as either Apache or Pivotal):
• Pivotal HD Enterprise: HDFS, HBase, Pig, Hive, Mahout, MapReduce, Sqoop, Flume
• Resource Management & Workflow: Yarn, Zookeeper, Oozie
• Command Center: Configure, Deploy, Monitor, Manage
• Spring XD
• HAWQ – Advanced Database Services: Xtension Framework, Catalog Services, Query Optimizer, Dynamic Pipelining, ANSI SQL + Analytics
• GemFire XD – Real-Time Database Services: Distributed In-memory Store, Query, Transactions, Ingestion, Processing, Hadoop Driver – Parallel with Compaction, ANSI SQL + In-Memory
• MADlib Algorithms
• Virtual Extensions
• GraphLab, Open MPI