Course description
Spark for Data Science | Analyzing Big Data With Spark
Apache Spark is a powerful, open-source processing engine for data in a Hadoop cluster, optimized for speed, ease of use, and sophisticated analytics. The Spark framework supports streaming data processing and complex, iterative algorithms, enabling applications to run up to 100x faster than traditional Hadoop MapReduce programs. With Spark, you can write sophisticated parallel applications that support faster decisions, better decisions, and real-time actions across a wide variety of use cases, architectures, and industries.
Apache Spark for Data Science is a three-day, hands-on course geared for technical business professionals who wish to solve real-world, data-related problems using Apache Spark. This course explores using Apache Spark for common data-related activities. Students will learn to build unified big data applications combining batch, streaming, and interactive analytics on all their data.
NOTE: The hands-on treatment and focus in this course are geared towards the data science aspects of Spark and related tools. Students who want a more developer-oriented edition of this course should consider the TTSK7503 Spark Developer | Spark for Big Data, Hadoop & Machine Learning course, which aligns in subject coverage but is geared for developers instead of data scientists.
Course Objectives
This course is approximately 50% hands-on, combining expert lecture, real-world demonstrations, and group discussions with machine-based practical labs and exercises. Working in a hands-on learning environment led by our expert practitioner, students will explore:
- Spark Essentials
- DataFrames
- Spark SQL
- Spark MLlib
- Spark Streaming
- Streaming with Kafka
- Data Flow with NiFi
- Spark GraphX
- Performance and Tuning
- Cluster Mode
- Spark - the Big Picture
Trivera offers hundreds of end-to-end skills-focused courses that provide participants with the job-ready skills they require to be truly productive in a modern IT business enterprise. Our courses are available for individuals, their teams, or across their organization, for students of all skill levels and roles. We offer an extensive online Public Course Schedule, deep catalog for Private Courses, flex-hour Mini-Camp short courses, self-paced QuickSkills courses, free webinars and more. Trivera’s unique EveryCourse Extras and AfterCourse Extras programs, included with every course, ensure our students can put their newly-learned skills right to work, while providing them with a solid platform for continued skills-development, support and long-term growth. For more information about our dedicated training services, public course offerings, collaborative coaching services, new hire or enterprise upskilling programs, or to see our complete list of course offerings and special offers please call us toll free at 844-475-4559. Our pricing and services are always satisfaction guaranteed.
Who should attend?
This is an introductory-level (and beyond) course. Typical attendees include systems administrators, testers, and others in technical, data-related roles who need to learn to use Spark for data analysis or data processing.
Attending students should have the following background:
- Basic knowledge of Python Programming (or students who know R and can pick up Python easily)
- Basic prior exposure to Java syntax (those without that background can copy and paste the labs)
- Introduction to SQL (familiarity with SQL basics)
- Basic knowledge of statistics, probability, and data science
Training content
Getting Started
- Our Data and our problem set
- Accessing the cluster, the data, and the tools
- The Continuous Workshop approach
- "Let's build a model together"
- Focus on analysis, exploration, data munging, algorithms
- Tooling and fundamentals as necessary to get the job done
Spark Overview
- Data Science: The State of the Art
- Hadoop, YARN, and Spark
- Architectural Overview
- MLlib Overview
- Accessing HDFS data
- Lab Focus
- Working with HDFS data
- Distributed vs. Local Run Modes
- Spark vs. Other tools (when is Spark the right tool for the job?)
- Spark vs. SAS
- Spark Languages (Java, R, Python, and Scala)
- Hello, Spark
Spark Essentials
- Spark Core
- Spark SQL
- Spark and Hive
- Lab
- MLlib
- Spark Streaming
- Spark API
DataFrames
- DataFrames and Resilient Distributed Datasets (RDDs)
- Partitions
- Adding variables to a DataFrame
- DataFrame Types
- DataFrame Operations
- Dependent vs. Independent variables
- Map/Reduce with DataFrames
Spark SQL
- Spark SQL Overview
- Data stores: HDFS, Cassandra, HBase, Hive, and S3
- Table Definitions
- Queries
Spark MLlib
- MLlib overview
- MLlib Algorithms Overview
- Classification Algorithms
- Regression Algorithms
- Lab Focus
- Brief Comparison to SAS
- Here's your split, how to tune regression
- Decision Trees and forests
- Lab Focus
- Brief Comparison to SAS
- Stepwise approach to Decision Trees
- Working with Exit Criteria
- Recommendation with ALS
- Clustering Algorithms
- Lab Focus
- Key Clustering Algorithms
- Choosing Clustering Algorithms
- Working with key algorithms
- Machine Learning Pipelines
- Linear Algebra (SVD, PCA)
- Statistics in MLlib
Spark Streaming
- Streaming overview
- Real-time data ingestion
- State
- Window Operations
Streaming with Kafka
- Kafka overview
- Kafka and Spark Streaming
Data Flow with NiFi
- Apache NiFi overview
- NiFi data flows with Spark/R
Spark GraphX
- GraphX overview
- ETL with GraphX
- Graph computation
Performance and Tuning
- Broadcast variables
- Accumulators
- Memory Management
Cluster Mode
- Standalone Cluster
- Masters and Workers
- Configurations
- Working with large data sets
Spark - the Big Picture
- Spark in Real-Time and near-Real-Time Decision Support Systems
- Spark in the Enterprise
- Best Practices
Course delivery details
Our course materials include more than a simple slideshow presentation handout. Each student will receive a comprehensive course Student Guide, complete with detailed course notes, code samples, software tutorials, diagrams, and related reference materials and links. Our courses also include our detailed Student Workbook, with step-by-step hands-on lab instructions, project files (as necessary), and solutions, clearly illustrated so that users can complete hands-on work in class and revisit it to review or refresh skills at any time. Students will also receive the course setup files, project files (or code, if applicable), and solutions required for the hands-on work.
Costs
- Price: $2,195.00
- Discounted Price: $1,426.75
Quick stats about Trivera Technologies LLC
Over 25 years of technology training expertise.
Robust portfolio of over 1,000 leading edge technology courses.
Guaranteed to run courses and flexible learning options.
Contact this provider
Trivera Technologies
Trivera Technologies is an IT education services & courseware firm that offers a wide range of professional technical education services including: end-to-end IT training development and delivery, skills-based mentoring programs, new hire training and re-skilling services, courseware licensing and...