Name: Advanced Spark for Developers
Brand: Luxoft Training Center
SKU: 2017726

Course description

This course gives trainees a solid understanding of the internal structure and operation of Apache Spark: Spark Core (RDD), Spark SQL, Spark Streaming, and Structured Streaming. It covers how Spark cluster components run under various cluster managers, how resources are allocated, and how the scheduler works. It also focuses on the advantages of Tungsten's internal data representation and the Catalyst optimizer.

Upcoming start dates

1 start date available

Start Anytime!

Self-paced Online
Online
English

Who should attend?

Prerequisites

More than 3 months of development experience with Apache Spark in Java or Scala.

Training content

Module 0 - Scala in one day

1. Examine Scala features used in the Spark framework

2. Theory:

1. var and val, val (x, y), lazy val, @transient lazy val

2. type and Type (Nil, None, Null => null, Nothing, Unit => (), Any, AnyRef, AnyVal, String), string interpolation

3. class, object (case), abstract class, trait

4. Scala functions, methods, lambdas

5. Generics, ClassTag, covariant, contravariant and invariant positions, F[_], *

6. Pattern matching and the if-then-else construct

7. Mutable and immutable collections, Iterator, collection operations

8. Monads (Option, Either, Try, Future, ...), Try().recover

9. map, flatMap, foreach, for comprehension

10. Implicits, private[sql], package

11. Scala sbt, assembly

12. Encoder, Product

13. Scala libs for Spark: scopt, chimney, jsoniter
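As a rough illustration of the Try().recover and for comprehension topics listed above, here is a minimal plain-Scala sketch. It is not course material: the object name `Module0Sketch` and the helpers `parsePort`, `portOrDefault`, and `hostAndPort` are hypothetical names invented for this example.

```scala
import scala.util.Try

// Hypothetical example object; none of these names come from the course.
object Module0Sketch {
  // Wrap a parse that may throw into a Try instead of letting it escape.
  def parsePort(raw: String): Try[Int] = Try(raw.trim.toInt)

  // recover maps a matching failure to a fallback value, yielding a Success.
  def portOrDefault(raw: String, default: Int): Int =
    parsePort(raw).recover { case _: NumberFormatException => default }.get

  // A for comprehension over Option desugars into flatMap and map:
  // the result is None as soon as any input is None.
  def hostAndPort(host: Option[String], port: Option[Int]): Option[String] =
    for {
      h <- host
      p <- port
    } yield s"$h:$p"

  def main(args: Array[String]): Unit = {
    println(portOrDefault("8080", 80))
    println(portOrDefault("not-a-number", 80))
    println(hostAndPort(Some("localhost"), Some(9092)))
    println(hostAndPort(None, Some(9092)))
  }
}
```

The same flatMap/map shape applies to Either, Try, and Future, which is why they appear together in the outline.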

Module 1 – RDD

1. Theory: RDD API

1. RDD creation API: from array, from file, from DS

2. Basic RDD operations: map, flatMap, filter, reduceByKey, sort

3. Time-parsing libraries

2. Theory: RDD under the hood

1. Iterator + mapPartitions()

2. RDD creation path: compute() and getPartitions()

3. Partitions

4. Partitioner: Hash and Range

5. Dependencies: wide and narrow

6. Joins: inner, cogroup, join without shuffle

7. Query Plan
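To make the hash partitioner topic above more concrete, the following plain-Scala sketch shows the general idea of routing a key to a partition by a non-negative modulo of its hash code. It is an illustration only and does not use Spark's own classes: `HashPartitionSketch`, `partitionFor`, and `partition` are hypothetical names, and sending null keys to partition 0 is a choice made for this sketch.

```scala
// Hypothetical example object; a plain-Scala sketch, not Spark's Partitioner API.
object HashPartitionSketch {
  // Map a key to a partition index in [0, numPartitions).
  // hashCode can be negative, so the remainder is shifted when needed.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    require(numPartitions > 0, "numPartitions must be positive")
    if (key == null) 0 // assumption of this sketch: null keys go to partition 0
    else {
      val raw = key.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
    }
  }

  // Group key/value records by partition index, the way a shuffle routes them.
  def partition[K, V](records: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    records.groupBy { case (k, _) => partitionFor(k, numPartitions) }

  def main(args: Array[String]): Unit = {
    val records = Seq("a" -> 1, "b" -> 2, "a" -> 3, "c" -> 4)
    partition(records, 2).toSeq.sortBy(_._1).foreach { case (idx, rows) =>
      println(s"partition $idx: $rows")
    }
  }
}
```

Because equal keys always land in the same partition, two datasets partitioned the same way can be joined key by key without moving data again, which is the idea behind the "join without shuffle" item.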

Module 2 - DataFrame & Dataset, Spark DSL & Spark SQL

1. Theory: DataFrame and Dataset API

1. Creating a DataFrame: from memory, from files (HDFS, S3, FS) in Avro, ORC, or Parquet format

2. Spark DSL: broadcast join, grouped operations

3. Spark SQL: Window functions, single partitions

4. Scala UDF problem-solving

5. Spark catalog

2. Recreate code using plans

1. Catalyst optimizer: logical and physical plans

2. Codegen

3. Persist vs Cache vs Checkpoint

4. DataFrame creation path

5. Row vs InternalRow

Module 3 - Spark optimization

1. Compare the speed and size of RDD, DataFrame, and Dataset

2. Compare approaches to counting crimes: sort-merge join, broadcast join, Bloom filter

3. Resolve problems with a skewed join

4. Build UDF for Python and Scala

5. UDF Problems

Module 4 - External and Connectors

1. How to read/write data from file storages (HDFS, S3, FTP, FS)

2. What data format to choose (JSON, CSV, Avro, ORC, Parquet, Delta, ...)

3. How to parallelize reading from and writing to JDBC

4. How to create a DataFrame from an MPP database (Cassandra, Vertica, Greenplum)

5. How to work with Kafka

6. How to write your own connectors

7. Write a UDF for joining with Cassandra

Module 5 – Testing

1. Write a test for data marts written in module (Exercise: find popular time for orders, find the most popular boroughs for orders, find distance distribution for orders grouped by boroughs)

2. Theory:

1. Unit testing

2. Code review

3. QA

4. CI/CD

5. Problems

6. Libs which solve these problems

Module 6 - Spark Cluster

1. Build config with allocation

2. Compare several workers

3. Dynamic Resource Allocation

4. Manually managing executors at runtime

Module 7 - Spark Streaming

1. Solve the problem with writing to Cassandra (src/main/scala/mod4connectors/DataSetsWithCassandra.scala)

2. Build a Spark Structured Streaming job reading from Kafka

3. Build a Spark Structured Streaming job using state

4. Build a Spark Structured Streaming job writing to Cassandra

Certification / Credits

Objectives

Understand Spark’s internal structure
Understand the deployment, configuration and execution of Spark components on various clusters (Standalone, YARN, Mesos)
Optimize RDD-based Spark jobs
Optimize Spark SQL jobs
Optimize micro-batch and Structured Streaming jobs

Quick stats about Luxoft Training Center?

More than 200 training courses

Conducted over 1,500 training sessions

Customized training solutions for business

Luxoft Training Center

Warsaw Spire, plac Europejski 1

00-844 Warsaw

+48122110666

Luxoft Training Center

Luxoft Training Center — an essential part of the global technology leader, Luxoft, a DXC Technology Company. We play a pivotal role in propelling B2B businesses forward by delivering customized training solutions. Emphasizing the significance of learning and employee development,...
