Professional Course

Advanced Spark for Developers

Length
28 hours
Price
700 EUR + tax
Next course start
Start Anytime! See details
Delivery
Self-paced Online

Course description

This course gives trainees a solid understanding of the internal structure and functioning of Apache Spark: Spark Core (RDD), Spark SQL, Spark Streaming and Spark Structured Streaming. We will discuss how Spark cluster components run under the control of various cluster managers, how resource allocation is managed, and how the scheduler operates. The course focuses on the advantages of the Tungsten internal data format and the Catalyst optimizer.

Upcoming start dates

1 start date available

Start Anytime!

  • Self-paced Online
  • Online
  • English

Who should attend?

Prerequisites

At least 3 months of development experience with Apache Spark in Java or Scala.

Training content

Module 0 - Scala in one day

1. Examine Scala features used in the Spark framework

2. Theory:

1. var and val, val (x, y), lazy val, @transient lazy val

2. type and Type (Nil, None, Null => null, Nothing, Unit => (), Any, AnyRef, AnyVal, String), string interpolation

3. class, object (case), abstract class, trait

4. Scala function, methods, lambda

5. Generic, ClassTag, covariant, contravariant, invariant position, F[_], *

6. Pattern matching and if then else construction

7. Mutable and Immutable collection, Iterator, collection operation

8. Monads (Option, Either, Try, Future, ...), Try().recover

9. map, flatMap, foreach, for comprehension

10. Implicits, private[sql], package

11. Scala sbt, assembly

12. Encoder, Product

13. Scala libs for Spark: scopt, chimney, jsoniter
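
As an illustration of the Option, Try, for comprehension and pattern matching topics listed above, here is a minimal sketch in plain Scala with no Spark dependency. The object and method names (MonadSketch, parse, parseOrZero, sum, describe) are hypothetical and chosen only for this example; they are not taken from the course materials.

```scala
import scala.util.{Failure, Success, Try}

object MonadSketch {
  // Parse a string to Int, capturing any exception inside a Try.
  def parse(s: String): Try[Int] = Try(s.trim.toInt)

  // recover turns a matching failure into a success with a fallback value.
  def parseOrZero(s: String): Int =
    parse(s).recover { case _: NumberFormatException => 0 }.get

  // A for comprehension over Option desugars into flatMap and map.
  def sum(a: Option[Int], b: Option[Int]): Option[Int] =
    for {
      x <- a
      y <- b
    } yield x + y

  // Pattern matching on the two cases of Try.
  def describe(t: Try[Int]): String = t match {
    case Success(v) => s"ok: $v"
    case Failure(e) => s"failed: ${e.getClass.getSimpleName}"
  }

  def main(args: Array[String]): Unit = {
    println(parseOrZero("42"))
    println(parseOrZero("oops"))
    println(sum(Some(1), Some(2)))
    println(sum(Some(1), None))
    println(describe(parse("x")))
  }
}
```

The same style of error handling tends to appear in Spark jobs when parsing untrusted input inside map or flatMap, where a thrown exception would otherwise fail the whole task.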


Module 1 – RDD


1. Theory RDD api:

1. RDD creating api: from array, from file, from DS

2. RDD base operations: map, flatMap, filter, reduceByKey, sort

3. Time parse libs


2. Theory RDD under the hood:

1. Iterator + mapPartitions()

2. RDD creating path: compute() and getPartitions()

3. Partitions

4. Partitioner: Hash and Range

5. Dependencies: wide and narrow

6. Joins: inner, cogroup, join without shuffle

7. Query Plan
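
The partitioning and mapPartitions() topics above can be sketched without a cluster. The code below is an illustration on plain Scala collections, not Spark's own implementation: it assumes hash partitioning assigns a key to the non-negative remainder of its hash code, and it models each partition as a Seq that is processed through a single Iterator. All names (PartitionSketch, partitionOf, partitionBy, mapPartitions) are hypothetical.

```scala
object PartitionSketch {
  // Non-negative modulo, so negative hash codes still map to a valid index.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Hash partitioning: equal keys always land in the same partition.
  def partitionOf(key: Any, numPartitions: Int): Int =
    if (key == null) 0 else nonNegativeMod(key.hashCode, numPartitions)

  // Split records into partitions by key (a stand-in for a shuffle).
  def partitionBy[K, V](records: Seq[(K, V)], numPartitions: Int): Vector[Seq[(K, V)]] = {
    val grouped = records.groupBy { case (k, _) => partitionOf(k, numPartitions) }
    Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty[(K, V)]))
  }

  // mapPartitions-style processing: f is called once per partition
  // and receives an Iterator over that partition's elements.
  def mapPartitions[A, B](partitions: Seq[Seq[A]])(f: Iterator[A] => Iterator[B]): Seq[Seq[B]] =
    partitions.map(p => f(p.iterator).toSeq)

  def main(args: Array[String]): Unit = {
    val parts = partitionBy(Seq("a" -> 1, "b" -> 2, "a" -> 3), 4)
    // One sum per partition, computed from that partition's iterator.
    println(mapPartitions(parts)(it => Iterator(it.map(_._2).sum)))
  }
}
```

Because records with equal keys share a partition, a per-key aggregation after partitionBy needs no further data movement, which is the intuition behind narrow dependencies and joins without shuffle.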


Module 2 - DataFrame & DataSet, Spark DSL & Spark SQL


1. Theory DataFrame, DataSet api:

1. Creating DataFrame: from memory, from file (HDFS, S3, FS) (Avro, Orc, Parquet)

2. Spark DSL: Join broadcast, grouped operations

3. Spark SQL: Window functions, single partitions

4. Scala UDF problem-solving

5. Spark catalog


2. Recreate code using plans

1. Catalyst Optimiser: Logical & Physical plans

2. Codegen

3. Persist vs Cache vs Checkpoint

4. Creating DataFrame Path

5. Row vs InternalRow

Module 3 - Spark optimization

1. Compare speed and size of RDD, DataFrame, DataSet

2. Compare crimes counting: SortMerge Join, Broadcast, Bloom Filter

3. Resolve problems with a skewed join

4. Build UDF for Python and Scala

5. UDF Problems
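
One common way to handle the skewed join mentioned in this module is key salting. The sketch below shows the idea on plain Scala collections rather than on Spark DataFrames, and it is not necessarily the technique taught in the course. It assumes a large side with a hot key and a small side that is cheap to replicate. The salt is derived from the row index to keep the example deterministic, whereas real jobs typically use a random salt. All names (SaltedJoinSketch, saltLarge, explodeSmall, saltedJoin) are hypothetical.

```scala
object SaltedJoinSketch {
  // Large, skewed side: attach a salt so one hot key is spread over `salts` sub-keys.
  def saltLarge[V](rows: Seq[(String, V)], salts: Int): Seq[((String, Int), V)] =
    rows.zipWithIndex.map { case ((k, v), i) => ((k, i % salts), v) }

  // Small side: replicate each row once per salt value so every salted key can match.
  def explodeSmall[W](rows: Seq[(String, W)], salts: Int): Seq[((String, Int), W)] =
    for {
      (k, w) <- rows
      s <- 0 until salts
    } yield ((k, s), w)

  // Inner join on the salted key, then drop the salt from the result.
  def saltedJoin[V, W](large: Seq[(String, V)], small: Seq[(String, W)], salts: Int): Seq[(String, (V, W))] = {
    val lookup: Map[(String, Int), Seq[W]] =
      explodeSmall(small, salts)
        .groupBy(_._1)
        .map { case (key, rows) => key -> rows.map(_._2) }
    for {
      ((k, s), v) <- saltLarge(large, salts)
      w <- lookup.getOrElse((k, s), Seq.empty[W])
    } yield (k, (v, w))
  }

  def main(args: Array[String]): Unit = {
    val large = Seq("hot" -> 1, "hot" -> 2, "hot" -> 3, "cold" -> 4)
    val small = Seq("hot" -> "H", "cold" -> "C")
    println(saltedJoin(large, small, 2))
  }
}
```

The join result is the same as for an unsalted inner join. The difference is that rows for the hot key are split across several salted keys, so in a distributed setting they would be spread over several partitions instead of one.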

Module 4 - External and Connectors

1. How to read/write data from file storages (HDFS, S3, FTP, FS)

2. What data format to choose (JSON, CSV, Avro, ORC, Parquet, Delta, ...)

3. How to parallelize reading/writing to JDBC

4. How to create a DataFrame from MPP (Cassandra, Vertica, Greenplum)

5. How to work with Kafka

6. How to write your own connectors

7. Write UDF for joining with Cassandra


Module 5 – Testing

1. Write a test for data marts written in module (Exercise: find popular time for orders, find the most popular boroughs for orders, find distance distribution for orders grouped by boroughs)

2. Theory:

1. Unit testing

2. Code review

3. QA

4. CI/CD

5. Problems

6. Libs which solve these problems

Module 6 - Spark Cluster

1. Build config with allocation

2. Compare several workers

3. Dynamic Resource Allocation

4. Manually managing executors at runtime

Module 7 - Spark streaming

1. [Solve problem with Cassandra writing](src/main/scala/mod4connectors/DataSetsWithCassandra.scala)

2. Build Spark Structured Streaming reading from Kafka

3. Build Spark Structured Streaming using state

4. Build Spark Structured Streaming writing to Cassandra

Certification / Credits

Objectives

  • Understand Spark’s internal structure
  • Understand the deployment, configuration and execution of Spark components on various clusters (Standalone, YARN, Mesos)
  • Optimize RDD-based Spark jobs
  • Optimize Spark SQL jobs
  • Optimize Microbatch and Structured Streaming jobs

Quick stats about Luxoft Training Center

More than 200 training courses

Conducted over 1,500 training sessions

Customized training solutions for business

Luxoft Training Center
Warsaw Spire, plac Europejski 1
00-844 Warsaw

Luxoft Training Center

Luxoft Training Center — an essential part of the global technology leader, Luxoft, a DXC Technology Company. We play a pivotal role in propelling B2B businesses forward by delivering customized training solutions. Emphasizing the significance of learning and employee development,...
