Large-scale machine learning 1000-319bBML

-Distributing computation across clusters of commodity machines and distributed file systems.
-The MapReduce model and basic algorithmic techniques for it. Comparison of MapReduce algorithms with conventional algorithms for typical problems (matrix multiplication, multi-way join, counting triangles in large graphs).
-Total vs elapsed communication cost. Skew and methods to deal with it.
-Spark and the Resilient Distributed Dataset (RDD) model.
-Spark SQL and its optimizations.
-Serialization of big data and columnar formats.
-Managed cloud data warehouses.
-Algorithms for stream processing.
-Distributing typical machine learning algorithms, e.g., linear regression, clustering, decision trees or neural networks.
-Neural networks at large scale (data parallelism, model parallelism).
-Learned index structures.
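
As an illustration of the MapReduce model applied to matrix multiplication (one of the typical problems listed above), here is a minimal single-machine sketch in Python. It is not taken from the course materials. The function names (map_matrices, shuffle, reduce_cell, mapreduce_matmul) are hypothetical, and the shuffle step is simulated with an in-memory dictionary rather than a real cluster. It follows the one-pass approach, in which every matrix entry is replicated to each output cell that needs it. The number of emitted key-value pairs therefore gives a rough idea of the total communication cost.

```python
from collections import defaultdict


def map_matrices(a, b):
    """Map step: send every entry to each output cell (i, k) that needs it."""
    n_rows, n_cols = len(a), len(b[0])
    for i, row in enumerate(a):
        for j, value in enumerate(row):
            # a[i][j] contributes to every cell in output row i.
            for k in range(n_cols):
                yield (i, k), ("A", j, value)
    for j, row in enumerate(b):
        for k, value in enumerate(row):
            # b[j][k] contributes to every cell in output column k.
            for i in range(n_rows):
                yield (i, k), ("B", j, value)


def shuffle(pairs):
    """Simulated shuffle: group values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_cell(values):
    """Reduce step for one output cell: match entries on j and sum products."""
    a_vals = {j: v for tag, j, v in values if tag == "A"}
    b_vals = {j: v for tag, j, v in values if tag == "B"}
    return sum(a_vals[j] * b_vals[j] for j in a_vals)


def mapreduce_matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    groups = shuffle(map_matrices(a, b))
    n_rows, n_cols = len(a), len(b[0])
    return [[reduce_cell(groups[(i, k)]) for k in range(n_cols)]
            for i in range(n_rows)]


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(mapreduce_matmul(a, b))
    # Number of key-value pairs sent from mappers to reducers.
    print(sum(1 for _ in map_matrices(a, b)))
```

For two n x n matrices this sketch emits 2n^3 key-value pairs. This is why the one-pass method is usually compared with a two-pass variant when discussing communication cost.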

Course coordinators

Term 2024Z:

Zuzanna Dudka
Antoni Kisło

Term 2023Z:

Antoni Kisło
Konrad Klimiuk

Type of course

elective monographs

Requirements

Deep neural networks
Natural language processing
Statistical machine learning

Prerequisites (description)

Object-oriented programming, computer networks, algorithms and data structures.

Assessment criteria

The final mark is based on large programming assignments, points for participation in laboratories, and a written exam.

Bibliography

-Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press.
-Guglielmo Iozzia. Hands-On Deep Learning with Apache Spark. Packt Publishing.
-Butch Quinto. Next-Generation Machine Learning with Spark: Covers XGBoost, LightGBM, Spark NLP, Distributed Deep Learning with Keras, and More. Apress.