Large-scale machine learning 1000-319bBML
-Distributing computation to clusters of commodity machines and distributed file systems.
-MapReduce model and basic algorithmic techniques for this model. Comparison of MapReduce algorithms with standard algorithms for typical problems (matrix multiplication, multi-way join, counting triangles in large graphs).
-Total vs elapsed communication cost. Skew and methods to deal with it.
-Spark and Resilient Distributed Dataset model.
-Spark SQL and its optimizations.
-Serialization of big data and columnar formats.
-Managed cloud data warehouses.
-Algorithms for stream processing.
-Distributing typical machine learning algorithms, e.g., linear regression, clustering, decision trees or neural networks.
-Neural networks at large scale (data parallelism, model parallelism).
-Learned index structures.
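As an illustration of the MapReduce topic listed above, the following is a minimal single-machine sketch of one-pass matrix multiplication in the map, shuffle, reduce style. It is not course material: the helper names (`map_reduce`, `make_mapper`, `reducer`, `to_records`) are hypothetical, the shuffle is simulated with an in-memory dictionary, and no real framework such as Hadoop or Spark is involved.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Apply mapper to each record, group emitted pairs by key, reduce each group."""
    groups = defaultdict(list)  # simulated shuffle: key -> list of values
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def make_mapper(n_rows_m, n_cols_n):
    """Build a mapper for P = M x N (hypothetical helper, assumes dense inputs)."""
    def mapper(record):
        name, row, col, value = record
        if name == "M":
            # M[i][j] is needed by every output cell (i, k).
            for k in range(n_cols_n):
                yield (row, k), ("M", col, value)
        else:
            # N[j][k] is needed by every output cell (i, k).
            for i in range(n_rows_m):
                yield (i, col), ("N", row, value)
    return mapper


def reducer(key, values):
    """Join the M and N entries on the shared index j and sum the products."""
    m_vals = {j: v for name, j, v in values if name == "M"}
    n_vals = {j: v for name, j, v in values if name == "N"}
    return sum(m_vals[j] * n_vals[j] for j in m_vals if j in n_vals)


def to_records(name, matrix):
    """Flatten a nested-list matrix into (name, row, col, value) records."""
    return [(name, i, j, v) for i, row in enumerate(matrix) for j, v in enumerate(row)]


M = [[1, 2], [3, 4]]
N = [[5, 6], [7, 8]]
records = to_records("M", M) + to_records("N", N)
product = map_reduce(records, make_mapper(len(M), len(N[0])), reducer)
```

In this sketch every input element is replicated once per output cell that needs it, which is the kind of communication cost the topics above ask to compare against standard algorithms.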
Assessment criteria
The final mark is based on large programming assignments, points for participation in laboratories, and a written exam.
Bibliography
-Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press
-Guglielmo Iozzia, Hands-On Deep Learning with Apache Spark, Packt Publishing
-Butch Quinto, Next-Generation Machine Learning with Spark: Covers XGBoost, LightGBM, Spark NLP, Distributed Deep Learning with Keras, and More, Apress