CIS 545: Big Data Analytics (Spring 2019)

Current Semester

This website is for a previous iteration of CIS 545. To access the current one, click here!

Time & location

Location: Meyerson Hall B1
Mondays + Wednesdays 12:00pm - 1:30pm.

There are optional recitation / lab sessions on Fridays 1:30pm - 3:00pm in Towne 100.
If you have a conflict, you may skip the recitation and still remain in the class!

Instructors

Susan Davidson
Location: 566 Levine Hall
Office hour: Mon 2:00-3:00pm

Clayton Greenberg
Location: 506 Levine Hall
Office hour: Wed 2:00-3:00pm

Teaching assistants

TA Name, PennKey	Office Hour Time	Office Hour Location
Aditya Srivatsan, adisri; Jeffrey Zhou, jmzhou	Mon 5:30pm-7:30pm	Levine 512
Vatsal Chanana, chanana; Brian Sandler, bms	Tue 5pm-7pm	Levine 512
Bhavna Saluja, bsaluja; Leonardo Murri, murri	Wed 6pm-8pm	Levine 512
Isha Gupta, isgupta; Roshan Santhosh, roshansk	Thu 7pm-9pm	Levine 512
Nanthini Balasubramanian, nanthini; Craig Fan, fancraig	Fri 2pm-4pm	Towne 100
Gauri Pradhan, gpradhan	Sat 12pm-2pm	Levine 512
Leshang Chen, lechangc; Hanlin Xiao, hlxiao	Sun 7pm-9pm	Levine 512
Andrew Cui, andrewc; Benjamin Fineran, fineran	By Appointment	By Appointment

Course description

In the era of big data, we increasingly face the challenge of processing vast volumes of data. Given the limits of individual machines (compute power, memory, bandwidth), the solution is more and more often to process the data in parallel on many machines. This course focuses on the fundamentals of scaling computation to handle common data analytics tasks. You will learn about basic tasks in collecting, wrangling, and structuring data; programming models for performing certain kinds of computation in a scalable way across many compute nodes; common approaches to converting algorithms to such programming models; standard toolkits for data analysis consisting of a wide variety of primitives; and popular distributed frameworks for analytics tasks such as filtering, graph analysis, clustering, and classification.

Format

The format will be two 1.5-hour lectures per week, plus assigned readings from books and handouts. There will be regular homework assignments and a substantial implementation project with a hypothesis, evaluation, and a report. There will also be an in-class midterm and a final exam.

Prerequisites

This course expects broad familiarity with probability and statistics, as well as programming in Python. CIS 110, MCIT 590, or the equivalent is required. Additional background in statistics, data analysis (e.g., in Matlab or R), and machine learning (e.g., CIS 519) is helpful.

Texts and readings

We recommend several books for students of different skill levels. The tentative list is:

For students who do not have at least 2 years of a CS degree: You should get the book Data Science from Scratch, by Grus, from O'Reilly. This book provides a quick refresher in Python, probability, statistics, and linear algebra. An online version can be accessed from O'Reilly's Safari service.

For all students: Python for Data Analysis, by McKinney, from O'Reilly.

For advanced students: Python Machine Learning, by Raschka, from Packt.

If you are new to Python and data science, you may find the free UC Berkeley book The Foundations of Data Science useful.

Grading

Homework and projects 55%, midterm 15%, final 25%, participation 5%.
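
As a rough illustration of how these weights combine, here is a minimal Python sketch. It assumes each component is scored on a 0-100 scale; the component names, the course_score helper, and the example scores are hypothetical and not part of the official grading procedure. How the resulting number maps to a letter grade is not specified here.

```python
# Weights taken from the grading breakdown above (they sum to 1.0).
WEIGHTS = {
    "homework_and_projects": 0.55,
    "midterm": 0.15,
    "final": 0.25,
    "participation": 0.05,
}


def course_score(scores):
    """Return the weighted course score.

    `scores` maps each component name in WEIGHTS to a score on an
    assumed 0-100 scale (the scale is an assumption for illustration).
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical example scores, for illustration only.
example = {
    "homework_and_projects": 90,
    "midterm": 80,
    "final": 85,
    "participation": 100,
}
print(round(course_score(example), 2))
```

With these example scores, the homework and project component contributes 49.5 points, the midterm 12, the final 21.25, and participation 5.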

Important sites

We will be using Piazza for course-related discussions; please sign up.

Likewise, please register your SEAS or Google account with the homework submission site.

Lecture Recordings

Available via Canvas, but you will learn more if you actually attend class!

Assignments

Links to homework assignments will be available in the schedule below.

Project option

You may elect to take a homework option involving the completion of 6 homeworks, or a project option involving the completion of 3 advanced homeworks plus a term project. For this project, you will be expected to work in small teams and choose a data analysis task with a suitably large dataset, and to define and execute a series of clustering and modeling tasks over it.

You can find interesting data sets at:

Kaggle (many competitions)
Stanford SNAP (Graph data)
WikiData (Open data)
DBpedia (Wikipedia data)
OpenDataPhilly (government)
IEEG.org (time series data)
ADNI (Alzheimer's genetic data)
OpenWeatherMap (weather data)
(from DataQuest)

Schedule

Previous iterations

Fall 2018, Spring 2018, Spring 2017