CIS 545 (Initially 700-003): Big Data Analytics (Spring 2017, Beta version)

Time & location Wu and Chen Auditorium
Mondays + Wednesdays 12:00pm-1:30pm.
Please reserve Fridays 12:00pm-1:00pm for lab and recitation sessions. Bring your laptop!
Instructors Zachary Ives
Location: 576 Levine Hall
Office hour: Wednesday 2:00-3:00
Susan B. Davidson
Location: 566 Levine Hall North
Office hour: Tuesday 1:00-2:00
Abdussalam Alawini
Location: 571 Levine Hall North
Office hour: Thursday 1:00-2:00
Teaching assistants All TA office hours are on the 5th floor of the main Levine (not Levine North) building, in the common space by the main stairwell.
JT Cho, joncho
Tuesdays 4:00-5:00PM, Wednesdays 4:30-5:30PM
Trevin Gandhi, gandhit
Mondays 4:00-5:00PM, Thursdays 3:00-4:00PM
Jordan Hurwitz, jhurwitz
Tuesdays 6:30-7:30PM, Wednesdays 6:30-7:30PM
Eddie Kong, ekong
Brayden Neal, nealb
Mondays 4:00-5:00PM, Wednesdays 5:00-6:00PM
Sierra Yit, sierray
Thursdays 2:30-4:30PM
Course description

In the new era of big data, we increasingly face the challenge of processing vast volumes of data. Given the limits of individual machines (compute power, memory, bandwidth), the solution is often to process the data in parallel on many machines. This course focuses on the fundamentals of scaling computation to handle common data analytics tasks. You will learn about basic tasks in collecting, wrangling, and structuring data; programming models for performing certain kinds of computation in a scalable way across many compute nodes; common approaches to converting algorithms to such programming models; standard toolkits for data analysis consisting of a wide variety of primitives; and popular distributed frameworks for analytics tasks such as filtering, graph analysis, clustering, and classification.

Students, be aware: This class should be considered to be in "beta" form, as it is a first offering that attempts to "push the envelope" with the latest tools and technologies.

Format The format will be two 1.5-hour lectures per week, plus assigned readings from books and handouts, and frequent (optional but highly recommended) recitations. There will be regular homework assignments and a substantial implementation project with a hypothesis, evaluation, and a report. There will also be an in-class midterm and a final exam.
Prerequisites This course expects broad familiarity with probability and statistics, as well as programming in Python. CIS 110, MCIT 590, or the equivalent is required. Additional background in statistics, data analysis (e.g., in Matlab or R), and machine learning (e.g., CIS 519) is helpful.
Texts and readings

We recommend several books for students of different skill levels. The tentative list is:

For students who do not have at least 2 years of a CS degree: You should get the book Data Science from Scratch, by Grus, from O'Reilly. This book provides a quick refresher in Python, probability, statistics, and linear algebra. An online version can be accessed from O'Reilly's Safari service.

For all students: Python for Data Analysis, by McKinney, from O'Reilly.

For advanced students: Python Machine Learning, by Raschka, from Packt.

Grading Homework and projects 55%, midterm 15%, final 25%, participation 5%.
Important sites

We will be using Piazza for course-related discussions; please sign up. Likewise, please register your SEAS or Google account with the homework submission site.

Assignments The homework assignments will be available here.
Project option

You may choose either a homework option, which requires completing 6 homework assignments, or a project option, which requires completing 3 advanced homework assignments plus a term project. For the project, you will be expected to work in a small team, choose a data analysis task with a suitably large dataset, and define and execute a series of clustering and modeling tasks over it.

You can find interesting data sets at:

Previous iterations This is the first iteration of this course!