MKSE 212: Scalable and Cloud Computing (Fall 2012)

Location: Towne 311, Tuesday/Thursday 4:30-6:00pm
Instructor: Andreas Haeberlen
Office: 560 Levine Hall
Office hour: Mondays 12:30-1:30pm
Teaching assistants: Hongda Ma, ma1@seas.upenn.edu
Office hour: Tuesdays 10:00-11:00am (Moore 100A)

Eric O'Brien, ericob@seas.upenn.edu
Office hour: Thursdays 9:00-10:00am (Penn Education Commons, Room 230)

Matt Rosenberg, rmatt@seas.upenn.edu
Office hour: Wednesdays 4:00-5:00pm (Moore 207)
Course description: What is the "cloud"? How do we build software systems and components that scale to millions of users and petabytes of data, and are "always available"?

In the modern Internet, virtually all large Web services (Google, Yahoo, Facebook, iTunes, Amazon, eBay, Bing, etc.) run atop multiple geographically distributed data centers. These services must scale across thousands of machines, tolerate faults, and support thousands of concurrent requests. Increasingly, the major providers (including Amazon, Google, Microsoft, HP, and IBM) are looking at "hosting" third-party applications in their data centers, forming so-called "cloud computing" services. A significant number of these services also process "streaming" data: geocoding information from cell phones, tweets, streaming video, etc.

This course, aimed at sophomores with exposure to basic programming on a single machine, focuses on the issues and programming models related to such cloud and distributed data processing technologies: data partitioning, storage schemes, stream processing, and "mostly shared-nothing" parallel algorithms.
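
As a rough illustration of what "data partitioning" and "mostly shared-nothing" processing mean, the sketch below counts words by splitting the input into independent chunks, processing each chunk without any shared state, and merging the partial results. It is a minimal single-machine sketch in Python, not course material; the function names (`partition`, `count_words`, `merge`, `word_count`) and the round-robin partitioning scheme are assumptions chosen for illustration.

```python
from collections import defaultdict


def partition(records, num_workers):
    """Split the input into num_workers disjoint chunks (round-robin; an assumed scheme)."""
    chunks = [[] for _ in range(num_workers)]
    for i, record in enumerate(records):
        chunks[i % num_workers].append(record)
    return chunks


def count_words(chunk):
    """Count words in one chunk. Reads only its own chunk, so it shares nothing."""
    counts = defaultdict(int)
    for line in chunk:
        for word in line.lower().split():
            counts[word] += 1
    return counts


def merge(partials):
    """Combine the per-chunk counts into one result."""
    totals = defaultdict(int)
    for partial in partials:
        for word, n in partial.items():
            totals[word] += n
    return dict(totals)


def word_count(records, num_workers=3):
    chunks = partition(records, num_workers)
    # Sequential here; on a cluster, each call could run on a different machine.
    partials = [count_words(chunk) for chunk in chunks]
    return merge(partials)


if __name__ == "__main__":
    lines = ["the cloud", "the data center", "cloud data"]
    print(word_count(lines, num_workers=2))
```

Because each `count_words` call depends only on its own chunk, the chunks could be processed in any order or in parallel; only the final `merge` step needs to see results from more than one worker.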

Topics covered: Datacenter architectures, the MapReduce programming model, Hadoop, cloud algorithms (PageRank, adsorption, friend recommendation, TF/IDF), web programming basics (servlets, AJAX, GWT), higher-level programming (Hive, Pig Latin), ...
Format: Two 1.5-hour lectures per week, plus assigned readings. There will be regular homework assignments and a term project, plus a midterm and a final exam.
Prerequisites: CIS 120, Introduction to Programming
CIS 160, Discrete Mathematics
Co-requisite: CIS 121, Data Structures
Texts and readings: Hadoop: The Definitive Guide, Third Edition, by Tom White (O'Reilly)
Additional materials will be provided as handouts or in the form of light technical papers.
Grading: Homework 30%, Midterm 18%, Term project 30%, Participation 2%, Final 20%
Policies: You are encouraged to discuss your homework assignments with your classmates; however, any code you submit must be your own work. You may not share code with others or copy code from outside sources, except where the assignment specifically allows it. Plagiarism can have serious consequences.
Resources: We will be using Piazza for course-related discussions.
Term project: In two-person teams, build a small Facebook-like application using servlets and the Google Web Toolkit (GWT). Based on network analysis, the application should make friend recommendations; it should also visualize the social network.

NEW! There will be an award for the best project (sponsored by Facebook).

Assignments: Homework assignments are available for download. If necessary, you can request an extension.
Lab sessions: The TAs will occasionally hold lab sessions to answer questions about the homework assignments. This semester, the lab sessions will be held on Mondays from noon to 1:30pm in Moore 207. Lab sessions will be announced in advance on Piazza.
Schedule
Date | Topic | Details | Reading | Remarks
Sep 06 | Introduction | Course overview | -- |
Sep 11 | The Cloud | Kinds of clouds; cloud applications / Datacenters; utility computing / Web vs. cloud vs. cluster | Armbrust et al.: A View of Cloud Computing | HW0
Sep 13 | Concurrency | Parallel architectures; consistency models / Synchronization; locking / Deadlock and livelock; solutions |  | HW1
Sep 18 | Faults and failures | Internet basics; TCP and IP / Types of faults; challenges / CAP theorem; eventual consistency | Vogels: Eventually consistent | HW0 due
Sep 20 | Cloud basics | Introduction to Amazon Web Services / EC2 and EBS / Other services | Handout: Getting Started with AWS |
Sep 21 | Course selection period ends
Sep 25 | Cloud storage | Key-value stores; concurrency control / S3 / SimpleDB |  | HW1 MS1 due
Sep 27 | Cloud case studies | Salesforce.com (Guest lecture: Anand Subramanian) / Netflix; Google Apps / Data Warehousing at Facebook | White, Chapter 16: Case Studies / NY Times article |
Oct 02 | MapReduce | Core concepts / Programming model / Examples of MapReduce algorithms | Dean and Ghemawat: MapReduce: Simplified Data Processing on Large Clusters |
Oct 04 | Programming in MapReduce | Using keys to group / Different kinds of reduce functions / Shuffle implementations | White, Chapter 6: How MapReduce Works | HW1 MS2 due
Oct 09 | No class -- Andreas at OSDI
Oct 11 | Midterm exam (covers topics through Oct 04)
Oct 12 | Last day to drop
Oct 16 | Hadoop | Basics: Data types, drivers, mappers, reducers / HDFS; dataflow in Hadoop / Fault tolerance in Hadoop | White, Chapter 3: HDFS / White, Chapter 5: Applications / Hadoop Quick Start | HW2
Oct 18 | Graph algorithms | Iterative MapReduce / Graph representations; SSSP / k-means; Naive Bayes; link analysis |  |
Oct 23 | Fall break
Oct 25 | Random-walk algorithms | PageRank / Adsorption / Applications | Baluja et al.: Video suggestion and discovery for YouTube | HW2 due; HW3
Oct 30 | Class canceled (Hurricane Sandy)
Nov 01 | Web programming | Client/server versus P2P / Web protocols: DNS, HTTP, ... / How to build a web server; threads vs events | Berners-Lee: Information Management: A Proposal / Google: SPDY white paper |
Nov 06 | Servlets | Servlet API; servlet containers; deploying servlets / Managing state; cookies / Web security |  | Team project spec
Nov 08 | Web services and XML | Web services / Data interchange / XML; DTDs; DOM; XML schema |  | Form project teams
Nov 13 |  |  |  | HW3 due; HW4
Nov 15 | Dynamic content | JavaScript / AJAX / Google Web Toolkit | GWT Tutorial |
Nov 16 | Last day to withdraw
Nov 20 | Beyond MapReduce | SQL / JDBC and LINQ / Hive | White, Chapter 12: Hive / Stonebraker et al.: MapReduce and parallel DBMSs: friends or foes? | HW4 due
Nov 22 | Thanksgiving break -- no class
Nov 27 | Hierarchical data | Beyond relations / Pig Latin / XQuery | White, Chapter 11: Pig |
Nov 29 | Peer-to-peer | P2P applications; swarming; incentives / Structured and unstructured overlays; Pastry / P2P security | Rodrigues and Druschel: Peer-to-Peer systems |
Dec 04 | Special topics | Accountability / Differential privacy / Network forensics | A Case for the Accountable Cloud |
Dec 06 | Second midterm exam (covers all topics since Oct 16)
Dec 08 | Reading days
Dec 12 | Finals begin; project demos
Dec 19 | Finals end
Previous versions: Fall 2010, Fall 2011