NETS 212: Scalable and Cloud Computing (Fall 2018)

Instructor: Andreas Haeberlen
Location: 560 Levine Hall
Office hour: Thursdays 2-3pm (Levine 560)
Time and location: Tuesdays/Thursdays 4:30-6:00pm, DRLB A1
Teaching assistants:
Caroline Cai (cax@wharton.upenn.edu)
Office hour: Tuesdays 3-4pm (GRW 5th floor bump space)

Edo Roth (edoroth@cis.upenn.edu)
Office hour: Tuesdays 9:30-10:30am (GRW 5th floor bump space)

Victor Chien (vicchien@seas.upenn.edu)
Office hour: Wednesdays 11:00am-noon (GRW 5th floor bump space)

Brian Sandler (bms@seas.upenn.edu)
Office hour: Thursdays 7:00-8:00pm (GRW 5th floor bump space)

Priya Gupta (guppriya@seas.upenn.edu)
Office hour: Thursdays 10:30-11:30am (GRW 5th floor bump space)

Sumit Shyamsukha (ssumit@seas.upenn.edu)
Office hour: Tuesdays 2:00-3:00pm (GRW 5th floor bump space)

Allen He (heallen@seas.upenn.edu)
Office hour: Mondays 4:30-5:30pm (Levine 5th floor bump space)

Andre Wallace (waandre@seas.upenn.edu)
Office hour: Mondays 10:30am-noon (GRW 5th floor bump space)

David Im (imdongk@seas.upenn.edu)
Office hour: Fridays noon-1pm (GRW 5th floor bump space)

Andreas Wang (andreasx@seas.upenn.edu)
Office hour: Wednesdays 4:00-5:00pm (GRW 5th floor bump space)

Terrence Insung Jo (joterry@seas.upenn.edu)
Office hour: Wednesdays 11am-noon (GRW 5th floor bump space)

Tian Yang (tian17@seas.upenn.edu)
Office hour: Fridays 6:00-7:00pm (GRW 5th floor bump space)

Ipek Ilayda Onur (ionur@seas.upenn.edu)
Office hour: Mondays 3:30-4:30pm (Levine (not GRW!) 5th floor bump space)
 
Course description: What is the "cloud"? How do we build software systems and components that scale to millions of users and petabytes of data, and are "always available"?

In the modern Internet, virtually all large Web services (Google, Yahoo, Facebook, iTunes, Amazon, eBay, Bing, etc.) run atop multiple geographically distributed data centers. These services must scale across thousands of machines, tolerate faults, and support thousands of concurrent requests. Increasingly, the major providers (including Amazon, Google, Microsoft, HP, and IBM) are looking at "hosting" third-party applications in their data centers, forming so-called "cloud computing" services. A significant number of these services also process "streaming" data: geocoding information from cell phones, tweets, streaming video, etc.

This course is aimed at sophomores who have basic programming experience on a single machine. It focuses on the issues and programming models related to cloud and distributed data processing technologies: data partitioning, storage schemes, stream processing, and "mostly shared-nothing" parallel algorithms.

NETS 212 is a required course for the NETS program and for the Data Science Minor.

Topics covered: Datacenter architectures, the MapReduce programming model, Hadoop, cloud algorithms (PageRank, adsorption, friend recommendation, TF/IDF), web programming basics (servlets, AJAX, Node.js/Express, Bootstrap), higher-level programming (Hive, Pig Latin), ...
Format: Two 1.5-hour lectures per week, plus assigned readings. There will be regular homework assignments and a term project, plus a midterm and a final exam.
Prerequisites: CIS 120, Introduction to Programming
CIS 160, Discrete Mathematics
Co-requisite: CIS 121, Data Structures
Texts and readings: Hadoop: The Definitive Guide, Fourth Edition, by Tom White (O'Reilly) (ISBN 978-1-4919-0163-2; read online for free, or buy for approx. $20)
Data-Intensive Text Processing with MapReduce, by Jimmy Lin and Chris Dyer (Morgan & Claypool) (ISBN 978-1608453429; read online for free, or buy for approx. $40)
Additional materials will be provided as handouts or in the form of light technical papers.

Grading: Homework 30%, Term project 30%, Exams 35%, Participation 5%
Policies: You are encouraged to discuss your homework assignments with your classmates; however, any code you submit must be your own work. You may not share code with others or copy code from outside sources, except where the assignment specifically allows it. Plagiarism can have serious consequences.
Resources: We will be using Piazza for course-related discussions.
Term project: In three-person teams, build a small Facebook-like application using Node.js and Amazon's DynamoDB. Based on network analysis, the application should make friend recommendations; it should also visualize the social network.
Facebook award: In previous years, Facebook sponsored an award for the best term project. You can learn more about the winners in the Hall of Fame.
Assignments: Homework assignments will be available for download; you can submit your solution here. If necessary, you can request an extension.
Lab sessions: The TAs may occasionally hold lab sessions to provide additional help with topics related to the class.
Schedule: Below is the tentative schedule for the course:

Date | Topic | Details | Reading | Remarks
Aug 28 | Introduction | Course overview | |
Aug 30 | The Cloud | Kinds of clouds; cloud applications / Datacenters; utility computing / Web vs. cloud vs. cluster | Armbrust et al.: A View of Cloud Computing | HW0
Sep 04 | Concurrency | Parallel architectures; consistency models / Synchronization; locking / Deadlock and livelock; solutions | Vogels: Eventually consistent |
Sep 06 | Faults and failures | Internet basics; TCP and IP / Types of faults; challenges / CAP theorem; eventual consistency | Tseitlin: The Antifragile Organization | HW0 due; HW1
Sep 11 | | | |
Sep 13 | Cloud basics | Introduction to Amazon Web Services / EC2 and EBS / Other services | Handout: Getting Started with AWS |
Sep 17 | Course selection period ends
Sep 18 | Cloud storage | Key-value stores; concurrency control / S3 / DynamoDB | | HW1 MS1 due
Sep 20 | MapReduce | Core concepts / Programming model / Examples of MapReduce algorithms | Dean and Ghemawat: MapReduce: Simplified Data Processing on Large Clusters / Lin & Dyer, Chapter 2: MapReduce Design |
Sep 25, Sep 27 | Programming in MapReduce | Using keys to group / Different kinds of reduce functions / Shuffle implementations | White, Chapter 7: How MapReduce Works / Lin & Dyer, Chapter 3: MapReduce Algorithm Design | HW1 MS2 due
Oct 02 | First midterm exam (covers topics through September 27)
Oct 04–07 | Fall break
Oct 08 | Last day to drop
Oct 09 | Hadoop | Basics: Data types, drivers, mappers, reducers / HDFS; dataflow in Hadoop / Fault tolerance in Hadoop | White, Chapter 3: HDFS / White, Chapter 6: Developing a MapReduce Application | HW2
Oct 11 | Graph algorithms | Iterative MapReduce / Graph representations; SSSP / k-means; Naive Bayes; link analysis | Lin & Dyer, Chapter 5: Graph Algorithms |
Oct 16 | Random-walk algorithms | PageRank / Adsorption / Applications | Baluja et al.: Video suggestion and discovery for YouTube | HW3
Oct 18 | Iterative processing | RDDs / Spark / Pregel | White, Chapter 19 | HW2 due
Oct 23 | Web programming | Client/server versus P2P / Web protocols: DNS, HTTP, ... / How to build a web server; threads vs events | Berners-Lee: Information Management: A Proposal / Google: SPDY white paper |
Oct 25 | | | | HW3 MS1 due
Oct 30 | Web services, XML, JSON | Web services / Data interchange / XML; DTDs; DOM; XML schema; JSON | | HW3 MS2 due; HW4
Nov 01 | Node.js | Node.js; Express; EJS / Managing state; cookies / Web security | Getting Started with Express | Form project teams
Nov 06 | Dynamic content | JavaScript / AJAX / Google Maps | |
Nov 08 | Security | Crypto essentials / Web security | | HW4 MS1 due
Nov 09 | Last day to withdraw
Nov 13 | Beyond MapReduce | SQL / JDBC and LINQ / Hive | White, Chapter 12: Hive / Stonebraker et al.: MapReduce and parallel DBMSs: friends or foes? |
Nov 15 | Peer-to-peer | P2P applications; swarming; incentives / Structured and unstructured overlays; Pastry / P2P security | Rodrigues and Druschel: Peer-to-Peer systems |
Nov 20 | Case study: Bitcoin | Distributed ledgers / Bitcoin | Satoshi Nakamoto: Bitcoin: A Peer-to-Peer Electronic Cash System |
Nov 22 | Thanksgiving break, no class
Nov 27 | TBA | TBA | |
Nov 29 | TBA | TBA | |
Dec 04 | Case study: Facebook | Storage at Facebook / TAO / Haystack | |
Dec 06 | Second midterm exam (covers all topics since the first midterm)
Dec 11–12 | Reading days
Dec 13 | Finals begin; project demos
Dec 20 | Finals end