CIS 455 / 555: Internet and Web Systems (Spring 2017)

Instructor Andreas Haeberlen
Location: 560 Levine Hall
Office hour: Monday 1:00-2:00pm
Time and location Location: Berger Auditorium
Mondays + Wednesdays 10:30am – noon
Teaching assistants Harshal Lehri, halehri@seas.upenn.edu
Office hour: Fridays 2:00-3:30pm (Levine 6th floor bump space)

Ruifu Zhang, ruifu@seas.upenn.edu
Office hour: Tuesdays 11:00am-12:30pm (Levine 5th floor bump space)

Matthew Howard, mhowar@seas.upenn.edu
Office hour: Wednesdays 3:30-5:00pm (Levine 6th floor bump space)

Wei Hu, huwei@seas.upenn.edu
Office hour: Thursdays 2:30-4:00pm (Levine 6th floor bump space)

Timothy Clancy, clancyt@seas.upenn.edu
Office hour: Thursdays 10:30am-noon (Levine 5th floor bump space)

Zhiyuan Li, zhil@seas.upenn.edu
Office hour: Wednesdays 1:30pm-3:00pm (Education Commons Room 235)

Graham Mosley, gmosley@seas.upenn.edu
Office hour: Mondays 4:30-5:50pm (Levine 6th floor bump space)

Spiro Metaxas, smetaxas@seas.upenn.edu
Office hour: Fridays 3:30-5:00pm (Levine 6th floor bump space)

Course description This course focuses on the issues encountered in building Internet and web systems: scalability, interoperability (of data and code), atomicity and consistency models, replication, and location of resources, services, and data. Note that it is not about building database-backed or PHP/JSP/Servlet-based web sites (for this, see CIS 450/550 or NETS 212). Here, we will learn how a Servlet server itself is built!

We will examine how XML standards enable information exchange; how web services support cross-platform interoperability (and what their limitations are); how "cloud computing" services work; how to do replication and Akamai-like content distribution; and how application servers provide transaction support in distributed environments. We will study techniques for locating machines, resources, and data (including directory systems, information retrieval indexing and ranking, web search, and publish/subscribe systems); we will discuss collaborative filtering and mining the Web for patterns; we will investigate how different architectures support scalability (and the issues they face). We will also examine the ideas that have been proposed for tomorrow's Web, including the "Semantic Web", and see some of the challenges, research directions, and potential pitfalls.

An important goal of the course is not simply to discuss issues and solutions, but to provide hands-on experience with a substantial implementation project. This semester's project will be a peer-to-peer implementation of a Google-style search engine, including distributed, scalable crawling; indexing with ranking; and even PageRank. We will also incorporate the use of topic-specific recognizers and mash-ups.

As a side effect of the material of this course, you will learn about some aspects of large-scale software development: assimilating large APIs, thinking about modularity, reading other people's code, managing versions, debugging, and so on.

CIS555 is now a core course for the MSE degree; for details, please see the MSE requirements. The Daily Pennsylvanian published a nice article about CIS455/555.

Format The format will be two 1.5-hour lectures per week, plus assigned readings from handouts. There will be regular homework assignments and a substantial implementation project with experimental validation and a report. There will also be two in-class midterms.
Prerequisites This course expects familiarity with threads and concurrency, as well as strong Java programming skills. Those highly proficient in another programming language, such as C++ or C#, should be able to translate their skills easily. The course will require a considerable amount of programming, as well as the ability to work with your classmates in teams.
Texts and readings Distributed Systems: Principles and Paradigms, 2nd ed, by Tanenbaum and van Steen, Prentice Hall (ISBN 978-1530281756)
You can buy a physical copy (e.g., for $35 on Amazon) or download a free digital copy here.
Additional materials will be provided as handouts or in the form of light technical papers.
Grading Homework 32%, first midterm 15%, second midterm 15%, project 33%, participation 5%.
Other resources We will be using Piazza for course-related discussions; please sign up here. A reading list is also available.
Assignments The homework assignments will be available here. You can submit your solutions online (requires PennKey login).
Final project Wondering what you will be able to do at the end of this class? Here is an example from Spring 2016:
 
Example search result
Spiro, Chad, Matthew, and JJ
Gesture-controlled mouse
Matthew, JJ, Spiro, and Chad built a scalable, cloud-based search engine called "AhaeStack", which included widgets for sports scores and shopping results, a weather forecast, and a stock ticker, as well as a special hardware device (shown above) that can be used to control the engine with hand gestures. Google donated four Nexus tablets as a prize for the best project, and each member of the AhaeStack team received one of the tablets.

A honorable mention went to team "Sensei+3" (Anupam Alur, Harshal Lehri, Rishab Gupta, and Vidur S. Bhatnagar), whose solution had an impressive modular design and featured, among other things, a full eBay integration.

You can read more about previous Google Award winners and their projects in the CIS455/555 Hall of Fame.

Schedule
Date Topic Details Reading Remarks
Jan 11 Introduction Principles of building systems
Project management & debugging tips
Lampson: Hints for Computer Systems Design  
Jan 16 MLK day — no class
Jan 18 Server architectures Common server types: Web, application
Architectures: client/server, P2P, multi-tier
Marshall: HTTP Made Really Easy
Tanenbaum 3.1
HW0
Jan 23 Server architectures Threads, monitors, signals, producer-consumer
Thread pools, event-driven programming
Krishnamurthy/Rexford Chapter 4
Krohn: OKWS paper
HW1
Jan 25 Naming & locating resources Naming and directories; search strategies
LDAP; DNS; DNSSEC
Wikipedia: DNS
Marshall: LDAP intro
HW0 due
Jan 30 Indexing Document indexing
B+ tree
Comer: The Ubiquitous B-Tree  
Feb 1 Java servlets Servlet API
Servlet containers
   
Feb 6 No class — Andreas in New York for S&P PC meeting HW1 MS1 due
Feb 8 Data formats and data interchange Data representations
XML
DTDS and XML Schema; DOM
XPath
Doan, Halevy, Ives: XML  
Feb 13  
Feb 15 Decentralized systems Partly and fully decentralized systems
Key-based routing
Druschel and Rodrigues: Peer-to-peer systems  
Feb 20 Key-based routing Partitioning and consistent hashing
BitTorrent, Chord
Stoica et al.: Chord HW1 MS2 due
Feb 22 Retrieving data Crawling basics
Publish-subscribe; collaborative filtering
Mercator; XFilter
Altinel and Franklin: XFilter
Heydon and Najork: High-Performance Web Crawling
HW2
Feb 27 Storing data Cloud file system Ghemawat et al.: The Google File System
 
Mar 1 First midterm
Mar 4–12 Spring break — no class
Mar 13 Processing data MapReduce programming model Dean and Ghemawat: MapReduce  
Mar 15 Processing data (contd.) Hadoop Shvachko: Apache Hadoop: The Scalability Update HW2 MS1 due
Mar 20 Code interoperability Remote procedure calls
Web services
SOAP, WSDL, REST
Service composition
XQuery
Tanenbaum chapters 4.2 and 10.3  
Mar 22 Code interoperability (contd.) XQuery tutorial
Mar 27 Documents and ranking Information retrieval models
Web connectivity
Ranking
Baeza-Yates Chapters 2 and 8 Form project groups
Mar 29 No class — Andreas in Boston for NSDI HW2 MS2 due; HW3
Apr 3 Documents and ranking (contd.) Web crawlers
HITS and PageRank
Kleinberg: HITS
Brin and Page: PageRank
Brin and Page: Google
Wired article on Google
 
Apr 5 The Cloud Utility computing model
AWS basics; EC2+EBS
Armbrust: A view of Cloud Computing  
Apr 10 Transactions Application server and TP monitor architectures
ACID properties
Two-phase commit
Tanenbaum chapters 8.5-8.6  
Apr 12 Fault tolerance Replicated state machines
Consensus; Paxos algorithm
Rational behavior and Byzantine faults
Lamport: Paxos (Alternative version)
Schneider: State Machine Approach
HW3 due; Project plan due
Apr 17 Security Web security
Views, ACLs, capabilities; crypto basics
Kerberos; TLS
Tanenbaum chapter 9  
Apr 19 Incremental processing Bigtable
Percolator
Peng and Dabek: Percolator  
TBA Special topics Accountability
Differential privacy
Narayanan and Shmatikov: Robust Deanonymization  
Apr 24 No class — Andreas in Boston for PODC PC meeting
Apr 26 Second midterm
May 1–9
Project demos and reports