CIS 455 / 555: Internet and Web Systems (Spring 2011)

Location Crest classroom (Arch building, 3601 Locust Walk)
Monday/Wednesday 10:30am - noon
Instructor Andreas Haeberlen
Location: 560 Levine Hall North (a.k.a. GRW building)
Office hours: Mondays 12:30-1:30pm
Teaching assistants Ben Karel, karel@seas.upenn.edu
Office hour: Fridays 2-3pm (Moore 102)

Alex Lee, leea@seas.upenn.edu
Office hour: Wednesdays 4:30-5:30pm (Levine 612)

Zhongxin Ma, maz@seas.upenn.edu
Office hour: Thursdays 1:30-2:30pm (Levine 612)

Course description This course focuses on the issues encountered in building Internet and web systems: scalability, interoperability (of data and code), atomicity and consistency models, replication, and location of resources, services, and data. Note that it is not about building database-backed or PHP/JSP/Servlet-based web sites (for this, see CIS 330/550 or MKSE 212). Here, we will learn how a Servlet server itself is built!

We will examine how XML standards enable information exchange; how web services support cross-platform interoperability (and what their limitations are); how "cloud computing" services work; how to do replication and Akamai-like content distribution; and how application servers provide transaction support in distributed environments. We will study techniques for locating machines, resources, and data (including directory systems, information retrieval indexing and ranking, web search, and publish/subscribe systems); we will discuss collaborative filtering and mining the Web for patterns; we will investigate how different architectures support scalability (and the issues they face). We will also examine the ideas that have been proposed for tomorrow's Web, including the "Semantic Web", and see some of the challenges, research directions, and potential pitfalls.

An important goal of the course is not simply to discuss issues and solutions, but to provide hands-on experience with a substantial implementation project. This semester's project will be a peer-to-peer implementation of a Google-style search engine, including distributed, scalable crawling; indexing with ranking; and even PageRank. We will also incorporate the use of topic-specific recognizers and mash-ups.

As a side effect of the material of this course, you will learn about some aspects of large-scale software development: assimilating large APIs, thinking about modularity, reading other people's code, managing versions, debugging, and so on.

Format The format will be two 1.5-hour lectures per week, plus assigned readings from handouts. There will be regular homework assignments and a substantial implementation project with experimental validation and a report. There will also be a midterm and a final exam.
Prerequisites This course expects familiarity with threads and concurrency, as well as strong Java programming skills. Those highly proficient in another programming language, such as C++ or C#, should be able to translate their skills easily. The course will require a considerable amount of programming, as well as the ability to work with your classmates in teams.
Texts and readings Distributed Systems: Principles and Paradigms, 2nd ed, by Tanenbaum and van Steen, Prentice Hall
Additional materials will be provided as handouts or in the form of light technical papers.
Grading Homework 25%, midterm 15%, final exam 15%, project 40%, participation 5%.
Other resources Course discussion forum: http://groups.google.com/group/cis455-spring2011
Assignments are available in (frequently updated) electronic form here
Schedule
Date Topic Details Reading Remarks
Jan 12 Introduction Principles of building systems
Project management & debugging tips
Lampson: Hints for Computer Systems Design  
Jan 17 MLK day -- no class
Jan 19 Server architectures Common server types: Web, application
Architectures: client/server, P2P, multi-tier
Threads, monitors, signals, producer-consumer
Thread pools, event-driven programming
Marshall: HTTP Made Really Easy
Krohn: OKWS paper
HW0
Jan 24 Krishnamurthy/Rexford Chapter 4
Tanenbaum 3.1
 
Jan 26 Naming & locating resources Naming and directories; URIs
Search strategies
Content-based addressing
Document indexing
Publish-subscribe
Wikipedia: DNS
Marshall: LDAP intro
HW0 due;
HW1
Jan 31 Heydon and Najork: High-Performance Web Crawling  
Feb 2 Representing data Data representations
Schemas
JPEG, MP3, and QT
XML and XPath
XSLT
Collaborative Filtering
Doan, Halevy, Ives: XML HW1 MS1 due
Feb 7 XSLT Tutorial  
Feb 9 Altinel and Franklin: XFilter  
Feb 14 Decentralized systems Partly and fully decentralized systems
Key-based routing
Partitioning and consistent hashing
BitTorrent, Chord, Pastry
Stoica et al.: Chord  
Feb 16 HW1 MS2 due;
HW 2
Feb 21 Storing, distributing, retrieving, and processing data Cloud file system
MapReduce programming model
Ghemawat et al.: The Google File System
Dean and Ghemawat: MapReduce
 
Feb 23  
Feb 28 Midterm
Mar 2 Storing, distributing, retrieving, and processing data Hadoop
Mercator
Heydon and Najork: Mercator HW 2 MS1 due; form project groups
Mar 7 Spring break -- no class
Mar 9
Mar 14 Code interoperability Remote procedure calls
Web services, SOAP, WSDL, REST
UDDI
Service composition
Tanenbaum chapters 4.2 and 10.3
SOAP tutorial
WSDL tutorial
HW2 MS2 due; HW 3
Mar 16  
Mar 21  
Mar 23 Documents and ranking Information retrieval models
Web connectivity
Ranking
Web crawlers
HITS and PageRank
Baeza-Yates Chapters 2 and 8
Kleinberg: HITS
Brin and Page: PageRank
Brin and Page: Google
Wired article on Google
 
Mar 28  
Mar 30
No class -- Andreas in Boston for NSDI
Apr 4 Transactions Application server and TP monitor architectures
ACID properties
Two-phase commit
Cloud; utility computing model
Tanenbaum chapters 8.5-8.6 HW 3 due; begin project planning
Apr 6  
Apr 11 Fault tolerance Replicated state machines
Consensus; Paxos algorithm
Rational behavior and Byzantine faults
Lamport: Paxos (Alternative version)
Schneider: State Machine Approach
Initial project plan due
Apr 13 Security Web security
Views, ACLs, capabilities; crypto basics
Kerberos; TLS
Tanenbaum chapter 9  
Apr 18 Incremental processing Bigtable
Percolator
Peng and Dabek: Percolator  
Apr 20 Special topics Accountability
Differential privacy
Narayanan et al.: Deanonymization  
Apr 25 Second midterm
  Project demos and reports      
Previous versions Spring '04   Spring '06   Spring '07   Spring '08   Spring '09   Spring '10   (taught by Zachary G. Ives)