CIS 455 / 555: Internet and Web Systems (Spring 2014)

Instructor Andreas Haeberlen
Location: 560 Levine Hall
Office hour: Wednesdays 2-3pm
Time and location Location: Moore 216
Mondays + Wednesdays 10:30am - noon
Teaching assistants Xu Wang, xuwang1@seas.upenn.edu
Office hour: Thursdays 4-6pm (Moore 207)

Antonis Papadimitriou, apap@seas.upenn.edu
Office hour: Thursdays 6-7pm (Moore 100A)

Nikos Vasilakis, nvas@seas.upenn.edu
Office hour: Wednesdays 5-6pm (Levine 612)

Bolei Fu, bfu@seas.upenn.edu
Office hour: Tuesdays 3-4pm (Moore 207)

Xiaoyi Chen, xiaoyic@seas.upenn.edu
Office hour: Fridays 1-2pm (Moore 207)

Spencer Lee, lesp@seas.upenn.edu
Office hour: Mondays 3-4pm (Moore 207)

Lab sessions Moore 207, Fridays 4:00-6:00pm (not every week - watch for announcements on Piazza!). Transcripts and other materials are available here.
Course description This course focuses on the issues encountered in building Internet and web systems: scalability, interoperability (of data and code), atomicity and consistency models, replication, and location of resources, services, and data. Note that it is not about building database-backed or PHP/JSP/Servlet-based web sites (for this, see CIS 330/550 or NETS 212). Here, we will learn how a Servlet server itself is built!

We will examine how XML standards enable information exchange; how web services support cross-platform interoperability (and what their limitations are); how "cloud computing" services work; how to do replication and Akamai-like content distribution; and how application servers provide transaction support in distributed environments. We will study techniques for locating machines, resources, and data (including directory systems, information retrieval indexing and ranking, web search, and publish/subscribe systems); we will discuss collaborative filtering and mining the Web for patterns; we will investigate how different architectures support scalability (and the issues they face). We will also examine the ideas that have been proposed for tomorrow's Web, including the "Semantic Web", and see some of the challenges, research directions, and potential pitfalls.

An important goal of the course is not simply to discuss issues and solutions, but to provide hands-on experience with a substantial implementation project. This semester's project will be a peer-to-peer implementation of a Googe-style search engine, including distributed, scalable crawling; indexing with ranking; and even PageRank. We will also incorporate the use of topic-specific recognizers and mash-ups.

As a side effect of the material of this course, you will learn about some aspects of large-scale software development: assimilating large APIs, thinking about modularity, reading other people's code, managing versions, debugging, and so on.

CIS555 is now a core course for the MSE degree; for details, please see the MSE requirements. The Daily Pennsylvanian recently published a nice article about CIS455/555.

Format The format will be two 1.5-hour lectures per week, plus assigned readings from handouts. There will be regular homework assignments and a substantial implementation project with experimental validation and a report. There will also be a midterm and a final exam.
Prerequisites This course expects familiarity with threads and concurrency, as well as strong Java programming skills. Those highly proficient in another programming language, such as C++ or C#, should be able to translate their skills easily. The course will require a considerable amount of programming, as well as the ability to work with your classmates in teams.
Texts and readings Distributed Systems: Principles and Paradigms, 2nd ed, by Tanenbaum and van Steen, Prentice Hall
Additional materials will be provided as handouts or in the form of light technical papers.
Grading Homework 32%, midterm 15%, final exam 15%, project 33%, participation 5%.
Other resources We will be using Piazza for course-related discussions; please sign up here. A reading list is also available.
Assignments The homework assignments will be available here. If necessary, you can request an extension (requires PennKey login).
Final project Wondering what you will be able to do at the end of this class? Here is an example from Spring 2013:
 
70! result with preview
Brandon, Edward, Steven, and Mitchell
70! result with video
Brandon Krieger, Steven Krouse, Mitchell Stern, and Edward Wadsworth built "70!", a cloud-based search engine. 70! consists of 1) a scalable distributed crawler that runs on Amazon EC2 instances and uses FreePastry for coordination; 2) an indexer and a PageRank engine that is based on Elastic MapReduce; and 3) a web frontend. Hitchhiker can show previews of the results it returns, and it supports voice-controlled search; as a special feature, users can navigate the results by 'swiping' with their hands in front of a webcam. Google donated four Nexus cell phones as a prize for the best project, and each member of the 70! team received one of the phones.

Honorable mentions went to Team "Ozone" (Swati Goswami, Gokulkrishnan Ragunathan, Sriram Ramanujam, Harini Srinivasan), whose search engine had an impressive "staged" architecture and a stunning front-end design, and to Team "8ball" (Amit Bose, Vikrant Goel, David Gregson, Hari Rajaraman), whose solution supported image search via a separate crawler and offered lightning-fast results.

You can read more about previous Google Award winners and their projects in the CIS455/555 Hall of Fame.

NEW: Google has donated four Nexus 7 tablets for this year's award!

Schedule
Date Topic Details Reading Remarks
Jan 15 Introduction Principles of building systems
Project management & debugging tips
Lampson: Hints for Computer Systems Design  
Jan 20 MLK day -- no class
Jan 22 Classes canceled because of the snow
Jan 27 Server architectures Common server types: Web, application
Architectures: client/server, P2P, multi-tier
Threads, monitors, signals, producer-consumer
Thread pools, event-driven programming
Marshall: HTTP Made Really Easy
Krohn: OKWS paper
HW0, HW1
Jan 29 Krishnamurthy/Rexford Chapter 4
Tanenbaum 3.1
HW0 due
Feb 3 Naming & locating resources Naming and directories; search strategies
LDAP; DNS; DNSSEC
Wikipedia: DNS
Marshall: LDAP intro
 
Feb 5 Indexing Document indexing
B+ tree
Comer: The Ubiquitous B-Tree  
Feb 10 Representing data Data representations, schemas
JPEG, MP3, and QT
XML
XPath and XSLT
Doan, Halevy, Ives: XML  
Feb 12 XSLT Tutorial HW1 MS1 due
Feb 17 Decentralized systems Partly and fully decentralized systems
Key-based routing
Partitioning and consistent hashing
BitTorrent, Chord, Pastry
Druschel and Rodrigues: Peer-to-peer systems  
Feb 26 Stoica et al.: Chord  
Feb 19 Class canceled - Andreas' daughter is born  
Feb 28 Retrieving data Crawling basics
Publish-subscribe; collaborative filtering
Mercator; XFilter
Altinel and Franklin: XFilter
Heydon and Najork: High-Performance Web Crawling
HW1 MS2 due
Mar 3
Mar 5 Midterm HW2
Mar 10 Spring break -- no class
Mar 12
Mar 17 Storing, distributing, retrieving, and processing data Cloud file system
MapReduce programming model
Ghemawat et al.: The Google File System
Dean and Ghemawat: MapReduce
 
Mar 19 HW2 MS1 due
Mar 24 Storing, distributing, retrieving, and processing data Hadoop
HDFS
Shvachko: Apache Hadoop: The Scalability Update  
Mar 26 Code interoperability Remote procedure calls
Web services
SOAP, WSDL, REST
Service composition
XQuery
Tanenbaum chapters 4.2 and 10.3
SOAP tutorial
WSDL tutorial
HW2 MS2 due; HW3
Mar 31 XQuery tutorial  
Apr 2 Documents and ranking Information retrieval models
Web connectivity
Ranking
Web crawlers
HITS and PageRank
Baeza-Yates Chapters 2 and 8
Kleinberg: HITS
Brin and Page: PageRank
Brin and Page: Google
Wired article on Google
Form project groups
Apr 7  
Apr 9 The Cloud Utility computing model
AWS basics; EC2+EBS
Armbrust: A view of Cloud Computing HW3 due
Apr 14 Transactions Application server and TP monitor architectures
ACID properties
Two-phase commit
Tanenbaum chapters 8.5-8.6  
Apr 16 Fault tolerance Replicated state machines
Consensus; Paxos algorithm
Rational behavior and Byzantine faults
Lamport: Paxos (Alternative version)
Schneider: State Machine Approach
Project plan due
Apr 21 Security Web security
Views, ACLs, capabilities; crypto basics
Kerberos; TLS
Tanenbaum chapter 9  
Apr 23 Incremental processing Bigtable
Percolator
Peng and Dabek: Percolator  
Apr 28 Special topics Accountability
Differential privacy
Narayanan and Shmatikov: Robust Deanonymization  
Apr 30 Second midterm
 
Project demos and reports
Previous versions Spring 2013 |  Spring 2012 |  Spring 2011 (taught by Zachary Ives until 2010)