CIS 455 / 555: Internet and Web Systems (Spring 2020)

Instructor Andreas Haeberlen
Office hour: Fridays 9–10am
Zoom link
Time and location Location: Zoom (see Piazza for link and password); formerly Berger Auditorium
Mondays + Wednesdays 10:30am – noon
Teaching assistants
Azzam Althagafi
Office hour: Thursdays noon-1pm (Zoom)

Han Yan
Office hour: Wednesdays 5:30-6:30pm (Zoom)

Saniyah Shaikh
Office hour: Saturdays 3:00-4:00pm (Zoom)

Rishab Jaggi
Office hour: Mondays 5:00-6:00pm (Zoom)

Varad Deshpande
Office hour: Tuesdays 3:00-4:00pm (Zoom)

Srinath Rajagopalan
Office hour: Fridays 3:00-4:00pm (Zoom)
Course description This course focuses on the issues encountered in building Internet and web systems: scalability, interoperability (of data and code), atomicity and consistency models, replication, and location of resources, services, and data. Note that it is not about building database-backed or PHP/JSP/Node-based web sites (for this, see CIS 450/550 or NETS 212). This course is also not about the protocols that underpin these tasks (for that, see CIS 553). Here, we will learn how a web server itself is built!

Similarly, the course covers stream processors and "big data analytics" platforms like MapReduce, Apache Storm, Spark, etc. -- from the perspective of how they work. For details on using such systems, see CIS 545. Here you'll actually build such systems! We will examine how XML standards enable information exchange; how web services support cross-platform interoperability (and what their limitations are); how "cloud computing" services work; how to do replication and Akamai-like content distribution; and how application servers provide transaction support in distributed environments. We will study techniques for locating machines, resources, and data (including directory systems, information retrieval indexing and ranking, web search, and publish/subscribe systems); we will discuss collaborative filtering and mining the Web for patterns; we will investigate how different architectures support scalability and distributed coordination (and the issues they face). We will also examine the ideas that have been proposed for tomorrow's Web, and see some of the challenges, research directions, and potential pitfalls.

An important goal of the course is not simply to discuss issues and solutions, but to provide hands-on experience with a substantial implementation project. This semester's project will be a peer-to-peer implementation of a Google-style search engine, including distributed, scalable crawling; indexing with ranking; stream processing; and even PageRank on your own MapReduce-style implementation!

As a side effect of the material of this course, you will learn about some aspects of large-scale software development: assimilating large APIs, thinking about modularity, reading other people's code, managing versions, debugging, and so on.

CIS 555 is now a core course for the MSE degree, as well as an option for the WPE-I requirement for PhD students. If you are taking the course for your WPE-I, please let Andreas know as soon as you register. The exam-based option is not available for CIS 555.

The Daily Pennsylvanian published a nice article about CIS 455/555.

Format The format will be two 1.5-hour lectures per week, plus assigned readings from handouts. There will be regular homework assignments and a substantial implementation project with experimental validation and a report. There will also be two in-class midterms.
Prerequisites This course expects familiarity with threads and concurrency, as well as strong Java programming skills. Those highly proficient in another programming language, such as C++ or C#, should be able to translate their skills easily. The course will require a considerable amount of programming, as well as the ability to work with your classmates in teams.
Texts and readings Distributed Systems: Principles and Paradigms, 3rd edition, by Tanenbaum and van Steen, Prentice Hall (ISBN 978-1530281756)
You can buy a physical copy (e.g., for $35 on Amazon) or download a free digital copy here.
Additional materials will be provided as handouts or in the form of light technical papers.
Grading Homework 32%, first midterm 15%, second midterm 15%, project 33%, participation 5%.
Other resources We will be using Piazza for course-related discussions; please sign up here. A reading list is also available.
Assignments The homework assignments will be available here. You can submit your solutions online (requires PennKey login).
Final project Wondering what you will be able to do at the end of this class? Here is an example from an earlier class:
[Images: example results from PennCH3; the team (Hung, Hitali, Chirag, and Harsh); searching with Alexa]
The "PennCH3" search engine, which was built by Hung Nguyen, Chirag Shah, Hitali Sheth, and Harsh Verma, not only searched the web but also displayed information from a variety of sources, including soccer scores, stock quotes, weather forecasts, and shopping results. Users were able to submit searches using their Alexa-enabled devices. Under the hood, the system was highly scalable and used replication for fault tolerance; the team (boldly) proved this by killing some of the nodes during a live demonstration. Google donated four Google Home devices as a prize for this project.

Date Topic Details Reading Remarks
Jan 15 Introduction Principles of building systems
Project management & debugging tips
Lampson: Hints for Computer Systems Design  
Jan 20 MLK day — no class
Jan 22 Server architectures
(taught by Prof. Ives)
Common server types: Web, application
Architectures: client/server, P2P, multi-tier
Marshall: HTTP Made Really Easy
Tanenbaum 3.1
Jan 27 No class (Andreas traveling)
Jan 29 Server architectures (contd.)
(taught by Prof. Ives)
Threads, monitors, signals, producer-consumer
Thread pools, event-driven programming
Krishnamurthy/Rexford Chapter 4
Krohn: OKWS paper
Feb 3 Server architectures Continued from last time    
Feb 5 Virtualization Virtualization
Union filesystems; containers
Merkel: Docker HW0 due
Feb 10 Naming & locating resources Naming and directories; search strategies
Wikipedia: DNS
Marshall: LDAP intro
Feb 12 Indexing Document indexing
B+ tree
Comer: The Ubiquitous B-Tree HW1 MS1 due (on Feb 14)
Feb 17 Data formats and data interchange Data representations
DTDs and XML Schema; DOM
Doan, Halevy, Ives: XML
Feb 19  
Feb 24 Decentralized systems Partly and fully decentralized systems
Key-based routing
Druschel and Rodrigues: Peer-to-peer systems  
Feb 24 Last day to drop
Feb 26 Key-based routing Partitioning and consistent hashing
BitTorrent, Chord
Stoica et al.: Chord HW1 MS2 due; HW2
Mar 2 Retrieving data Crawling basics
Publish-subscribe; collaborative filtering
Mercator; XFilter
Altinel and Franklin: XFilter
Heydon and Najork: High-Performance Web Crawling
Mar 4 First midterm
Mar 7–15 Spring break — no class
Mar 16–22 Spring break, Episode II (due to COVID-19) — still no class
Mar 23 Storing data (MP4) Cloud file system Ghemawat et al.: The Google File System
Mar 25 Processing data (MP4) MapReduce programming model Dean and Ghemawat: MapReduce HW2 MS1 due (on Mar 27); HW3
Mar 30 Processing data (contd.) (MP4) Hadoop Shvachko: Apache Hadoop: The Scalability Update  
Mar 30 Last day to withdraw
Apr 1 Code interoperability (MP4) Remote procedure calls
Web services
Service composition
Tanenbaum chapters 4.2 and 10.3 Form project groups; HW2 MS2 due (on Apr 3)
Apr 6 Documents and ranking (MP4) Information retrieval models
Web connectivity
An Introduction to Information Retrieval, Chapters 1, 2, and 6  
Apr 8 Documents and ranking (contd.) (MP4) HITS and PageRank
Search engine design
Kleinberg: HITS
Brin and Page: PageRank
Brin and Page: Google
Wired article on Google
HW3 and project plan due (on Apr 10)
Apr 13 The Cloud (MP4) Utility computing model
AWS basics; EC2+EBS
Armbrust: A view of Cloud Computing  
Apr 15 Transactions (MP4) Application server and TP monitor architectures
ACID properties
Two-phase commit
Tanenbaum chapters 8.5-8.6  
Apr 20 Fault tolerance (MP4) Replicated state machines
Consensus; Paxos algorithm
Rational behavior and Byzantine faults
Lamport: Paxos (Alternative version)
Schneider: State Machine Approach
Apr 22 Security (MP4) Web security
Views, ACLs, capabilities; crypto basics
Kerberos; TLS
Tanenbaum chapter 9  
Apr 27 Incremental processing (MP4) Bigtable
Peng and Dabek: Percolator  
Apr 29 Second midterm
May 4–12
Project demos (via Zoom) and reports