CIS 455 / 555: Internet and Web Systems (Fall 2019)

Instructor

Vincent Liu (liuv@seas)
Location: 574 Levine Hall
Office hours: Wednesdays 4:30 – 5:30pm

Time and location

Location: Towne 311
Mondays + Wednesdays 3:00pm – 4:30pm

Teaching assistants

Cameron Cabo, ccabo@seas
    Mondays 5:00 – 6:00pm, Levine 6th floor bump space
Rishab Jaggi, rjaggi@seas
    Tuesdays 5:00 – 6:00pm, Levine 5th floor bump space
Nofel Yaseen, nyaseen@seas
    Thursdays 11:00am – 12:00pm, Levine 6th floor bump space

Course description

This course focuses on the issues encountered in building Internet and web systems: scalability, interoperability (of data and code), atomicity and consistency models, replication, and location of resources, services, and data. Note that it is not about building database-backed or PHP/JSP/Node-based web sites (for this, see CIS 450/550 or NETS 212). This course is also not about the protocols that underpin these tasks (for that, see CIS 553). Here, we will learn how a web server itself is built!

Similarly, the course covers stream processors and "big data analytics" platforms like MapReduce, Apache Storm, Spark, etc. -- from the perspective of how they work. For details on using such systems, see CIS 545. Here you'll actually build such systems!

We will examine how XML standards enable information exchange; how web services support cross-platform interoperability (and what their limitations are); how "cloud computing" services work; how to do replication and Akamai-like content distribution; and how application servers provide transaction support in distributed environments. We will study techniques for locating machines, resources, and data (including directory systems, information retrieval indexing and ranking, web search, and publish/subscribe systems); we will discuss collaborative filtering and mining the Web for patterns; we will investigate how different architectures support scalability and distributed coordination (and the issues they face). We will also examine the ideas that have been proposed for tomorrow's Web, and see some of the challenges, research directions, and potential pitfalls.

An important goal of the course is not simply to discuss issues and solutions, but to provide hands-on experience with a substantial implementation project. This semester's project will be a peer-to-peer implementation of a Google-style search engine, including distributed, scalable crawling; indexing with ranking; stream processing; and even PageRank on your own MapReduce-style implementation!

As a side effect of the material of this course, you will learn about some aspects of large-scale software development: assimilating large APIs, thinking about modularity, reading other people's code, managing versions, debugging, and so on.

CIS555 is now a core course for the MSE degree as well as an option for the WPE I requirement for PhDs. The Daily Pennsylvanian published a nice article about CIS455/555.

Format

The format will be two 1.5-hour lectures per week, plus assigned readings from handouts. There will be regular homework assignments and a substantial implementation project with experimental validation and a report. There will also be two in-class midterms.

Prerequisites

This course expects familiarity with threads and concurrency, as well as strong Java programming skills. Those highly proficient in another programming language, such as C++ or C#, should be able to translate their skills easily. The course will require a considerable amount of programming, as well as the ability to work with your classmates in teams.

Texts and readings

Distributed Systems: Principles and Paradigms, 3rd edition, by Tanenbaum and van Steen, Prentice Hall (ISBN 978-1530281756).
You can buy a physical copy (e.g., for $35 on Amazon) or download a free digital copy here.

Additional materials will be provided as handouts or in the form of light technical papers.

Grading

Homework 32%, first midterm 15%, second midterm 15%, project 33%, participation 5%.

Other resources

We will be using Piazza for course-related discussions; please sign up here. A reading list is also available.

Assignments

The homework assignments will be available here. You can submit your solutions online (requires Google@SEAS login).

Final project

Wondering what you will be able to do at the end of this class? Here is an example from Spring 2017:
Hung, Hitali, Chirag, and Harsh

Example results from PennCH3
Searching with Alexa
The 2017 Google Award for the best final project went to Hung Nguyen, Chirag Shah, Hitali Sheth, and Harsh Verma for their "PennCH3" search engine. PennCH3 not only searches the web but also displays information from a variety of sources, including soccer scores, stock quotes, weather forecasts, and shopping results. Users can also submit searches using their Alexa-enabled devices. Under the hood, the system is highly scalable and uses replication for fault tolerance; the team (boldly) proved this by killing some of the nodes during a live demonstration. Google generously donated four Google Home devices as a prize, and each member of the PennCH3 team received one of the devices.

A honorable mention went to project "Q+A" (Mani Mahesh, Archith Shivanagere, Rishisingh Solanki, and Sanidhya Tiwari), whose solution featured location-specific results and used reinforcement learning to improve the results based on feedback from the users.

You can read more about previous Google Award winners and their projects in the CIS455/555 Hall of Fame.

Schedule

Schedule is in flux. Please check back regularly.
Date Topic Details Reading
Aug. 28 Introduction
[slides]
Principles of building systems
Project management & debugging tips
Homework 0 Out (Sep. 9)
– Lampson: Hints for Computer Systems Design
Sep. 2 No class (Labor Day)
Sep. 4 No class (Instructor Out)
Sep. 9 Distributed computing
[slides]
Distributed architectures
Homework 0 Due
Homework 1 Out (M1: Sep. 25; M2: Oct. 7)
– Tanenbaum 3.1
Sep. 11 Server architectures
[slides]
Threads, thread pools, signals, producer-consumer – Marshall: HTTP Made Really Easy
– Krohn: OKWS paper
Spark Framework
Sep. 16 HTTP Servers
[slides]
Monitors, Event-driven programming – Krishnamurthy/Rexford: Chapter 4
Sep. 18 Containers and Server Virtualization
[slides]
Virtualization, containerization, Docker, Kubernetes – Merkel: Docker
Sep. 23 Naming & locating resources
[slides]
Naming and directories, search strategies, LDAP, DNS – Wikipedia: DNS
– Marshall: LDAP intro
Sep. 25 Naming & locating resources
[slides]
Naming and directories, search strategies, LDAP, DNS
Homework 1 M1 Due
Sep. 30 Indexing
[slides]
Document indexing, B+ tree
– Comer: The Ubiquitous B-Tree
Oct. 2 Data Formats and Data Interchange
[slides]
Data representations, XML, DTDS and XML Schema, DOM, XPath – Doan, Halevy, Ives: XML
Oct. 7 Decentralization and DHTs
[slides]
Partitioning and consistent hashing
BitTorrent, Chord
Homework 1 M2 Due
– Stoica et al.: Chord
Oct. 9 Midterm 1 Homework 2 Out (M1: Oct. 23; M2: Nov. 6)
Oct. 14 Crawling
[slides]
Crawling basics, Publish-subscribe, collaborative filtering, Mercator, XFilter – Heydon and Najork: High-Performance Web Crawling
– Altinel and Franklin: XFilter
Oct. 16 Storing Data Distributed filesystems, Google GFS / HDFS – Ghemawat et al.: The Google File System
Oct. 21 Processing Data MapReduce programming model
Apache Spark - RDDs and more
– Dean and Ghemawat: MapReduce
– Zaharia et al.: Resilient Distributed Datasets
Oct. 23 Processing Data II Hadoop
Homework 2 M1 Due
– Shvachko: Apache Hadoop: The Scalability Update
Oct. 28 Code Interoperability Remote procedure calls
Web services
SOAP, WSDL, REST
Service composition
– Tanenbaum chapters 4.2 and 10.3
Oct. 30 Documents and Ranking Information retrieval models
Web connectivity
Ranking
An Introduction to Information Retrieval: Chapters 1, 2, and 6
Nov. 4 Unconfirmed: No class (Instructor out)
Nov. 6 Documents and Ranking II Web crawlers
HITS and PageRank
Homework 2 M2 Due
Final Project Team and Plan (Nov. 11)
Homework 3 Out (Nov. 18)
– Kleinberg: HITS
– Brin and Page: PageRank
– Brin and Page: Google
Wired article on Google
Nov. 11 The Cloud Utility computing model
AWS basics; EC2+EBS
– Armbrust: A view of Cloud Computing
Nov. 13 Transactions Application server and TP monitor architectures
ACID properties
Two-phase commit
Project Plan Due
– Tanenbaum chapters 8.5-8.6
Nov. 18 Fault Tolerance Replicated state machines
Consensus; Paxos algorithm
Rational behavior and Byzantine faults
– Lamport: Paxos (Alternative version)
– Schneider: State Machine Approach
Nov. 20 Security Web security
Views, ACLs, capabilities; crypto basics
Kerberos; TLS
Homework 3 Due
– Tanenbaum chapter 9
Nov. 25 Incremental Processing Bigtable
Percolator
– Peng and Dabek: Percolator
Nov. 27 No class (Fri Schedule / Thanksgiving Break)
Dec. 2 Special Topics
Dec. 4 Special Topics
Dec. 9 Midterm 2