CIS 4550/5550: Internet and Web Systems

CIS 4550/5550: Internet and Web Systems (Spring 2024)

This course focuses on the issues encountered in building Internet and Web systems, such as scalability, interoperability, consistency, replication, fault tolerance, and security. We will examine how services like Google or Amazon handle billions of requests from all over the world each day, (almost) without failing or becoming unreachable. We will study how to collect massive-scale data sets, how to process them, and how to extract useful information from them, and we will have a look at the massive, heavily distributed infrastructure that is used to run these services and similar cloud-based services today.

An important feature of the course is that we will not just discuss issues and solutions but also provide hands-on experience, using web search as our case study. There will be several substantial implementation projects throughout the semester, each of which will focus on a particular component of the search engine, such as frontend, storage, crawler, or indexer. The final project will be to build a Google-style search engine, and to deploy and run it on the cloud.

Note that this is NOT a course on web design or web application development! Instead of learning how to use a web server such as Apache or a scalable analytics system such as Spark, we will build our own small web server and a mini-"Spark" from scratch. As a side effect, you will learn about some aspects of large-scale software development, such as working with APIs and specifications, thinking about modularity, reading other people's code, managing versions, and debugging.

CIS 5550 is now a core course for the MSE degree as well as an option for the WPE I requirement for PhD students. The Daily Pennsylvanian published a nice article about this course.

Instructor

Vincent Liu
Office hours: Wed 2:00 - 3:00 pm (Levine 574)

Teaching assistants

Office hours will temporarily be held on OHQ until room reservations go through.

Bingqing Fan fbqing@seas.upenn.edu OH: Mon 10:00 am - 12:00 pm ET @ OHQ

Cyrus Singer cysinger@seas.upenn.edu OH: Mon 1:00 - 3:00 pm ET @ Levine 3rd floor bump space

Jinwei Bi bijinwei@seas.upenn.edu OH: Mon 7:00 - 9:00 pm ET @ OHQ

Jinhui Luo jinhuil@seas.upenn.edu OH: Tue 3:30 - 5:30 pm ET @ OHQ

Charles Cheng chacheng@seas.upenn.edu OH: Tue 7:00 - 9:00 pm ET @ OHQ

Emily Shang emshg@seas.upenn.edu OH: Wed 10:00 am - 12:00 pm ET @ Levine 3rd floor bump space

Tanvi Dadu tdadu@seas.upenn.edu OH: Wed 6:00 - 8:00 pm ET @ OHQ

Xinran Wang xrwang@seas.upenn.edu OH: Thu 1:00 - 3:00 pm ET @ Levine 501 bump space

Kebin Yan yankebin@seas.upenn.edu OH: Thu 3:00 - 5:00 pm ET @ Levine 601 bump space

Zhengyi Xiao zxiao98@seas.upenn.edu OH: Fri 11:00 am - 1:00 pm ET @ Levine 601 bump space

Ziyu Wang wangziyu@seas.upenn.edu OH: Sat 1:00 - 3:00 pm ET @ OHQ

Format

The format will be two 1.5-hour lectures per week, plus assigned readings. There will be regular homework assignments, two in-class midterms, and a substantial implementation project with experimental validation and a report.

Time and location

Mondays and Wednesdays 3:30-5:00pm (LLAB 10)

Prerequisites

This course expects familiarity with threads and concurrency, as well as strong Java programming skills. Those highly proficient in another programming language, such as C++ or C#, should be able to translate their skills easily. The course will require a considerable amount of programming, as well as the ability to work with your classmates in teams.

Textbooks

Distributed Systems: Principles and Paradigms, 3rd edition, by Tanenbaum and van Steen, Prentice Hall (ISBN 978-1530281756).
You can buy a physical copy (e.g., for $35 on Amazon) or download a free digital copy here.

Additional materials will be provided as handouts or in the form of light technical papers.

Grading

Homework 40%, Term project 25%, Exams 30%, Participation 5%

Policies

You can find a list of key course policies here.

Assignments

Homework assignments are available for download. Please join the discussion group as well!

Tentative schedule

Date	Topic	Details	Reading	Remarks
22-Jan	Introduction [Slides]	Introduction Overview Logistics Policies		HW0 released
24-Jan	Internet basics [Slides]	The Internet Interdomain routing; BGP; valley-free Path properties TCP and UDP Socket basics; echo server
29-Jan	The Web [Slides]	The Web; hyperlinks; history of the Web Client-server model HTTP/1, TLS HTML/CSS basics HTTP/2	Lampson: "Hints for Computer System Design"; "Introduction to HTTP/2"	HW0 due; HW1 released
31-Jan	Scalability [Slides]	Parallelization Consistency Mutual exclusion; locking; deadlocks NUMA and Shared-Nothing Frontend-backend, Sharding	Vogels: "Eventually Consistent"
5-Feb	Dynamic content [Slides]	Motivation: Dynamic content Routes Managing state; cookies; sessions Tracking; business model of the web	Spark Framework Overview
7-Feb	The Client Side [Slides]	JavaScript DOM	MDN: A reintroduction to JavaScript	HW1 due; HW2 released
12-Feb	The Client Side (cont.) [Slides]	Dynamic requests AJAX
14-Feb	Naming [Slides]	Name spaces and directories DNS architecture Security issues with DNS DNSSEC, DANE	Globally Distributed Content Delivery	HW2 due; HW3 released
19-Feb	The Cloud [Slides]	Data centers Cloud computing Types of clouds History of Cloud Computing Case study: EC2	Armbrust et al.: "A View of Cloud Computing"
21-Feb	RPCs [Slides]	Web services; APIs; API examples Remote procedure calls Handling RPC failures Data interchange XML	Chapter 4.2 in the Tanenbaum book	HW3 due; HW4 released
26-Feb	Key-value Stores [Slides]	Key-value stores KVS on the Cloud Sharding and coordination Case study: S3 Case study: DynamoDB	Cooper et al.: "PNUTS to Sherpa: Lessons from Yahoo!'s Cloud Database"
27-Feb	Last day to drop
28-Feb	First midterm exam (HW4 due Mar 1; HW5 released)
Mar 2-10	Spring Term Break
11-Mar	Basic fault tolerance [Slides]	Faults and fault models Primary-backup replication	Chapter 7.5 in the Tanenbaum book
13-Mar	Basic fault tolerance (cont) [Slides]	Availability and Durability The CAP theorem Quorum replication		HW5 due; HW6 released
18-Mar	Scalable Analytics [Slides]	Introduction to scalable analytics MapReduce The Streams API Apache Spark Lambdas and serialization	Zaharia et al.: "Spark: Cluster Computing with Working Sets"
20-Mar	Spark basics [Slides]	Spark jobs Working with files Spark transformations Spark actions The Structured API	Zaharia et al.: "Resilient Distributed Datasets"	HW6 due; HW7 released; Project handout released
25-Mar	Spark continued [Slides]	HDFS Apache Livy Distributed shared variables Graph algorithms in Spark	Shvachko: "Apache Hadoop: The Scalability Update"
22-Mar	Last day to pass/fail
27-Mar	Crawling [Slides]	Structure of the Web Crawling basics SEO Crawler etiquette	Heydon and Najork: "Mercator: A scalable, extensible Web crawler"	HW7 due; HW8 released; Team registrations due; project begins
1-Apr	Information retrieval [Slides]	Basic IR model; precision/recall Boolean model Vector model TF/IDF Stemming and lemmatization	Chapter 1 in "An Introduction to Information Retrieval"
2-Apr	Last day to withdraw
3-Apr	Authoritativeness [Slides]	Motivation: off-page features HITS PageRank Sinks and hogs	Brin and Page: "The PageRank Citation Ranking: Bringing Order to the Web"	HW8 due; HW9 released
8-Apr	Search engines [Slides]	Building a search engine Case study: Google Case study: Mercator Project overview Modern search	Brin and Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine"
10-Apr	Decentralized systems [Slides]	Centralization and its effects Partly centralized systems Unstructured overlays Structured overlays	Druschel and Rodrigues: "Peer-to-Peer Systems"	HW9 due
15-Apr	Key-based routing; DHTs [Slides]	Consistent hashing and DHTs Key-based routing Basic Chord Fault tolerance in Chord KBR and security	Stoica et al.: "Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications"
17-Apr	Advanced Fault Tolerance [Slides]	Non-crash fault models	Schneider: "Implementing Fault-Tolerant Services Using the State Machine Approach"
22-Apr	Advanced Fault Tolerance (cont.) [Slides]	State-machine replication Paxos The Byzantine Generals Problem Byzantine Fault Tolerance	Schneider: "Implementing Fault-Tolerant Services Using the State Machine Approach"
24-Apr	Advanced Fault Tolerance (cont.)	State-machine replication	Schneider: "Implementing Fault-Tolerant Services Using the State Machine Approach"
29-Apr	Security	Threat models Crypto basics Digital signatures Attacks and Defenses	OWASP Top 10
1-May	Second midterm exam
May 2-5	Reading days
May 6-14	Finals period (in-person project demos)