
CIS 455 / 555: Internet and Web Systems (Spring 2012)
|
| Instructor |
|
Andreas Haeberlen
Location: 560 Levine Hall North (a.k.a. GRW building)
Office hours: Thursdays 1:00-2:00pm
|
|
| Location |
|
Skirkanich auditorium
Mondays + Wednesdays 9:30am-11:00am
|
|
| Announcements |
|
This year's award for the best final project goes to
Project "Hitchhiker"
Dhruv Arya, Santhosh Kumar Balakrishnan, Saurabh Garg, Chetan Singh
|
[Images: Hitchhiker's homepage; Results for "Apple"; Dhruv, Santhosh, Saurabh, and Chetan; More search options; Page preview]
|
Dhruv, Santhosh, Saurabh, and Chetan built "Hitchhiker", a cloud-based
search engine. Hitchhiker consists of 1) a scalable distributed crawler that runs on
Amazon EC2 instances and uses FreePastry for coordination; 2) an indexer and a PageRank
engine that is based on Elastic MapReduce; and 3) a web frontend. Hitchhiker also
contains a number of extra features, including page previews, a "safe search"
to filter out explicit results, and a special search for the visually challenged,
which enables the user to control the search entirely with spoken commands.
Google donated four Nexus cell phones as a
prize for the best project, and each member of the Hitchhiker team received one of
the phones.
Honorable mentions go to the following teams (in no particular order):
|
Project "The Omniscient Search Engine"
Pratikkumar Patel, Yat Ming Ho, Tianming Zheng, Jiehua Zhu
|
Project "Splend"
Mengyao Chai, Ruogu Hu, Zhou Tan, Chen Yang
|
Congratulations to the winning teams!
|
|
| Teaching assistants |
|
Mingchen Zhao, mizhao@cis.upenn.edu Office hour:
Fridays 12:30-1:30pm (Levine 612)
Arjun Narayan, narayana@cis.upenn.edu Office hour:
Wednesdays 11:00am-noon (Levine 512)
Prakashkumar Thiagarajan, tprak@seas.upenn.edu Office hour:
Mondays 4:00-5:00pm (Levine 612)
Jizhi Hu, hujizhi@seas.upenn.edu Office hour:
Tuesdays 3:00-4:00pm (Levine 612)
Siyin Gu, gusiyin@seas.upenn.edu Office hour:
Mondays 3:00-4:00pm (Moore 207)
Yue Ning, yning@seas.upenn.edu Office hour:
Thursdays 3:00-4:00pm (Levine 512)
|
|
| Course description |
|
This course focuses on the issues encountered in building Internet and web systems:
scalability, interoperability (of data and code), atomicity and consistency models,
replication, and location of resources, services, and data. Note that it is not
about building database-backed or PHP/JSP/Servlet-based web sites (for this, see
CIS 330/550 or
MKSE 212).
Here, we will learn how a Servlet server itself is built!
We will examine how XML standards enable information exchange; how web services
support cross-platform interoperability (and what their limitations are); how
"cloud computing" services work; how to do replication and Akamai-like
content distribution; and how application servers provide transaction support in
distributed environments. We will study techniques for locating machines, resources,
and data (including directory systems, information retrieval indexing and ranking,
web search, and publish/subscribe systems); we will discuss collaborative filtering
and mining the Web for patterns; we will investigate how different architectures
support scalability (and the issues they face). We will also examine the ideas that
have been proposed for tomorrow's Web, including the "Semantic Web", and
see some of the challenges, research directions, and potential pitfalls.
An important goal of the course is not simply to discuss issues and solutions,
but to provide hands-on experience with a substantial implementation project.
This semester's project will be a peer-to-peer implementation of a Google-style
search engine, including distributed, scalable crawling; indexing with ranking;
and even PageRank. We will also incorporate the use of topic-specific recognizers
and mash-ups.
As a side effect of the material of this course, you will learn about some aspects
of large-scale software development: assimilating large APIs, thinking about
modularity, reading other people's code, managing versions, debugging, and so on.
|
|
| Format |
|
The format will be two 1.5-hour lectures per week, plus assigned readings from
handouts. There will be regular homework assignments and a substantial
implementation project with experimental validation and a report. There will
also be a midterm and a final exam.
|
|
| Prerequisites |
|
This course expects familiarity with threads and concurrency, as well as strong
Java programming skills. Those highly proficient in another programming language,
such as C++ or C#, should be able to translate their skills easily. The course
will require a considerable amount of programming, as well as the ability to work
with your classmates in teams.
|
|
| Texts and readings |
|
Distributed Systems: Principles and Paradigms, 2nd ed., by Tanenbaum and van Steen, Prentice Hall
Additional materials will be provided as handouts or in the form of light technical papers.
|
|
| Grading |
|
Homework 25%, midterm 15%, final exam 15%, project 40%, participation 5%.
|
|
| Other resources |
|
We will be using Piazza for course-related discussions; please sign up
here. A reading list is also available.
|
|
| Assignments |
|
Assignments are available here.
|
|
| Schedule |
|
| Date | Topic | Details | Reading | Remarks |
| Jan 11 | Introduction | Principles of building systems; Project management & debugging tips | Lampson: Hints for Computer Systems Design | |
| Jan 16 | MLK day -- no class | | | |
| Jan 18 | Server architectures | Common server types: Web, application; Architectures: client/server, P2P, multi-tier; Threads, monitors, signals, producer-consumer; Thread pools, event-driven programming | Marshall: HTTP Made Really Easy; Krohn: OKWS paper | HW0 |
| Jan 23 | | | Krishnamurthy/Rexford Chapter 4; Tanenbaum 3.1 | HW0 due; HW1 |
| Jan 25 | Naming & locating resources | Naming and directories; search strategies; LDAP; DNS; DNSSEC | Wikipedia: DNS; Marshall: LDAP intro | |
| Jan 30 | Indexing | Document indexing; B+ tree | Comer: The Ubiquitous B-Tree | |
| Feb 1 | Representing data | Data representations, schemas; JPEG, MP3, and QT; XML; XPath and XSLT | Doan, Halevy, Ives: XML | |
| Feb 6 | | | XSLT Tutorial | HW1 MS1 due |
| Feb 8 | Decentralized systems | Partly and fully decentralized systems; Key-based routing; Partitioning and consistent hashing; BitTorrent, Chord, Pastry | Druschel and Rodrigues: Peer-to-peer systems | |
| Feb 13 | | | Stoica et al.: Chord | |
| Feb 15 | Retrieving data | Crawling basics; Publish-subscribe; collaborative filtering; Mercator; XFilter | Altinel and Franklin: XFilter; Heydon and Najork: High-Performance Web Crawling | |
| Feb 20 | Storing, distributing, retrieving, and processing data | Cloud file system; MapReduce programming model | Ghemawat et al.: The Google File System; Dean and Ghemawat: MapReduce | HW1 MS2 due |
| Feb 22 | | | | |
| Feb 27 | Midterm | | | |
| Feb 29 | Storing, distributing, retrieving, and processing data | Hadoop; HDFS | Shvachko: Apache Hadoop: The Scalability Update | HW2 MS1 due (on Mar 7) |
| Mar 5 | Spring break -- no class | | | |
| Mar 7 | | | | |
| Mar 12 | Code interoperability | Remote procedure calls; Web services; SOAP, WSDL, REST; Service composition; XQuery | Tanenbaum chapters 4.2 and 10.3; SOAP tutorial; WSDL tutorial | Form project groups |
| Mar 14 | | | | |
| Mar 19 | | | XQuery tutorial | |
| Mar 21 | No class (Andreas in New York for USENIX ATC PC meeting) | | | HW2 MS2 due |
| Mar 26 | Documents and ranking | Information retrieval models; Web connectivity; Ranking; Web crawlers; HITS and PageRank | Baeza-Yates Chapters 2 and 8; Kleinberg: HITS; Brin and Page: PageRank; Brin and Page: Google; Wired article on Google | HW3 |
| Mar 28 | | | | |
| Apr 2 | The Cloud | Utility computing model; AWS basics; EC2+EBS | Armbrust: A view of Cloud Computing | |
| Apr 4 | Transactions | Application server and TP monitor architectures; ACID properties; Two-phase commit | Tanenbaum chapters 8.5-8.6 | HW3 due (on Apr 5) |
| Apr 9 | Fault tolerance | Replicated state machines; Consensus; Paxos algorithm; Rational behavior and Byzantine faults | Lamport: Paxos (Alternative version); Schneider: State Machine Approach | |
| Apr 11 | Security | Web security; Views, ACLs, capabilities; crypto basics; Kerberos; TLS | Tanenbaum chapter 9 | Project proposal due |
| Apr 16 | Incremental processing | Bigtable; Percolator | Peng and Dabek: Percolator | |
| Apr 18 | Special topics | Accountability; Differential privacy | | |
| Apr 23 | Second midterm | | | |
| | Project demos and reports | | | |
|
|
|
|
| Previous versions |
|
Spring 2011 (taught by Zachary Ives until 2010)
|