Problem Set 2

CSE 112: Networked Life, Spring 2004

Handed out Feb 17 - Due 9:00 am Feb 26

This problem set has seven problems for a total of 100 points. All problems except the first have several parts. As before, this assignment is to be handed in at the beginning of class on paper, not submitted by email.

  1. (10 points)

    You know of the Friendster network from class. Orkut is another network that is similar in many ways, but currently, membership is by invitation only. You must indicate that someone already in the network is your friend in order to join. A large number of people joined right after the beta version of the service was announced. Until late January is was impossible to "unfriend" someone — you could not remove your link to someone whom you had previously indicated as a friend. You also cannot could not close your account once you had joined.

    Given only the information above what is known about the minimum degree of the graph? What is known about the maximum path length? Briefly justify your answers. What we can say for certain is this: the minimum degree is at least one, and the maximum path length is finite. In both cases, this is because the graph is connected. To join, you must be connected to an existing node, and links and nodes cannot be removed. These statements are adequate for full credit. We could also say, more specifically, that the maximum path length does not exceed the total number of people who joined minus one. Finally, we could say that the minimum degree is not more than the total number of people who joined minus one.

  2. (14 points)
    1. Describe an algorithm which generates an Erdös-Renyí random graph with n vertices and m edges. The algorithm you are to write must begin with N disconnected vertices and add only one edge at a time. The class of random graphs that are generated by this process should be one in which the distribution of degree is uniform and every possible link is equally likely. No real programming language syntax is required in your answer, although you can use pseudocode if you like. Just make it completely clear what is to be done at each step. You should end up with a set V of N vertices (which will be the same one that you start with) and a set E of m edges. If your algorithm can possibly add the same edge twice or can add a loop (an "edge" between a vertex and itself), it isn't correct. Several correct answers are possible, e.g.: [Example 1] Begin with V containing the N vertices, E empty, and the set P which initially contains all choose(N,2) sets of two vertices — defining all possible edges. For m steps, remove an element from P uniformly at random and add it to V. [Example 2] Begin with V containing the N vertices and E empty. Until there are m elements in E, do the following: Pick vertex a uniformly at random from V, and pick vertex b uniformly at random from V. If a=b (that is, they are the same vertex) or the edge (a,b) is already in E, do nothing; otherwise, add the edge (a,b) to E. [Example 3, just for fun] Begin with V containing the N vertices, and E containing all choose(N,2) sets of two vertices — (V,E) is a complete graph. For (choose(n,2) - m) steps, remove an element from E uniformly at random.

    2. Simulate your algorithm for N=6 and m=10. You can use random.org to get random numbers. Write down what the graph looks like in all eleven configurations: m=0 ... 10. E.g., edges could be added in this order: (1, 6) (1, 4) (2, 5) (1, 5) (3, 6) <- this edge connects the graph (1, 3) (5, 6) (3, 5) (3, 4) (4, 5) The point of your doing the simulation is to make sure that your algorithm above actually works. Some algorithms are easier to simulate than others; you can request all the numbers from random.org at once if you use the algorithm in example 2, for instance, while you will have to make requests one at a time using the algorithms in example 1 or 3.

    3. Is your graph connected? If so, at what step did it become connected? In the example above, yes; the fifth edge connects it. It is extremely unlikely, but possible, to generate a graph that is not fully connected.

  3. (20 points)

    Generate a graph using the Barabási-Albert preferential attachment model, which Watts calls preferential growth (see pages 108-111). Begin with a pair of nodes with a single link between them. Go for eight steps. At each step, add one node and a single link connecting it to the previous graph.

    1. Draw each step, writing down the appropriate probability next to each node to indicate how likely it is that the new node will attach there. (Note: When you have a set of preferences for different options, for instance, things you prefer with weights 5, 2, and 1, you convert them into probabilities by summing them, 5+2+1 = 8, and dividing each one by the sum: 5/8, 2/8, 1/8. Again, you can use random.org as a source of random numbers.) One way to use random.org to do this involves numbering "both ends" of all the edges, assigning the numbers 1, 2, 3, ... 2*m. In the following picture these numbers are shown in red.

      10-vertex graph

      Then generate a random number in the range 1 to 2*m, where 2*m is the highest one of these red numbers. Now add a new node and connect it with a new edge to the node next to the random number just drawn. The probability of each node to be connected to the next node is obtained by dividing its degree by twice the number of edges in the graph.

      10-vertex graph

    2. What is the degree sequence of your final graph? There are 2 nodes with degree 3, 2 nodes with degree 2 and 4 nodes with degree 1, so the degree sequence is 3, 3, 2, 2, 1, 1, 1, 1.

    3. What is the expected degree sequence of a graph with this number of edges and this number of vertices that is generated in this manner?Approximately 2.9326, 2.9326, 1.9551, 1.5641, 1.3406, 1.1917, 1.0833.

  4. (12 points)

    Begin by drawing a regular graph: arrange 12 nodes in a circle and connect each node to the four nodes nearest to it — that is, the two nodes to one side of the node and the two nodes to the other side along the circle.

    1. Compute the average distance between all pairs of different nodes. As the graph is symmetric, it is enough to compute the average distance only from one node to all the others. (1+1+2+2+3+3+3+2+2+1+1)/11=21/11 (In the graph below the red numbers display the distance of each node to vertex 1)

      10-vertex graph

    2. Make a single rewiring which minimizes the average distance. One correct answer is to remove edge (1,2) and add an edge (1,7).

    3. Why is it now much harder to compute the average distance on this modified graph?By rewiring the graph we destroyed the symmetry.

  5. (12 points)

    In this question, we consider the World Wide Web as a graph, with HTML Web pages being vertices and hyperlinks being edges. For this problem, we'll consider only HTML Web pages (not Flash files or PDFs or anything else) as vertices and only plain, static hyperlinks (not Google's search facility or any other fancy way of linking) as edges. If a page A has one or more hyperlinks to page B, we consider that there is a single directed edge from page A to page B; it does not matter whether there is one link or ten. We'll ignore any links from a page to itself, and any links that lead to or from things other than HTML Web pages.

    1. Is the graph of the Web, as described above:

      • complete? No, this would only be the case if EVERY Web page had a link to EVERY other Web page. This is certainly not the case.
      • acyclic? No, Nick's Homepage links to Michael Kearns' Homepage and vice versa. Hence, there is a cycle.
      • of uniform out-degree distribution, over the entire range of possible out-degrees? No. It would take some serious work, and a large database of explored Web page, to try to determine what the out-degree distribution really is, but the distribution definitely isn't uniform. Consider a Web page that has the maximum number of hyperlinks of all pages on the Web; call this number X. If out-degree distribution were uniform, there would be as many pages with X links as there were pages with 15 links (e.g., Kilian's home page and many other home pages) or 3 links (such as many pages that only have "previous / up / next" links). And there would be about as many pages with X-1 links, and about as many with X-2, etc. This would be pretty outrageous, since there are many, many pages with small numbers of links and not so many with huge numbers.
    2. One page on the Web, and one node in our graph, is this homework page itself: http://www.cis.upenn.edu/~mkearns/teaching/NetworkedLife/homework2.html

      • What is this node's out-degree? 1 (Note that this page has a different number of links since we added some in the solution!)
      • What is its in-degree? Unknown, but at least 1, since the main Networked Life page links to it.
      • How did you determine its out-degree and in-degree? Out-degree: Look at the page, or look at the HTML source if you like. There are links to random.org. A search for "href" on the HTML will confirm that these are the only links. In-degree: There is no certain method, but we got here from one link, so we know there is one. You could Google for "link:http://www.cis.upenn.edu/~mkearns/teaching/NetworkedLife/homework2.html" to see if there are other link's in Google's database, but there probably aren't; and whatever the results there, there may be others (e.g., public "bookmark" pages made by students).
    3. You will now do a search on the Web graph, beginning at the node above. You are to only use local information (the HTML page you are examining at each step) to determine where to go next. Do no use Google or any other Web resource to direct your search: simply pick the link that seems most likely to take you to your destination, while looking only at the current page. Your target is the official Web site of Janet Jackson.

      For ten steps, or until you reach Janet-Jackson.com, if you reach the site in fewer than ten steps, record all of the following information:

      • The URL of the link you followed.
      • A phrase or sentence explaining why you followed that link.

      Note that you do not need to reach Janet-Jackson.com to get full credit for this part of the problem. You just need to record all the links that you followed and why you followed them. If you hit a dead end of any sort or seem to be on an unhelpful path, you can backtrack and start from an earlier point on your path; just note clearly that you are doing this. This is an example (note that I mistakenly started from the solution page rather than the homework page, but this still provides an example) — http://www.cis.upenn.edu/~mkearns: I know Prof. Kearns' home page has many links to different organizations and articles. I followed a link to Time Digital hoping to find a news item with a link: http://www.time.com/time/digital/magazine/articles/0,4753,45789,00.html. The article had expired, but there was a link to the main Time page: http://www.time.com/. I clicked on "Entertainment" looking for Janet Jackson news: http://www.time.com/time/arts/. Score! She's currently the top story, in an article named "The Hypocrisy Bowl." I click the link to that story: http://www.time.com/time/magazine/article/0,9171,1101040216-588871,00.html. Unfortunately, no link here, but I can skip to page 3 hoping that there are some external links at the end — I'm worried that this corporate media site may have few or no external links now: http://www.time.com/time/magazine/article/0,9171,1101040216-588871-3,00.html. Sure enough, no external links. I now think it will be impossible to leave the Time site through a link, so I return to Prof. Kearns' home page. Despite just getting burned, I go to http://www.nytimes.com/library/tech/00/02/circuits/articles/10matc.html since I know The New York Times does have external links. I click on "Home" to go to http://www.nytimes.com/yr/mo/day/ and there's an article "My Hero: Janet Jackson" http://www.nytimes.com/2004/02/15/arts/15RICH.html. Again, I have to click on page 2 to get to the end of this, where the links usually are: http://www.nytimes.com/2004/02/15/arts/15RICH.html?pagewanted=2. But there are no links there. And that was the 10th link I followed, anyway.

  6. (12 points)

    10-vertex graph

    Consider this graph:

    1. How large is a maximum clique? Nodes 3,4,7,9,6 describe the only maximum clique of size 5.
    2. What is the average degree? (1+2+5+6+3+4+7+3+4+1)/10=3.6
    3. Write the numbers of the nodes that are in a maximum matching. One maximum (in fact perfect) matching is (1,3);(6,9);(7,10);(4,8);(2,5);
    4. What is the chromatic number? From part (1) you know that this graph has a 5-clique and hence a chromatic number of at least 5. That it is in fact 5 can be checked with the applet mentioned in class:

      Screenshot displaying that the chromatic number is indeed 5

    5. What vertices are in a maximum independent set? One solution would be: 1,9,2,8,10
    6. Draw a spanning tree of this graph.

      spanning tree

  7. (20 points)

    Consider the two "small worlds" network generation models we examined in class — the α-model of Watts, and the model of Kleinberg in which vertices lie on a grid, and a vertex is connected to all others within grid distance p, as well as having q "long distance" connections. A vertex at grid distance d has probability proportional to 1/dr of being chosen as a long-distance connection, for some r > 0.

    1. Discuss how the Erdös-Renyí model can be approximated in both the α-model and Kleinberg's model. What settings of the parameters of each model realizes this approximation? Briefly justify your answer.
      • For VERY large α, the probability that two nodes are getting connected is independent of the number of common friends they have (Solaria World) and is approximately uniformly random.
      • In the Kleinberg Model with r=0, the probability of some node getting connected with another node that is not a direct neighbor is independent of their grid distance and the probability distribution becomes uniform. Each node is still definitely connected with its direct neighbors (different to the ER model), however with big q, these connections become more and more insignificant.
    2. Briefly discuss the similarities and differences between the α-model and Kleinberg's model. Be as precise as possible, referring to the different roles of the two models' parameters. Both models simulate local preference in social networks. However the α-model has no sense of geographic closeness. The probability of two nodes being connected only depends on their number of mutual friends. The bigger α, the smaller is the impact of the amount of common friends, and the more random is the graph. In the Kleinberg model the probability of two nodes being connected only depends on their grid distance. The bigger r, the bigger is the expected radius within which the q non-direct neighbors will lie.
    3. In the Kleinberg model, briefly explain why the value of r must be carefully chosen in order to permit efficient decentralized navigation. If r is too small, shortcuts do exist but they cannot be found as they are too random. If r is too big, all connections are within small neighborhoods and shortcuts disappear. The goal is to find the sweet spot where both effects cancel each other out (r=2).