This problem set has seven problems for a total of 100 points. Some problems have several parts.
This assignment is to be handed in at the beginning of class on paper, not submitted by email.
The instructions to "answer in about X words" are guidelines. You are not required to write any particular number of words, but we expect that complete answers will be about that long.
Visit the epidemic simulation provided by Sangkyum Kim: http://www.cis.upenn.edu/~sangkyum/applet_codes/epidemic/Epidemic.html
The simulator draws a two dimensional lattice. Each square represents a node in the network. Each node has eight neighbors to which it is directly connected. A non-zero rewiring probability allows for a chance that a node is also connected to some other random node on the graph.
("Random seed" is just a number that is used by the applet in a function that generates pseudo-random numbers. You can type some arbitrary number in there, if you like.)
Set the model speed to "hyper" and run thirty simulations with a fixed infection probability of .19. Only the rewiring probability will vary during your experiments.
Run five simulations for each of the following rewiring probability values: 0, .2, .4, .6, .8, 1.
Record the "generations" value (in the upper right) for each of the twenty thirty experiments, labeling which ones came from each of the probability values. Graph the mean (average) values for 0, .2, .4, .6, .8 and 1. You can use a graphing program but also feel free to use pen or pencil.
Answer (5 points possible): Example: (0) 1, 1, 14, 4, 2 (.2) 45, 40, 62, 13, 4 (.4) 1, 70, 1, 5, 1 (.6) 1, 65, 37, 2, 1 (.8) 51, 14, 36, 11, 39 (1) 3, 1, 5, 51, 36. Means: 4.4, 32.8, 15.6, 21.2, 30.2, 9.6. (I omit the graph here.) Recording the data is worth two points, graphing it correctly is worth three points.
Your observations may lead to you conclude that there a tipping point for this system. If so, what estimate can you make for when the tipping point for this system occurs, given the experiments you have done? If you conclude that there is no tipping point, write "none."
Answer (5 points possible): Two answers are appropriate: either "none can be determined" or "it occurs somewhere between 0 and .4." It's possible that there is a tipping point in something other than the mean (e.g., the variance) or that a tipping point for the mean will become clear after many more values are sampled, but it is currently swamped by the large variance. Answers of "none" get full credit, as do any answers that indicate a point that seems evident from the data collected. Any other answers that show knowledge of what a tipping point is get up to three points; the other two points can be obtained in these cases by mentioning the difficulty with our small number of samples.
Why does this system have a tipping point, or why not? Be precise. If you think it has a tipping point, explain what change in the structure of the system would eliminate that tipping point. If you think it doesn't, explain a structural change that would cause the system to have one.
Answer (5 points possible): Either the system does not currently exhibit a tipping phenomenon or else these 30 experiments are too few to reveal where that point is. Structural changes that might introduce one include increasing the density of connection (for instance, by moving from a 2D square grid to a 2D hexagonal tiling, or by creating a 3D volume of cubes) or introducing constraints on how random links are selected (for instance, by selecting them using preferential attachment, so that some individuals become high-degree "connectors"). Any answers that indicate an understanding of what network structure is (as opposed to suggesting changes in the infection probability parameter) get full credit.
Is the number of turns (or "generations") that the disease spreads really strongly correlated with whether or not there is an outbreak? What other metric could we use to determine whether or not an outbreak has occurred?
Answer (5 points possible): There is a fairly strong correlation, but it does not perfectly track the percentage of the population infected. For instance, the run at .2 rewiring that went for 62 generations clearly infected less that half the population, while the runs at .6 that went for 37 generations and at 1 that went for 36 generations clearly infected more than half. If we define an outbreak in terms of percentage of the population infected, we should directly count the percentage of the population infected. However, the applet does not display this statistic, while it does provide the "generations" value. Another reasonable way to define it might be in terms of the spatial coverage of the disease. Indicating that generations is correlated gets two points; If the answer mentions one of the above alternative statistics or any other reasonable statistic, it gets two further points. To obtain full credit, an answer must include why counting the generations might be problematic.
Describe a statistical process by which large networks are generated that might lead to a heavy-tailed distribution of degrees (in which most vertices have relatively small degree, but a small fraction have inordinately large degree). This may be an informal English description of how new vertices are added, and the process by which they acquire links to other vertices, or it may be a more mathematical, formal description (e.g. an algorithm for generating networks). In either case you should strive to be as precise as you can. Try to make your process as decentralized as possible. You should then argue, again as precisely as possible, why your process generates heavy-tailed distributions, and discuss what types of natural networks it might model well.
Answer (15 points possible): One example is Barabási and Albert's preferential attachment (called "preferential growth" by Watts). In English, one version of this is as follows: Begin with a small, fully connected graph. Add new nodes as follows: when adding each node, connecet it, using one link, to some existing node that is already in the graph. Pick the connection point so that a node with X existing connections (that is, a node of degree X) is X times as likely to be chosen as is a node with one existing connection. (Initially, every node will be equally likely, but this will change after one node is added.) This is decentralized since only the choice of where the connect each new link is necessary; it does not require global changes to the graph. It generates a heavy-tailed distribution since low-degrees nodes (e.g., degree 1 nodes) are unlikely to be connected to in the future, leaving a heavy tail of many of them that will continually grow. It would work well as a model (for example) of a citation network, where well-known, older, influential papers are very likely to be cited as new papers are added, but many of these new papers never end up being cited. (Since we allow only one link, we really would have to say that it models a special form of the citation network where each paper can only cite one other — perhaps it is the "main citation" in any paper.) Any answer that actually clearly proposes any statistical process (centralized or otherwise) gets seven points just for doing that. If a process is mentioned vaguely or suggested, partial credit can be obtained here. If the process is decentralized (can be done a node at a time without global direction), another point is awarded. If there is an argument about why it generates a heavy-tailed distribution, two points; a persuasive argument gets two more points. Naming one or more natural networks gets the remaining three points.
Consider a network in which person A has ten friends, the ten friends of person A each have ten friends, and those ten friends each have ten friends. There are no other people in this network besides the ones that have been mentioned. We assume that people cannot be their own friends.
What is largest number of people who could possibly be in such a network? Describe the structure of this network.
Answer (5 points possible): assuming friendship is not required to be symmetric, 1111. This happens when the network is a tree: A has 10 friends; those 10 all together have 100 "new" friends not already in the network, and those 100 have 1000 "new" friends. If friendship is symmetric, each node except for person A can have at most 9 "new" friends — one of the friendship links must go "backwards" toward A — so the answer is 1 + 10 + 90 + 810 = 911. In either case, the structure is a tree. Two points for (either of the two) right numbers; three points for "tree."
What is the smallest number? Describe the structure of this network.
Answer (5 points possible): 11. This is a clique (every node is connected to every other node directly), like in high school. Two points for the right number, three points for describing the right structure.
Before consulting any Web resources on the subject, list the ten universities that seem to you to be most similar to Penn. This is "your list."
Now, visit TouchGraph's GoogleBrowser: http://www. touchgraph.com/TGGoogleBrowser.html Enter "www.upenn.edu" as your stating URL. You should also visit Google and simply type "related:www.upenn.edu" as your search query. The GoogleBrowser and the "related" search on Google.com are giving you two different visualizations of the same information that is in Google's database.
Take a look at the top ten Universities whose Web sites seem to be the most related to Penn's, according to Google's analysis. (This is probably easier in Google itself, since results are listed in order of similarity.) The ranking of "related" pages in Google's database is "Google's list" and can be observed several ways, but just look at the Google search results for "related:www.upenn.edu."
Pick two or three of the most surprising differences between the top results on both lists, that is, "your list" and "Google's list." What factors could account for these differences? (Looking at the GoogleBrowser may help you to determine this, since it gives some visual information about who links to who.) What things were you thinking about when you wrote your list, and what factors are important to Google's analysis?
Answer (10 points possible): The answer should reflect the fact that Google considers only Web sites (not geographical location or any other higher-level knowledge about universities) and that links on the Web, which are used in Google's measure of "similarity," are not always arranged to match our intuitive definition of "similarity." For instance, Drexel may be "similar" to Penn because students from one university sometimes take courses at the other, and this leads both students and faculty to create hypertext links between the pages of the two universities. Also, both schools are in Philadelphia and pages that list local universities will link to both, while the school's pages will link to common local events and directories. (This was also probably a factor in Temple's similarity.) On the other hand, the fact that Penn and Dartmouth are in the same athletic conference (the Ivy League) may create a mental association between the schools, and a few links related to news items and sports activities, but that by itself probably doesn't result in as many links. For full credit, an answer should note that Google only pays attention to the text and links of Web sites and describe some reason that link structure might differ from a person's idea of similarity.
Pick a university on your list that was not "similar" to Penn according to Google. Find a path of similarity, either using the TouchGraph GoogleBrowser or using Google itself, between Penn and that university. Write down the path that you found. How long did it take you to find it? What strategy did you use to find that path?
Answer (5 points possible): I thought Princeton would be somewhat similar, as it is a nearby Ivy League school from which we recently raided a provost to become our university's president. But Princeton was not one of the top 10 search results on Google's "similar" list. From Google's list of pages similar to Penn's, I chose Harvard's. Harvard is the other Ivy League school on the list, and one that shares several other features with Princeton that are more specific that general academic reputation and membership in the Ivy League: it is a private university in the Northeast, has good name recognition, and hosts a major university press. Princeton was #10 on the list of "related" results for Harvard. The point here is to reflect on how you do search on this type of graph. Any answer that describes a process of doing this search, and that actually gives the path discovered, gets five points.
Hasbro developed an electronic game called Pox a few years ago. People marketing the game went to schools in Chicago and conducted research in an attempt to determine which kids were coolest. They asked kids to rank others in terms of coolness, asked the coolest of these to name the people they thought were the coolest, and so on, until they found the "top dogs" who were the seen as the coolest kids. Then they gave free units of their game to these "top dogs." Leaving aside for a moment the social atrocity that the company committed, what are some reasons that this marketing strategy be ineffective in transmitting Pox, given Gladwell's discussion of the way that product "epidemics" spread?
Answer (10 points possible): Some examples that could be the basis for an answer: For one thing, being "coolest" means you're admired, but does not particularly make you a maven, connector, or salesman. Also, the "top dogs" may have one or more of these three roles, but they may not be linked to the target community that would actually buy the toy — if it's a toy that geeks like, giving it to a cool football-player-type guy isn't going to help sell it. Also, the toy may not be sticky, or it may not be the type of thing that mavens would personally recommend (even if you can see cool kids playing with it), and personal recommendations may matter for kids, too. There are probably five or ten other good answers to this question out there. An answer that accurately characterizes Gladwell's discussion in The Tipping Point and relates it to the Pox situation gets five points. Critically thinking about the Pox situation and describing a possible "failure mode" for this type of marketing is worth the additional five points.
A few years ago, almost overnight, Napster became the biggest peer-to-peer file sharing network in the world. However, soon after the company was bought by Bertelsmann it was shut down for legal reasons. (A new service has relaunched under the Napster name.) Now Kazaa (using a Gnutella-like scheme) has taken over and seems to have remained successful, despite similar challenges. Outline the network dynamics and statics of both networks. What are the fundamental differences and where do they overlap? In what sense is Kazaa's network structure responsible for its long survival? (This exercise might require some online research on the different file sharing companies and technologies.)
Answer (15 points possible): Both Napster and Kazaa use (or used) peer-to-peer filesharing, so that files are exchanged between user's computers, not uploaded to and downloaded from a central server. This is in contrast to the way the Web works; that system relies on Web servers. Both types of network architecture can be interesting to consider, but they are fundamentally different. Within peer-to-peer architectures, Napster and Kazaa differ. Napster relied upon a centralized directory server that (a) provided legal evidence that copyright violations were occurring on the system, and (b) served as a single point of failure, both technologically and legally. Kazaa distributed its directory services, resulting in slower and less reliable searches but answering the problems assocaited with (a) and (b). Correct description of a peer-to-peer system and noting that both systems are peer-to-peer earns seven points. Noting the difference in directory services and pointing out Napster's single point of failure earns the other eight points. Partial credit can be obtained for less complete answers. Not mentioning the network structrues of these two systems at all results in at most two points total.
For each of the following general categories of networks discussed in class, give a real-world example of such a network other than those given in class. You should be as clear as possible about what the vertices (nodes) are and about what relationship between two vertices determines whether or not there is an edge (link) between them. For each of your networks, comment in a sentence or two on its static structure, its dynamics, and its formation:
Answer (3 points possible for each): Example, for Social Networks: Consider, in honor of Valentine's Day, the social network of romantic attraction, with people as vertices, attraction relationships as edges. We are painfully aware that this is a directed graph; it is possible to be attracted to someone who is not attracted to you. The graph need not be connected and there may even be people (e.g., solitary monks) who have in-degree and out-degree of zero. New attractions can form when people meet one another, and it is probably the case that people are more likely to be attracted to people who are generally attractive (have high in-degree). Naming a network not discussed in class gets one point, commenting on it clearly (or giving a good, new comment on an already-discussed network) earns two. Lack of clarity in discussing edges and vertices results in the loss of one point.