Web-Mining Agents Community Analysis Tanya Braun Universitt zu Lbeck Institut fr Informationssysteme Literature Christopher D. Manning, Prabhakar Raghavan and Hinrich Schtze, Introduction to Information Retrieval, Cambridge University Press. 2008. http://nlp.stanford.edu/IR-book/informationretrieval-book.html 2 Retrieving & Ranking Documents With Links Use links As source of information

As means to rank relevant documents 3 Todays Lecture Anchor text Link analysis for ranking in the web PageRank and variants Hyperlink-Induced Topic Search (HITS) Links in topic modeling 4 The Web as a Directed Graph Page A Anchor hyperlink

Page B Assumption 1: A hyperlink between pages denotes author perceived relevance (quality signal) Assumption 2: The anchor of the hyperlink describes the target page (textual context) 5 Indexing anchor text When indexing a document D, include anchor text from links pointing to D. Armonk, NY-based computer giant IBM announced today www.ibm.com

Joes computer hardware links Compaq HP IBM Big Blue today announced record profits for the quarter 6 WWWW World Wide Web Worm (1994) Oliver A. McBryan, GENVL and WWWW: Tools for Taming the Web, Proceedings of the First International World Wide Web Conference, 1994 7 Analysis of Anchor-based Retrieval

For IBM how to distinguish between: IBMs home page (mostly graphical) IBMs copyright page (high term freq. for ibm) Rivals spam page (arbitrarily high term freq.) ibm.com ibm A million pieces of anchor text with ibm send a strong signal IBM home page www.ibm.com Nadav Eiron and Kevin S. McCurley, Analysis of Anchor Text for Web Search, Proceedings of the Twenty-Sixth Annual International Conference on Research and Development in Information Retrieval (SIGIR), 2003

8 Analysis of Anchor-based Retrieval Better than titles as an indicator Summaries from different contexts and different people Multi-word query matched by anchor text of higher value Own set of stop words IBM home page ibm.com ibm Works well for page-finding queries A million pieces of anchor text with ibm send a strong signal www.ibm.com

Nadav Eiron and Kevin S. McCurley, Analysis of Anchor Text for Web Search, Proceedings of the Twenty-Sixth Annual International Conference on Research and Development in Information Retrieval (SIGIR), 2003 9 Links for Query-independent Ordering First generation: using link counts as simple measures of popularity Two basic suggestions (each page gets a score): Undirected popularity: Score of a page = the number of in-links plus the number of out-links (3+2=5) Directed popularity: Score of a page = number of its in-links (3)

10 Query Processing First retrieve all pages meeting the text query (say venture capital). Order these by their link popularity (either variant on the previous page). 11 Spamming Simple Popularity Exercise: How do you spam each of the following heuristics so your page gets a high score? Undirected popularity: Score of a page = the number of in-links plus the number of out-links Directed popularity: Score of a page = number of its in-links

12 PageRank Scoring Imagine a browser doing a random walk on web pages: Start at a random page At each step, go out of the current page along one of the links on that page, equiprobably 1/3 1/3 1/3 In the steady state, each page has a longterm visit rate - use this as the pages score Lawrence Page et.al., The PageRank Citation Ranking: Bringing Order to the Web, 1999 13 Not Quite Enough

The web is full of dead-ends. Random walk can get stuck in dead-ends Makes no sense to talk about long-term visit rates ?? Lawrence Page et.al., The PageRank Citation Ranking: Bringing Order to the Web, 1999 14 Teleporting At a dead end, jump to a random web page At any non-dead end, with probability 10%, jump to a random web page. With remaining probability (90%), go out on a random link 10% - a parameter

There is a long-term rate at which any page is visited. How do we compute this visit rate? Lawrence Page et.al., The PageRank Citation Ranking: Bringing Order to the Web, 1999 15 Mathematical Background: Markov Chains A Markov chain consists of states, plus an transition probability matrix . At each step, we are in exactly one of the states. For , the matrix entry tells us the probability of being the next state, given we are currently in state . is OK.

16 Markov Chains Clearly, for all , =1 =1 Markov chains are abstractions of random walks. Exercise: represent the teleporting random walk from 2 slides ago as a Markov chain, for

this case: 17 Ergodic Markov Chains A Markov chain is ergodic if you have a path from any state to any other (reducibility) returns to states occur at irregular times (aperiodicity) For any start state, after a finite transient time , the probability of being in any state at a fixed time is nonzero. (positive recurrence) Not ergodic (even/ odd). 18 Ergodic Markov Chains

For any ergodic Markov chain, there is a unique long-term visit rate for each state. Steady-state probability distribution. Over a long time-period, we visit each state in proportion to this rate. It does not matter where we start. 19 Probability Vectors A probability (row) vector tells us where the walk is at any point. E.g., means we are in state . 1 More generally, the vector means the walk is in state with probability where

=1 =1 20 Change in Probability Vector If the probability vector is at this step, what is it at the next step? Recall that row of the transition prob. matrix tells us where we go next from state . So from , our next state is distributed as . 21 Reaching the Steady State Recall, regardless of where we start, we eventually reach the steady state . Start with any distribution (say ). After one step, were at ;

after two steps at , then and so on. Eventually means for large , . Algorithm: multiply by increasing powers of until the product looks stable. 22 How Do We Compute This Vector? Let denote the row vector of steady-state probabilities. If we our current position is described by , then the next step is distributed as . But is the steady state, so . Solving this matrix equation gives us . So, is the (left) eigenvector for . (Corresponds to the principal eigenvector of with the largest eigenvalue.) Transition probability matrices always have largest eigenvalue 1. 23

Steady State Example The steady state looks like a vector of probabilities : is the probability that we are in state . 1/4 3/4 1 2 3/4 1/4 For this example, and .

24 PageRank Summary Preprocessing: Given graph of links, build matrix . From , compute . The entry is a number between 0 and 1: the PageRank of page . Query processing: Retrieve pages meeting query. Rank them by their PageRank. Order is query-independent. PageRank is used in Google, but so are many other clever heuristics. Lawrence Page et.al., The PageRank Citation Ranking: Bringing Order to the Web, 1999 25

PageRank: Issues and Variants How realistic is the random surfer model? What if we modeled the back button? [Fagi00] Surfer behavior sharply skewed towards short paths [Hube98] Search engines, bookmarks & directories make non-random jumps. Biased Surfer Models Weight edge traversal probabilities based on match with topic/query (non-uniform edge selection) Bias jumps to pages on topic (e.g., based on personal bookmarks & categories of interest) 26 Topic-specific PageRank [Have02] Conceptually, we use a random surfer who teleports, with say 10% probability, using the

following rule: Selects a category (e.g., one of the 16 top level ODP categories) based on a query & userspecific distribution over the categories ODP = Open Directory Project, now: dmoz Taher H. Haveliwala, Topic-sensitive PageRank, Proceedings of the Eleventh International World Wide Web Conference, 2002 27 Topic-specific PageRank [Have02] Conceptually, we use a random surfer who teleports, with say 10% probability, using the following rule: Selects a category (e.g., one of the 16 top level ODP categories) based on a query & userspecific distribution over the categories Teleport to a page uniformly at random within the chosen category

Taher H. Haveliwala, Topic-sensitive PageRank, Proceedings of the Eleventh International World Wide Web Conference, 2002 28 Non-uniform Teleportation Sports Teleport with 10% probability to a Sports page 29 Topic-specific PageRank [Have02] Conceptually, we use a random surfer who teleports, with say 10% probability, using the following rule: Selects a category (e.g., one of the 16 top level ODP categories) based on a query & userspecific distribution over the categories

Teleport to a page uniformly at random within the chosen category Sounds hard to implement: cannot compute PageRank at query time! Taher H. Haveliwala, Topic-sensitive PageRank, Proceedings of the Eleventh International World Wide Web Conference, 2002 30 Topic-specific PageRank [Have02] Implementation Offline: Compute PageRank distributions w.r.t. to individual categories Query-independent model as before Each page has multiple PageRank scores one for each ODP category, with teleportation only to that category

Online: Distribution of weights over categories computed by query context classification Generate a dynamic PageRank score for each page - weighted sum of category-specific PageRanks Taher H. Haveliwala, Topic-sensitive PageRank, Proceedings of the Eleventh International World Wide Web Conference, 2002 31 Influencing PageRanks Input: Web graph Influence vector Output: Rank vector

PR = PageRank Taher H. Haveliwala, Topic-sensitive PageRank, Proceedings of the Eleventh International World Wide Web Conference, 2002 32 Interest in Sports Sports 10% Sports teleportation Taher H. Haveliwala, Topic-sensitive PageRank, Proceedings of the Eleventh International World Wide Web Conference, 2002 33 Interest in Health

Health 10% Health teleportation Taher H. Haveliwala, Topic-sensitive PageRank, Proceedings of the Eleventh International World Wide Web Conference, 2002 34 What If Many Interests? Health Sports gives you: 9% sports teleportation, 1% health teleportation Taher H. Haveliwala, Topic-sensitive PageRank, Proceedings of the Eleventh International World Wide Web Conference, 2002 35

Composite Scores For a set of personalization vectors PR ( , ) =PR ( , ) Weighted sum of rank vectors itself forms a valid rank vector, because is linear w.r.t. Taher H. Haveliwala, Topic-sensitive PageRank, Proceedings of the Eleventh International World Wide Web Conference, 2002 36 Query Answering Classify query (+ context) to choose topic(s)

User interaction? Context: search history, words surrounding search phrase Find set of relevant documents Rank them according to composite score given topics Taher H. Haveliwala, Topic-sensitive PageRank, Proceedings of the Eleventh International World Wide Web Conference, 2002 37 Example Query Answering Possible topic weight for query Example Topic distribution

Distribution golf the query golf, with no additional context, No For additional context the distribution of topic weights we would use How to get? is: 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

Taher H. Haveliwala, Topic-sensitive PageRank, Proceedings of the Eleventh International World Wide Web Conference, 2002 38 Example Query Answering Possible topic weight distribution for query Picking the Topic Distribution golf If the query is golf, but thefinancial previous query was Context: previous query services

financial services investments, then the investments of topic weights we would use is: Still:distribution How to get? 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 A B u rt s s C i ne om ss pu

te r G s am es H ea Ki lth ds _a Ho nd m _T e ee ns N e R e c ws re

at R e f i on er en c R eg e io n Sc a l ie Sh nce op pi ng So ci et y

Sp or ts W or ld 0.2 0.1 0 Taher H. Haveliwala, Topic-sensitive PageRank, Proceedings of the Eleventh International World Wide Web Conference, 2002 39 Hyperlink-induced Topic Search A H

A A H A A H Hubs Authorities 40 Hyperlink-Induced Topic Search (HITS) In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages:

Hub pages are good lists of links on a subject. e.g., Bobs list of cancer-related links. Authority pages occur recurrently on good hubs for the subject. Best suited for broad topic queries rather than for page-finding queries. Gets at a broader slice of common opinion. Jon M. Kleinberg, Hubs, Authorities, and Communities, ACM Computing Surveys 31(4), December 1999 41 Hubs and Authorities Thus, a good hub page for a topic points to many authoritative pages for that topic. A good authority page for a topic is pointed to by many good hubs for that topic.

Circular definition - will turn this into an iterative computation. 42 The Hope AT&T Alice Authorities Hubs Sprint Bob MCI Long distance telephone companies 43

High-level Scheme Extract from the web a base set of pages that could be good hubs or authorities. From these, identify a small set of top hub and authority pages; Iterative algorithm. 44 Base Set Given text query (say browser), use a text index to get all pages containing browser. Call this the root set of pages. Add in any page that either points to a page in the root set, or is pointed to by a page in the root set. Call this the base set.

45 Visualization Root set Base set 46 Assembling the Base Set Root set typically 200-1000 nodes. Base set may have up to 5000 nodes. How do you find the base set nodes? Follow out-links by parsing root set pages. Get in-links (and out-links) from a connectivity server. (Actually, suffices to text-index strings of the form href=URL to get in-links to URL.)

47 Distilling Hubs and Authorities Compute, for each page in the base set, a hub score and an authority score . Initialize: for all Iteratively update all After iterations output pages With highest scores as top hubs With highest scores as top authorities 48 Iterative Update Repeat the following updates, for all : Hub scores update

h() ( ) x Authority scores update a() h( ) x 49 Scaling To prevent the and values from getting too big, can scale down after each iteration. Scaling factor does not really matter:

We only care about the relative values of the scores. 50 How Many Iterations? Claim: relative values of scores will converge after a few iterations: In fact, suitably scaled, and scores settle into a steady state! We only require the relative orders of the and scores - not their absolute values. In practice, ~5 iterations get you close to stability. 51 Things to Note Pulled together good pages regardless of

language of page content. Use only link analysis after base set assembled Iterative scoring is query-independent. Iterative computation after text index retrieval - significant overhead. 52 Issues Topic Drift Off-topic pages can cause off-topic authorities to be returned E.g., the neighborhood graph can be about a super topic Mutually Reinforcing Affiliates Affiliated pages/sites can boost each others scores

Linkage between affiliated pages is not a useful signal 53 Generative Topic Models for Community Analysis Pilfered from: Ramesh Nallapati http://www.cs.cmu.edu/~wcohen/10-802/lda-sep-18.p pt & Arthur Asuncion, Qiang Liu, Padhraic Smyth: Statistical Approaches to Joint Modeling 54 What if the corpus has network structure?

CORA citation network. Figure from [Chang, Blei, AISTATS 2009] 55 Citations As Links Pioneered by Eugene Garfield since the 1960s Citation frequency Co-citation coupling frequency Cocitations with a given author measures impact Cocitation analysis Bibliographic coupling frequency Articles that co-cite the same articles are related Citation indexing Who is cited by author x? E. Garfield, Citation analysis as a tool in journal PageRank (preview:

Pinski and Narin 60s) evaluation. Science. Nov 3;178(4060):471-9. 1972 G. Pinski, F. Narin. Citation aggregates for journal aggregates of scientific publications: Theory, with application to the literature of physics. Information Processing & Management, 12 (5), 297- 312. 1976 56 Outline Topic Models for Community Analysis Citation Modeling with pLSA with LDA Modeling influence of citations Relational Topic Models 57

Hyperlink Modeling Using pLSA Select document d ~ Mult() For each position n = 1,, Nd d d z Generate zn ~ Mult Generate wn ~ Mult z

w For each citation j = 1,, Ld c N L M Generate zj ~ Mult Generate cj ~ Mult g D. A. Cohn, Th. Hofmann, The Missing Link - A Probabilistic Model of Document Content and

Hypertext Connectivity, In: Proc. NIPS, pp. 430-436, 2000 58 Hyperlink Modeling Using pLSA pLSA likelihood d d z z w c

N New likelihood L M g Learning using EM 59 Hyperlink Modeling Using pLSA Heuristic 0 < < 1 determines the relative importance of content and hyperlinks 60

Hyperlink Modeling Using pLSA Classification performance Hyperlink Content Hyperlink Content 61 Hyperlink modeling using LDA For each document d, Generate d ~ Dirichlet()

For each position i = 1, ... , Nd z Generate a topic zi ~ Mult Generate a word wi ~ Mult z For each citation j = 1, , Lc w c N L M

g Generate zi ~ Mult(d) Generate ci ~ Mult Learning using variational E. Erosheva, S Fienberg, J. Lafferty, Mixed-membership EM, Gibbs sampling models of scientific publications. Proc National Academy Science U S A. 2004 Apr 6;101 Suppl 1:5220-7. Epub 2004 Mar 12. 62

Link-PLSA-LDA: Topic Influence in Blogs R. Nallapati, A. Ahmed, E. Xing, W.W. Cohen, Joint Latent Topic Models for Text and Citations, In: Proc. KDD, 2008. 63 64 Modeling Citation Influences - Copycat Model Each topic in a citing document is drawn from one of the topic mixtures of cited publications L. Dietz, St. Bickel, and T. Scheffer, Unsupervised Prediction of Citation Influences, In: Proc. ICML 2007.

65 Modeling Citation Influences Citation influence model: Combination of LDA and Copycat model L. Dietz, St. Bickel, and T. Scheffer, Unsupervised Prediction of Citation Influences, In: Proc. ICML 2007. 66 Modeling Citation Influences Citation influence graph for LDA paper 67 Modeling Citation Influences

Words in LDA paper assigned to citations 68 Modeling Citation Influences Predictive Performance 69 Relational Topic Model (RTM) [ChangBlei 2009] Same setup as LDA, except now we have observed network information across documents , ,

, ( | , Link probability function

, , , , ,

M K Documents with similar topics are more likely to be linked.

70 Relational Topic Model (RTM) [ChangBlei 2009] For each document d Draw topic proportions For each word Draw assignment Draw word

, , , , , For each pair of

documents K Draw binary link indicator 71 Collapsed Gibbs Sampling for RTM Conditional distribution of each : LDA term Edge

term Non-edge term Using the exponential link probability function, it is computationally efficient to calculate the edge term. It is very costly to compute the non-edge term exactly. 74 Approximating the Non-edges 1. Assume non-edges are missing and ignore the term entirely (Chang/Blei) 2. Make the following fast approximation: ( 1 exp ( ) ) ( 1 exp ( ) ) , where =

1 3. Subsample non-edges and exactly calculate the term over subset. 4. Subsample non-edges but instead of recalculating statistics for every token, calculate statistics once per document and 75 Document networks # Docs # Links

Ave. DocLength VocabSize Link Semantics 4,000 17,000 1,200 60,000 Paper citation (undirected) 10,000

43,000 640 38,000 Common actor/director Enron (Undirec ted) 1,000 16,000 7,000 55,000

Communication between person i and person j Enron (Directe d) 2,000 21,000 3,500 55,000 Email from person i to person j

CORA Netflix Movies 79 Link Rank Link rank on held-out data as evaluation metric Lower is better dtest {dtrain} Edges among {dtrain} Blackbox predicto r

Ranking over {dtrain} Link ranks Edges between dtest and {dtrain} How to compute link rank (simplified): 1. Train model with {dtrain}. 2. Given the model, calculate probability that dtest would link to each dtrain. Rank {dtrain} according to these probabilities. 80 3. For each observed link between dtest and {dtrain}, Results on CORA data Comparison on CORA, K=20 270 250

Link Rank 230 210 190 170 150 Baseline (TFIDF/Cosine) LDA + Regression Ignoring non-edges Fast approximation of

Subsampling nonnon-edges edges (20%)+Caching 8-fold cross-validation. Random guessing gives link rank = 2000. 82 Results on CORA data 650 400 Baseline RTM, Fast Approximation 350 550 500 Link Rank

300 Link Rank Baseline LDA + Regression (K=40) Ignoring Non-Edges (K=40) Fast Approximation (K=40) Subsampling (5%) + Caching (K=40) 600 250 200 450 400 350 300 250

150 200 100 0 20 40 60 80 100 Number of Topics 120 140

160 150 0 0.2 0.4 0.6 Percentage of Words 0.8 1 Model does better with more topics Model does better with more words in each document

83 Timing Results on CORA CORA, K=20 7000 6000 Time (in seconds) 5000 LDA + Regression Ignoring Non-Edges Fast Approximation Subsampling (5%) + Caching Subsampling (20%) + Caching 4000 3000 2000

1000 0 1000 1500 2000 2500 3000 Number of Documents 3500 4000 Subsampling (20%) without caching not shown since it takes 62,000 seconds for D=1000 and 3,720,150 seconds for D=4000 84

Some movie examples 'Sholay' Cowboy Indian film, 45% of words belong to topic 24 (Hindi topic) Top 5 most probable movie links in training set: 'Laawaris 'Hote Hote Pyaar Ho Gaya

'Trishul 'Mr. Natwarlal 'Rangeela Western film, 25% of words belong to topic 7 (western topic) Top 5 most probable movie links in training set: 'Tall in the Saddle 'The Indian Fighter' 'Dakota' 'The Train Robbers' 'A Lady Takes a Chance Rocky II Boxing film, 40% of words belong to topic 47 (sports topic) Top 5 most probable movie links in training set: 'Bull Durham '2003 World Series

'Bowfinger 'Rocky V 'Rocky IV' 88