CSE 494/598 Information Retrieval, Mining and Integration on the Internet

Instructor: Subbarao Kambhampati

Next offering: Fall 2011 (T/Th 4:30--5:45pm, BYENG 270 (NEW BIGGER ROOM))

Notice: If you are trying to get into either 494 or 598 sessions of this class for Fall 2011, please show up on the first day. I will not be able to give any individual overrides at this time. My experience however is that there is a lot of churn over summer and everyone who wants a seat typically gets one when the dust settles down..

This course is geared towards exposing students to some of the core technologies for controlling and using the content on the Internet. The following are some of the questions we will consider:

How do search engines work? Why are some pp better than others?
Can we think of the web as a big database/knoweldge base and support efficient database style query processing?
Can we find useful pearls and patterns in the mass of accessible data on the Internet?

This course will be breadth-oriented introduction to the issues involved in answering these questions.

Prerequisites: CSE 310 required. Other courses that will help include CSE 471 (AI) CSE 412 (Databases) and CSE 450 (Algorithms). I am hoping that students have had at least one of these 4-level courses already, but won't insist on them. Students planning to register for this course are encouraged to talk to the instructor (via email at rao wholivesat asu dot edu).

Grading: The grading will be based on class participation, exams and projects.

Textbooks: There is no prescribed textbook. We will read papers (see the reading list.)

Overview: The best overview is the list of topics and lecture notes from the previous offering (shown below).

Additional pointers:

Check out what students say about the last offering of this course in CEAS student evaluations.
Check out what students "learned" from the last offering of this course (in their own words).
Check out what earlier students did in the last offering of the course
Check out what some of the earlier students are doing now.

Lecture Notes & Audio & Video (Spring 2010)

Introduction [Jan 19, 2010]
- L1. Audio of the lecture 1. Introduction to the course..
Course Overview + Big themes [Jan 21, 2010]
- L2 Audio of the lecture 2. Overview of the course--phrased as adapting Information Retrieval, Information Mining, Social Networks and Data Bases to the challenges of Web. Four big ideas (and water started flowing upwards according to Google).
Information Retrieval
Indexing and Tolerant Dictionaries
- L6 Audio of [Feb 4, 2010] Review/sanity check on similarity metrics. Efficient indexing and retrieval. The idea of inverted index (and the use of positional information in the index). Approximate ranking (or realizing that no one really cares even if you get the ranking between 10000th and 10005th entry wrong). Distributing the main index when it is too large to fit on a single m/c
- L7 Audio of [Feb 9, 2010] Tolerant dictionaries and how mis-spelled/mis-typed words can be corrected on the fly. k-gram distance model; edit-distance model; bayesian perspective (and the "fallacy of prior"). Start of improvements to vector space similarity.
Correlation analysis and Latent Semantic Indexing (Annotated version of the matlab session playing with SVD is here.).
Doing IR on Web: Anchor Text; Page Importance Measures
Social networks and their applications on the Web
- L14 Audio of [March 5, 2010] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) Efficient computation of pagerank (and how it is important to not represent the M* matrix explicitly given that it is not sparse); doing block-based pagerank iteration to avoid thrashing, the use of asynchronous pagerank iteration to improve convergence.
  Social networks start. Connections between link-analysis and the general field of social network analysis. Some iconic examples of social network analysis (Typhoid Mary, Patient Zero, Web graph (and its small-world nature), Offcial florida ballot viral spread, Saddam capture, Aardvark acquisition). Applications of social networks. Graph-based representation and analysis of social entworks. Measures of influence and centrality. Smallworld phenomena--and their examples in kevin bacon game and erdos number.
- L15 Audio of [March 9, 2010] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) Milgram experiment; six-degrees of separation; (uniform) random networks and their properties; realizing that the small world probability increases sharply to 1 right near k=1; where k is the average (expected) degree of the random network. If human networks are (uniform) random, then they will have small-world phenomena (since k, i.e., average number of friends per person, is almost always greater than 1). Trying to confirm whether large-scale social networks are in fact uniform random by comparing their degree distribution to the Poisson degree distribution expected for random networks. Realizing that most real world network degree distributions instead correspond to negative sloping straightlines in log-log space (which means they are of the form P=1/k^r, which is called powerlaws. Discussion of the properties of power-law disributions (which have long tails that fall off only polynomially rather than exponentially). Implications of long tails on everything from probability of existence of such highly-linked sites as google to the ability of making money selling west wing DVDs and iranian classical music CDs on the web. Discussion of generative models which can result in power law distributions over network degrees.
- L16 Audio of [March 11, 2010] (Video of the lecture video part 1 (the first 1hr; battery died after that :( ) Attacks vs. disruptions on powerlaw vs. exponential networks; navigation on social networks; applications of social networks; discussion of trust and reputation; trust rank (a page-rank variant); discussion of social search (and aardvark); discussion of othe powerlaws in cse494--zipf's law; heap's law (and even benford's law).
Topics below this line are not included for Midterm
Crawling & Hardware Issues; Clustering of search results.
- L17 [March 23, 2010](Audio missing)Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) Issues in designing a crawler; tradeoff between page importance and page change rate in deciding crawling priority; how Mercator supports politeness and priority; Connectivity server and its connection to inverted indices; start of hardware issues and the role for coarse-grained parallelism in search engines.
- L18 [March 25, 2010]Audio; Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) Map-reduce parallelism, and how indexing can be done in map-reduce format. Clustering start. Examples of search results clustering. Computing cluster descriptions using tf/icf measures. Impact of "feature representation", "distance measures", cluster tightness and inter-cluster distance on clustering. Internal vs. external measures of quality for clustering.
- L19 [Mar 30, 2010]Audio; Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) Internal and external measures of evaluting clustering; on the number of different k clusterings over n data points (and the realization that the full optimization can be computationally hard); distance measures for text clustering, and the useful properties of dot product distance metrics (such as cosine metric); local-search algorithms for clustering; k-means clustering--example, the tradeoffs in synchronous (centroid computed only once per iteration) and asynchronous (centroid re-computed everytime a change of membership occurs); explaining that given K, we only need to focus on intra-cluster tightness; G-measure which is the sum of squared distances of each point from the centroid of the cluster it belongs to; noticing that k-means comes to a minimum on G measure--but it can be local minimum; the discussion that one way to make k-means globablly optimal would be to allow simultaneous switch of sets of points together; alternative idea of considering multiple initial centroids and seeing which gives a better G measure. Problems with K-means--deciding on K using the knee points in the plot of G vs. k.
- L20 [Apr 6, 2010]Audio; Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) More discussion on K-means and its limitations; "semi-supervised" clustering--where you are given "must-link" and "must-not-link" information. Hierarchical clustering methods--HAC and Bisecting K-means; discussion of how the inter-cluster distance definition changes the way HAC produces clusters. Buckshot and bisecting K-means as hybrid methods. Start of classification--and discussion of Rocchio Relevance Feedback as a classification technique.
Text Classification
- L21 [Apr 8, 2010]Audio; Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)More on Rocchio Relevance feedback; long discussion on why rocchio and k-NN ideas are not good enough for classification (and how classification requires clusters to respect the labels--which may make the clusters not have sufficient internal quality); general classification methods, and why Naive Bayes is a good approach for text classification; discussion of Navie bayes and how it gets by with fewer learned parameters by making strong independence assumptions (and how those are also the same independence assumptions made by the vector similarity model)
- L22 [Apr 13, 2010]Audio; Video of the lecture video part 1(only first 30 min recorded due to battery failure). Naive bayes discussion continued; need for laplacian smoothing; unigram model of text; feature selection using mutual information; example of using mutual information to rank features; realization that MI-based approaches tend to focus on individual features; connection to dimensionality reduction which finds linear (or non-linear) combination of features that are most relevant. Start of recommendation systems.
Recommendation Systems
- L23 [Apr 15, 2010]Audio; Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)Recommendation systems. Content vs. Collaborative. Discussion of the collaborative filtering algorithm. Relative advantages of content vs. collaborative filtering approaches; a way to combine them by using content-based filtering to handle the sparse ratings problem. Adapting that idea to content-based filtering itself and using unlabeled examples to improve the performance of a classifier.
Specifying and Exploiting Structure
- L24 [Apr 20, 2010]Audio; Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)Discussion of item-centered approaches for collaborative filtering (and the connections to association and scalar clustering); discussion of LSI-style approaches to collaborative filtering, and their connection to the NETFLIX prize winner; the technical challenge of doing LSI in the presence of null values (unrated items)--two methods--imputation and going with a dot product on just on the co-rated items; the issue of over-fitting associated with the latter idea, and handling the overfitting through regularization term as part of low-rank approximation.
  Overview of the part 3 of the course which is about specifying and exploiting structure--and how XML and Information Extraction fit in the "specifying/extracting structure", how Xquery and structure-exploiting query processing fit into "exploiting structure" and how the semantic web fits into "integrating structured sources". (Important: make sure you get this discussion as it lays out the overview of the rest of the semester)
- L25 [Apr 22, 2010]Audio; Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)XML--motivations for the standard, a little on the syntax, view of XML from IR and DB, path-querying of XML from IR side; XML as a data model and as an "exchange language" for data.
- L26 [Apr 27, 2010]Audio; Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)XML from Database side--XML DTD, XML Schema, Xquery and its use cases; comparing XQuery and SQL; challenges in putting an XML front-end to relational databases--tagging and converting Xquery to SQL. XML and meaning--"Schema-abiding XML as grammatically correct text, which needs semantic (meaning) constraints". Mapping tags and the semantic web dream.
- L27 [Apr 29, 2010]Audio; Video of the lecture video part 1 (the first 44min; the rest didn't get recorded :-( ) Starting Semantic web--and motivation through Ramayana; Motivating RDF and OWL standards by viewing RDF as providing the "base facts" in the text, and "OWL/RDF-SChema" as providing the background knowledge. Examples of RDF and OWL standards; critical discussion of expressiveness tradeoffs made by RDF and OWL. Use cases if semantic web standards.
  Information extraction as a way of converting text into RDF-triples stage (which is also equivalent to populating a database given some text). Why it is different from full-blown NLP (if IR is a poorman's NLP, IE is a middle-class mans NLP..), why we think it is possible (especially the unwrapping issue).
Information Extraction
- L28 [Apr 30, 2010]Audio; Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)Information extraction; its history; how it differs from NLP, the spectrum of approaches for information extraction and how the scope of the tasks (e.g pattern scope, unary vs. n-ary relation extraction etc) make the task harder; three specific techniques--wrappers/unwrappers that use path expressions on the dom tree to extract information from highly templated text; extractors that use hyponymy patterns on parts-of-speech parse trees to extract general facts; and semtag/seeker which proposes a tag beureau.
Information Integration
Sayanora
- L29 [May 4, 2010]Audio; Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)Information Integration: A brief introduction of the spectrum of approaches and methods; a short discussion of the sourcerank work; Ending.
  - Additional information on information integration can be found here: Information Integration on the Web (AAAI 2007 Tutorial with Craig Knoblock)
What students seem to remember

Linear Algebra Review (by Sushovan De on [Feb 10, 2010])

5/11: Final 9:50 - 11:40 AM

Last modified: Mon Aug 15 09:03:52 MST 2011

CSE 494/598 Information Retrieval, Mining and Integration on the Internet

Instructor: Subbarao Kambhampati

Next offering: Fall 2011 (T/Th 4:30--5:45pm, BYENG 270 (**NEW BIGGER ROOM**))

Lecture Notes & Audio & Video (Spring 2010)

Next offering: Fall 2011 (T/Th 4:30--5:45pm, BYENG 270 (NEW BIGGER ROOM))