CSE 494/598 Information Retrieval, Mining and Integration on
the Internet
Next offering: Fall 2011 (T/Th 4:30--5:45pm, BYENG 270 (**NEW BIGGER ROOM**))
Notice: If you are trying to get into either 494 or 598 sessions
of this class for Fall 2011, please show up on the first day. I will not be able to
give any individual overrides at this time. My experience however is that there is
a lot of churn over summer and everyone who wants a seat typically
gets one when the dust settles down..
This course is geared towards exposing students to some of the core
technologies for controlling and using the content on the
Internet. The following are some of the questions we will
consider:
- How do search engines work? Why are some
pp better than others?
- Can we think of the web as a big database/knoweldge base and support efficient
database style query processing?
- Can we find useful pearls and patterns in the mass of
accessible data on the Internet?
This course will be breadth-oriented introduction to the issues
involved in answering these questions.
Prerequisites: CSE 310 required. Other courses that will help include
CSE 471 (AI) CSE 412 (Databases) and CSE 450 (Algorithms). I
am hoping that students have had at least one of these 4-level
courses already, but won't insist on them. Students planning
to register for this course are encouraged to talk to the
instructor (via email at rao wholivesat asu dot edu).
Grading: The grading will be based on class participation, exams and
projects.
Textbooks: There is no prescribed textbook. We will read papers (see
the reading list.)
Overview: The best overview is the list of topics and lecture notes
from the previous offering (shown below).
Additional pointers:
Lecture Notes & Audio & Video (Spring 2010)
- Introduction
[Jan 19, 2010]
- Course
Overview + Big themes
[Jan 21, 2010]
-
Information Retrieval
-
Indexing and Tolerant Dictionaries
-
Correlation analysis and Latent Semantic Indexing (Annotated version of the matlab session playing with
SVD is here.).
-
Doing IR on Web: Anchor Text; Page Importance Measures
-
Social networks and their applications on the Web
- L14 Audio
of [March 5, 2010] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)
Efficient computation of pagerank (and how it is important to not
represent the M* matrix explicitly given that it is not
sparse); doing block-based pagerank iteration to avoid
thrashing, the use of asynchronous pagerank iteration to
improve convergence.
Social networks
start. Connections between link-analysis and the general
field of social network analysis. Some iconic examples of
social network analysis (Typhoid Mary, Patient Zero, Web
graph (and its small-world nature), Offcial florida ballot
viral spread, Saddam capture, Aardvark
acquisition). Applications of social networks. Graph-based
representation and analysis of social entworks. Measures
of influence and centrality. Smallworld phenomena--and
their examples in kevin bacon game and erdos number.
- L15 Audio
of [March 9, 2010] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)
Milgram experiment; six-degrees of separation; (uniform) random
networks and their properties; realizing that the small world
probability increases sharply to 1 right near k=1; where k is the
average (expected) degree of the random network. If human networks are
(uniform) random, then they will have small-world phenomena (since k,
i.e., average number of friends per person, is almost always greater
than 1). Trying to confirm whether large-scale social networks are in
fact uniform random by comparing their degree distribution to the
Poisson degree distribution expected for random networks. Realizing
that most real world network degree distributions instead correspond
to negative sloping straightlines in log-log space (which means they
are of the form P=1/k^r, which is called powerlaws. Discussion of the
properties of power-law disributions (which have long tails that fall
off only polynomially rather than exponentially). Implications of long
tails on everything from probability of existence of such
highly-linked sites as google to the ability of making money selling
west wing DVDs and iranian classical music CDs on the web. Discussion
of generative
models which can result in power law distributions over network
degrees.
- L16 Audio
of [March 11, 2010] (Video of the lecture video part 1 (the first 1hr; battery died after that :( )
Attacks vs. disruptions on powerlaw vs. exponential networks;
navigation on social networks; applications of social networks;
discussion of trust and reputation; trust rank (a page-rank variant);
discussion of social search (and aardvark); discussion of othe
powerlaws in cse494--zipf's law; heap's law (and even benford's law).
Topics below this line are not included for Midterm
-
Crawling & Hardware Issues; Clustering of search results.
- L17 [March 23, 2010](Audio missing)Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)
Issues in designing a crawler; tradeoff between page importance and
page change rate in deciding crawling priority; how Mercator supports
politeness and priority; Connectivity server and its connection to
inverted indices; start of hardware issues and the role for
coarse-grained parallelism in search engines.
- L18 [March 25, 2010]Audio;
Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)
Map-reduce parallelism, and how indexing can be done in map-reduce
format. Clustering start. Examples of search results
clustering. Computing cluster descriptions using tf/icf
measures. Impact of "feature representation", "distance measures",
cluster tightness and inter-cluster distance on clustering. Internal
vs. external measures of quality for clustering.
- L19 [Mar 30, 2010]Audio;
Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)
Internal and external measures of evaluting clustering; on the number
of different k clusterings over n data points (and the realization
that the full optimization can be computationally hard); distance
measures for text clustering, and the useful properties of dot product
distance metrics (such as cosine metric); local-search algorithms for
clustering; k-means clustering--example, the tradeoffs in synchronous
(centroid computed only once per iteration) and asynchronous (centroid
re-computed everytime a change of membership occurs); explaining that
given K, we only need to focus on intra-cluster tightness; G-measure
which is the sum of squared distances of each point from the centroid
of the cluster it belongs to; noticing that k-means comes to a minimum
on G measure--but it can be local minimum; the discussion that one way
to make k-means globablly optimal would be to allow simultaneous
switch of sets of points together; alternative idea of considering
multiple initial centroids and seeing which gives a better G
measure. Problems with K-means--deciding on K using the knee points in
the plot of G vs. k.
- L20 [Apr 6, 2010]Audio;
Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) More discussion on K-means and its limitations; "semi-supervised" clustering--where you are given "must-link" and "must-not-link" information. Hierarchical clustering methods--HAC and Bisecting K-means; discussion of how the inter-cluster distance definition changes the way HAC produces clusters. Buckshot and bisecting K-means as hybrid methods. Start of classification--and discussion of Rocchio Relevance Feedback as a classification technique.
-
Text Classification
- L21 [Apr 8, 2010]Audio;
Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)More on Rocchio Relevance feedback; long discussion on why rocchio and k-NN ideas are not good enough for classification (and how classification requires clusters to respect the labels--which may make the clusters not have sufficient internal quality); general classification methods, and why Naive Bayes is a good approach for text classification; discussion of Navie bayes and how it gets by with fewer learned parameters by making strong independence assumptions (and how those are also the same independence assumptions made by the vector similarity model)
- L22 [Apr 13, 2010]Audio;
Video of the lecture video part 1(only first 30 min recorded due to battery failure). Naive bayes discussion continued; need for laplacian smoothing;
unigram model of text; feature selection using mutual information;
example of using mutual information to rank features; realization that
MI-based approaches tend to focus on individual features; connection to
dimensionality reduction which finds linear (or non-linear)
combination of features that are most relevant. Start of
recommendation systems.
-
Recommendation Systems
- L23 [Apr 15, 2010]Audio;
Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)Recommendation systems. Content vs. Collaborative. Discussion of the collaborative filtering algorithm. Relative advantages of content vs. collaborative filtering approaches; a way to combine them by using content-based filtering to handle the sparse ratings problem. Adapting that idea to content-based filtering itself and using unlabeled examples to improve the performance of a classifier.
-
Specifying and Exploiting Structure
- L24 [Apr 20, 2010]Audio;
Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)Discussion of item-centered approaches for collaborative filtering (and the connections to association and scalar clustering); discussion of LSI-style approaches to collaborative filtering, and their connection to the NETFLIX prize winner; the technical challenge of doing LSI in the presence of null values (unrated items)--two methods--imputation and going with a dot product on just on the co-rated items; the issue of over-fitting associated with the latter idea, and handling the overfitting through regularization term as part of low-rank approximation.
Overview of the part 3 of the course which is about specifying and
exploiting structure--and how XML and Information Extraction fit in
the "specifying/extracting structure", how Xquery and
structure-exploiting query processing fit into "exploiting structure"
and how the semantic web fits into "integrating structured
sources". (Important: make sure you get this discussion as it lays out
the overview of the rest of the semester)
- L25 [Apr 22, 2010]Audio;
Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)XML--motivations for the standard, a little on the syntax, view of XML from IR and DB, path-querying of XML from IR side; XML as a data model and as an "exchange language" for data.
- L26 [Apr 27, 2010]Audio;
Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)XML from Database side--XML DTD, XML Schema, Xquery and its use cases; comparing XQuery and SQL; challenges in putting an XML front-end to relational databases--tagging and converting Xquery to SQL. XML and meaning--"Schema-abiding XML as grammatically correct text, which needs semantic (meaning) constraints". Mapping tags and the semantic web dream.
- L27 [Apr 29, 2010]Audio;
Video of the lecture video part 1 (the first 44min; the rest didn't get recorded :-( ) Starting Semantic web--and motivation through Ramayana; Motivating RDF and OWL standards by viewing RDF as providing the "base facts" in the text, and "OWL/RDF-SChema" as providing the background knowledge. Examples of RDF and OWL standards; critical discussion of expressiveness tradeoffs made by RDF and OWL. Use cases if semantic web standards.
Information extraction as a way of converting text into RDF-triples
stage (which is also equivalent to populating a database given some
text). Why it is different from full-blown NLP (if IR is a poorman's
NLP, IE is a middle-class mans NLP..), why we think it is possible
(especially the unwrapping issue).
-
Information Extraction
- L28 [Apr 30, 2010]Audio;
Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)Information extraction; its history; how it differs from NLP, the spectrum of approaches for information extraction and how the scope of the tasks (e.g pattern scope, unary vs. n-ary relation extraction etc) make the task harder; three specific techniques--wrappers/unwrappers that use path expressions on the dom tree to extract information from highly templated text; extractors that use hyponymy patterns on parts-of-speech parse trees to extract general facts; and semtag/seeker which proposes a tag beureau.
-
Information
Integration
-
Sayanora
- L29 [May 4, 2010]Audio;
Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)Information Integration: A brief introduction of the
spectrum of approaches and methods; a short discussion of the
sourcerank work; Ending.
- What
students seem to remember
- 5/11: Final 9:50 - 11:40 AM
Last modified: Mon Aug 15 09:03:52 MST 2011