CSE 494/598 Information Retrieval, Mining and Integration on the Internet

Next Offering: Spring 2016; Friday 12:15-2:45 (Instructor: Lydia Manikonda)

Additional pointers:

Check out what students say about the last offering of this course in CEAS student evaluations.
Check out what students "learned" from the last offering of this course (in their own words).

Class Piazza Forum (Spring 2015)

Link to the slides used on Friday 1/16/2015 NEW (and here is the audio of the lecture)

Lecture Notes and Recorded Video

Instructor: Subbarao Kambhampati
Class: T/Th 4:30--5:45 BYAC 270
Office Hours: T/Th 1-2pm BY560

Introduction [Aug 18, 2011]
- L1. [Aug 18, 2011] Audio of the lecture 1. Introduction to the course..
Course Overview + Big themes
- L2 Audio of [Aug 23, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) A more technical overview of the course--why structure is important and how we can exploit it, how we can specify or extract it. The course as bringing traditional disciplines to Web--how IR is brought to web.
- L3 Audio of [Aug 23, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) Extending social networks, information integration and classification learning to web. Four BIG and Cross-cutting ideas.
Information Retrieval
- L4 Audio of [Aug 30, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) Start of traditional IR. The problem. The evaluation strategy using Precision/Recall. Relevance--the central concept in traditional IR and how to compute it. What does it depend on? How to find its functional form? How we decide to approximate.
- L5 Audio of [Sep 1, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) Representation choices for D, Q and U--Shingles, words, sentences, meaning etc. Semantics for the collection--sets, bags, vectors etc. Desiderata for similarity metrics. Boolean retrieval models. Set/Bag retrieval models. Jaccard simialrity. Normalizing it.
- L6 Audio of [Sep 6, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) Vector Space similarity. Euclidean and Cosine-theta similarity. tf/idf corrections to the vector space weights.
Indexing, Retrieval, and Tolerant Dictionaries
- L7
  
  Audio of [Sep 8, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) Need for inverted indexes, inverted index datastructures, using inverted indexes, approximate retrieval, start of tolerant dictionaries.
Correlation analysis and Latent Semantic Indexing (Annotated version of the matlab session playing with SVD is here.).
- L8 Audio of [Sep 13, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) K-gram and edit-distance measures. Using them in ranking word suggestions. Bayesian account of spelling correction.
  Improvements to Vector Space Similarity. (very) Brief discussion of relevance feedback. Discussion of term correlation statistics.
- L9 Audio of [Sep 15, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) Association and scalar clusters for correlation analysis. Computing them over global document corpus vs. over query specific corpus vs. over query logs. Connection to collaborative filtering, gmail recepient suggestions.
  Beyond correlation computation--latent semantic indexing. Motivation through malicious oracle. Illustration through fish. Connecting to SVD. Seeing that SVD is doing a low-rank approximation of a matrix.
- L10 Audio of [Sep 20, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) THE BIG LSI LECTURE.
Doing IR on Web: Anchor Text; Page Importance Measures
- L11 Audio of [Sep 22, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) First 30min is about questions on LSI and correlation analysis. The remaining part transitions into IR for WEB, talks about the challenges/opportunities of the web, using tag and anchor structure to improve retrieval. Finally, we talk about the need for having page importance measures, and the desiderata for them.
- L12 Audio of [Sep 27, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)Authorities & Hubs and its relation to primary eigen vectors of AA' and A'A matrices. Discussion about power iteration. Page rank, and stabilizing a stochastic matrix so the corresponding markov chain has a steady state distribution.
- L13 Audio of [Sep 29, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)Side-by-side discussion of authorities/hubs and pagerank link-based importance measures with pagerank. Combining importance and similarity measures. Global vs. query-specific vs. topic-specific importance computation. Evaluating the relevance of a link to a query.
- L14 Audio of [Oct 4th, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)Sushovan's introduction to project part 2. Discussion of the many uses of reset matrix (in terms of recency rank, trust rank etc), stability w.r.t disruption and attack. Discussion of A&H tyranny of majority; how that leads to instability, understanding the instability in terms of eigen gap, two solutions to improve A&H stability--weak links or cross-product of eigen vectors. Robustness to adversarial attack and how it global importance measures are more suceptible to adversarial attack than query-time importance measures. Multiple attacks on page-rank--starting with collusion between pages.
Social networks and their applications on the Web
- L14 Audio of [March 5, 2010] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) Efficient computation of pagerank (and how it is important to not represent the M* matrix explicitly given that it is not sparse); doing block-based pagerank iteration to avoid thrashing, the use of asynchronous pagerank iteration to improve convergence.
  Social networks start. Connections between link-analysis and the general field of social network analysis. Some iconic examples of social network analysis (Typhoid Mary, Patient Zero, Web graph (and its small-world nature), Offcial florida ballot viral spread, Saddam capture, Aardvark acquisition). Applications of social networks. Graph-based representation and analysis of social entworks. Measures of influence and centrality. Smallworld phenomena--and their examples in kevin bacon game and erdos number.
- L15 Audio of [March 9, 2010] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) Milgram experiment; six-degrees of separation; (uniform) random networks and their properties; realizing that the small world probability increases sharply to 1 right near k=1; where k is the average (expected) degree of the random network. If human networks are (uniform) random, then they will have small-world phenomena (since k, i.e., average number of friends per person, is almost always greater than 1). Trying to confirm whether large-scale social networks are in fact uniform random by comparing their degree distribution to the Poisson degree distribution expected for random networks. Realizing that most real world network degree distributions instead correspond to negative sloping straightlines in log-log space (which means they are of the form P=1/k^r, which is called powerlaws. Discussion of the properties of power-law disributions (which have long tails that fall off only polynomially rather than exponentially). Implications of long tails on everything from probability of existence of such highly-linked sites as google to the ability of making money selling west wing DVDs and iranian classical music CDs on the web. Discussion of generative models which can result in power law distributions over network degrees.
- L16 Audio of [March 11, 2010] (Video of the lecture video part 1 (the first 1hr; battery died after that :( ) Attacks vs. disruptions on powerlaw vs. exponential networks; navigation on social networks; applications of social networks; discussion of trust and reputation; trust rank (a page-rank variant); discussion of social search (and aardvark); discussion of othe powerlaws in cse494--zipf's law; heap's law (and even benford's law).
Engineering Issues in Web Search. Crawling; map-reduce; efficient indexing and link analysis
- L15 Audio of [Oct 6th, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)Design issues in web-crawling. Discussion of map-reduce parellelism and distributed file systems.
- L16 Audio of [Oct 11th, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)Continuation of map-reduce architectures; examples of map-reduce implementation of indexing, efficient pagerank computation. Start of clustering.
Clustering of Text Data
- L17 Audio of [Oct 14th, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)Clustering continued. Notions of hard/vs soft clusters; importance of distance measures; internal and external evaluation metrics for clusterings, k-means clustering.
- L18 Audio of [Oct 20th, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min)Problems with K-means clustering. Hierarchical clustering methods--divisive (bisecting k-means) and agglomerative, buck-shot clustering. Clustering on text, use of LSI to reduce dimensions before clusterings, making cluster snippets.
Text Categorization
- L19 Audio of [Oct 25th, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) Text Classification/Categorization. Evaluating classification techniques. Using classifiers as basis for retrieval (aka relevance feedback). Distance-based classification strategies--Rochchio, k-nearest neighbors. Their relative advantages/disadvantages. (Aside on LDA--linear discriminant analysis and LSI as a special case of LDA where each element is its own class). Why distance-based methods are not enough. Learning as pattern-finding. Multiple pattern languages (biases) and their relative tradeoffs.
- L20 Audio of [Oct 27th, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) Parametric vs non-parametric learners. kNN as an example of non-parametric learner. Discussion of Naive Bayes classifier--with background on bayes networks, and the assumptions underlying NBC, and why it still works. Smoothing probability estimates and laplacian smoothing.
Recommender Systems
- L21 Audio of [Nov 1st, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) The theory behind NBC learning (in terms of maximizing likelihood). NBC applied to Text--unigram model. Feature selection using mutual information. connection between feature selection and LSI and LDA.
  Recommendation systems. Content-based filtering and application of naive bayes classifier to vector of bags model of text. Collaborative filtering and its relative tradeoffs vis a vis content-based filtering.
- L22 Audio of [Nov 3rd, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) Complete discussion on collaborative filtering; NETFLIX prize winning enty and their use of LSI; Combining content-based and collaborative filtering; approaches to using unlabelled examples in classification.
  (last 15min) Search Advertising. How it is different from traditional advertising. The three parties: users, search engine, advertisers--and their differeing utility models. Balancing them. Birds-eye view of most of the important challenges.
Search Advertising
- L23 Audio of [Nov 8, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) Search advertising. The simple picture. Complications. How to handle them. (Includes handling budget constraints, handling ranking of ads, setting prices using sealed bid auctions).
Specifying and Exploiting Structure
- L24 Audio of [Nov 10, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) Overview of specifying and exploiting structure. XML as a structure specification langauge. Viewing XML from the point of view of what structure it supports. Understanding IR-style querying on XML. Understanding database-style querying on XML (start).
- L25 Audio of [Nov 15, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) XML from database side; XML schema specification, XQuery--examples and comparision to SQL. XML and Meaning. Ramayana as a vehicle to motivate RDF and OWL.
- L26 Audio of [Nov 17, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) RDF and OWl standards and what they are useful for. Linked data and SPARQL. How to use OWL background knowledge for source alignment.
  Start of Information Extraction.
Automated extraction of Structure: Information Extraction
- L27 Audio of [Nov 22, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) Overview of the scope of information extraction tasks, the easy-to-hard spectrum, overview of techniques.
- L28 Audio of [Nov 29, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) Three examples of IE techniques: Wrapper generation; pattern extraction using hyponym patterns; sequence extraction using hidden markov models (majority of the class). Connection between HMMs and Markov Chains. The computational problems of sequence likelihood; most like state sequence; and learning parameters--and a sketch on how they are solved.
Information Integration
- L29 Audio of [Dec 1, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) More discussion on HMMs. Discussion information integration--motivations, use cases, three types of architectures.
- L30 Audio of [Dec 6, 2011] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) Information Integration: Dimensions of variation and related challenges. Plug for QBayes. Corny ending.
End review

Final Exam: Tuesday December 13th 2:30--4:20pm

Linear Algebra Review (by Sushovan De on [Feb 10, 2010])
Social networks and their applications on the Web
- L14 Audio of [March 5, 2010] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) Efficient computation of pagerank (and how it is important to not represent the M* matrix explicitly given that it is not sparse); doing block-based pagerank iteration to avoid thrashing, the use of asynchronous pagerank iteration to improve convergence.
  Social networks start. Connections between link-analysis and the general field of social network analysis. Some iconic examples of social network analysis (Typhoid Mary, Patient Zero, Web graph (and its small-world nature), Offcial florida ballot viral spread, Saddam capture, Aardvark acquisition). Applications of social networks. Graph-based representation and analysis of social entworks. Measures of influence and centrality. Smallworld phenomena--and their examples in kevin bacon game and erdos number.
- L15 Audio of [March 9, 2010] (Video of the lecture video part 1 (the first 1hr 5min 4gb) and video part 2 (the remaining 10+ min) Milgram experiment; six-degrees of separation; (uniform) random networks and their properties; realizing that the small world probability increases sharply to 1 right near k=1; where k is the average (expected) degree of the random network. If human networks are (uniform) random, then they will have small-world phenomena (since k, i.e., average number of friends per person, is almost always greater than 1). Trying to confirm whether large-scale social networks are in fact uniform random by comparing their degree distribution to the Poisson degree distribution expected for random networks. Realizing that most real world network degree distributions instead correspond to negative sloping straightlines in log-log space (which means they are of the form P=1/k^r, which is called powerlaws. Discussion of the properties of power-law disributions (which have long tails that fall off only polynomially rather than exponentially). Implications of long tails on everything from probability of existence of such highly-linked sites as google to the ability of making money selling west wing DVDs and iranian classical music CDs on the web. Discussion of generative models which can result in power law distributions over network degrees.
- L16 Audio of [March 11, 2010] (Video of the lecture video part 1 (the first 1hr; battery died after that :( ) Attacks vs. disruptions on powerlaw vs. exponential networks; navigation on social networks; applications of social networks; discussion of trust and reputation; trust rank (a page-rank variant); discussion of social search (and aardvark); discussion of othe powerlaws in cse494--zipf's law; heap's law (and even benford's law).

Last modified: Tue Mar 13 07:06:36 MST 2012

CSE 494/598 Information Retrieval, Mining and Integration on the Internet

Next Offering: Spring 2016; Friday 12:15-2:45 (Instructor: Lydia Manikonda)

Additional pointers: Check out what students say about the last offering of this course in CEAS student evaluations. Check out what students "learned" from the last offering of this course (in their own words).

Class Piazza Forum (**Spring 2015**)

Link to the slides used on Friday 1/16/2015 *NEW* (and here is the audio of the lecture)

Lecture Notes and Recorded Video

Instructor: Subbarao Kambhampati Class: T/Th 4:30--5:45 BYAC 270 Office Hours: T/Th 1-2pm BY560

Additional pointers:

Check out what students say about the last offering of this course in CEAS student evaluations.
Check out what students "learned" from the last offering of this course (in their own words).

Class Piazza Forum (Spring 2015)

Link to the slides used on Friday 1/16/2015 NEW (and here is the audio of the lecture)

Instructor: Subbarao Kambhampati
Class: T/Th 4:30--5:45 BYAC 270
Office Hours: T/Th 1-2pm BY560