
I did make the flight...



As I kept going past 1:15pm, when I said I would stop so I could run
off to catch my flight, I could see several concerned looks in the
class. So I am sure you will be glad to know that I did make my flight
to DC (and am typing this on the flight ;-).

Anyway, my apologies for not sticking to my time-frame and thus
failing to have a longer discussion of the semantic web today--especially
if you read the Sci Am paper and were raring to discuss the stuff. It
is hard not to get caught up in my pet topics--the multi-objective
optimization stuff as well as the statistics-gathering stuff are part
of the Havasu system, our own home-brew data integration effort. Your
grader (who, I notice, has increasingly become your TA too and finds
herself helping on homework problems while I while away my office
hours undisturbed ;-)) is working on that stuff, along with Nie and
Ullas. You can get more information on it at

 http://rakaposhi.eas.asu.edu/havasu.html

Come chat with either them or me if you were intrigued by anything
that was said today (although, judging by the rather glum look on most
people's faces, about the only intrigue in the air today was "is he
going to miss his flight or what?").

A couple of comments on what happened in the class: 


--I mentioned that, to reduce the amount of statistics we have to keep,
  we (in Havasu) remember statistics not with respect to individual
  queries, but rather with respect to *classes* of queries. I did,
  however, get away without considering the question of how these query
  classes are generated. There are two possible ways we tried (are
  trying); a small code sketch of each appears after this list.

    1. We can make use of any naturally available hierarchies over the
      values that an attribute can take. For example, if we are fielding
      selection queries on CS bibliographies, then the conference names
      in the queries--AAAI, SIGMOD, etc.--may be leaves of a hierarchy
      of subject areas:
      {CS [DB (SIGMOD, VLDB, ..)] [AI (AAAI, IJCAI, ..)] ...}
      We can then define query classes in terms of the interior nodes of
      this hierarchy (e.g., AI queries, DB queries, etc.). Given a
      query, we then map it into one of these classes and use the
      statistics kept for that class.

    2. Another approach, which is probably more intriguing given all
      that we did in this course, is to automatically "cluster" queries
      into classes. The trick, of course, is to define a "distance
      metric" between queries that we can use. The metric will
      ultimately depend not on the queries themselves but on the overlap
      between the results of running the queries (and since we don't
      really want to run all possible selection queries, we need to
      consider effective statistical sampling techniques). Nie and Ullas
      are each exploring a different type of distance metric for
      clustering queries. One of them winds up considering queries as
      vectors of bags (the representation we discussed with respect to
      NBC on text): run the query, get the representative tuples, and
      make a super-tuple where each attribute value is the bag of the
      attribute values of all the original tuples--so {A 1 u} and
      {B 2 v} become {(A,B), (1,2), (u,v)}--and then use a normal bag
      similarity metric (the kind you used in the homework 2 clustering
      question) on the bags of two queries to compute the similarity
      between the queries.
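   To make approach 1 concrete, here is a small Python sketch. The
   hierarchy, the query format, and the per-class statistics are all
   made-up illustrations (not Havasu's actual code or data):

     # Toy subject-area hierarchy; interior nodes are the query classes.
     HIERARCHY = {
         "CS": ["DB", "AI"],
         "DB": ["SIGMOD", "VLDB"],
         "AI": ["AAAI", "IJCAI"],
     }

     # Invert it: leaf value (conference) -> its interior node (class).
     PARENT = {child: node
               for node, children in HIERARCHY.items()
               for child in children}

     def query_class(conference):
         """Map a selection query on a conference name to its query
         class, falling back to the root for unknown values."""
         return PARENT.get(conference.upper(), "CS")

     # Statistics are then kept per class, not per individual query:
     class_stats = {"DB": {"coverage": 0.8}, "AI": {"coverage": 0.6}}
     print(query_class("sigmod"))              # -> DB
     print(class_stats[query_class("ijcai")])  # use AI's statistics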
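   And here is a sketch of approach 2's distance computation, assuming
   a multiset-Jaccard bag similarity (one reasonable choice; the metrics
   Nie and Ullas are actually exploring may well differ):

     from collections import Counter

     def super_tuple(result_tuples):
         """Collapse a query's result tuples into one tuple of bags,
         e.g. [(A,1,u), (B,2,v)] -> [{A,B}, {1,2}, {u,v}]."""
         return [Counter(column) for column in zip(*result_tuples)]

     def bag_similarity(b1, b2):
         """Bag (multiset) Jaccard similarity of two Counters."""
         intersection = sum((b1 & b2).values())
         union = sum((b1 | b2).values())
         return intersection / union if union else 0.0

     def query_similarity(results1, results2):
         """Average per-attribute bag similarity of two result sets."""
         bags1, bags2 = super_tuple(results1), super_tuple(results2)
         return sum(bag_similarity(a, b)
                    for a, b in zip(bags1, bags2)) / len(bags1)

     # Two sampled result sets that share most of their values:
     q1 = [("A", 1, "u"), ("B", 2, "v")]
     q2 = [("A", 1, "w"), ("B", 2, "v")]
     print(query_similarity(q1, q2))  # high score -> same query class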

[[At a meta-level the foregoing can be interpreted as (a) hey look, he
 is trying to teach us what his grad students are doing--no wonder we
 can't absorb much, or (b) hey look, he is trying to teach us what his
 PhD students are doing--and that doesn't seem all that hard given
 what we already know; if they can do it, we can too.]]

=====================
Regarding the questions by Mark Chung about modeling and handling accuracy--
 In general, the question of how one goes about modeling and measuring the
 quality of the data is pretty much open.

 Quality of the data can have many dimensions, and we don't even really know
 what all those dimensions are. Here are some possible measures:

 --Are the tuples exposed by the database actually sound (with respect
    to some external objective reality)?

 --How "fresh" are the tuples? (This is a sort of twist on the above
   for time-sensitive data.)

 --Are the tuples "dense" (i.e., on average, how many null values
   (missing attribute values) are there per tuple)?

  [You can think of more]

 Quality of the source itself can then be characterized in terms of
  aggregate measures of the quality of its tuples.

 Interesting questions, of course, are:

  --how to estimate these quality parameters
  --how to propagate them (i.e., given the quality parameters for some
     tuples, what will the quality parameters be for their join, etc.?);
     a small sketch follows.
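
 To fix intuitions, here is an illustrative sketch of propagation,
 assuming each tuple carries a soundness probability and that sources
 err independently--both assumptions are mine, not claims from the
 paper linked below:

   def join_soundness(p1, p2):
       """A joined tuple is sound only if both input tuples are; under
       independence, that probability is just the product."""
       return p1 * p2

   def density(tup):
       """Fraction of non-null attribute values in a tuple."""
       return sum(v is not None for v in tup) / len(tup)

   print(join_soundness(0.9, 0.8))  # 0.72: joins can only degrade soundness
   print(density(("A", 1, None)))   # 2 of 3 values present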

Here is the paper that I mentioned in class in connection with the quality stuff:
  
  http://www.acm.org/sigs/sigmod/pods/proc01/online/p50.pdf
 

That is all for now.. 

Rao
[Dec 04, 2002]