I did make the flight...
As I kept going past 1:15pm, when I had said I would stop so I could run
off to catch my flight, I could see several concerned looks in the class..
So I am sure you will be glad to know that I did make my flight to DC (and
am typing this on the flight ;-).
Anyways, my apologies for not sticking to my time-frame and thus
failing to have a longer discussion of the semantic web today--especially
if you read the Sci Am paper and were raring to discuss the stuff. It is
hard not to get caught up with my pet topics--the multi-objective
optimization stuff as well as the statistics-gathering stuff are part
of the Havasu system, our own home-brew data integration effort. Your
grader (who, I notice, has increasingly become your TA too and finds
herself helping on homework problems while I while away my office
hours undisturbed ;-)) is working on that stuff, along with Nie and
Ullas. You can get more information on it at
http://rakaposhi.eas.asu.edu/havasu.html
Come chat with either them or me if you were intrigued by anything
that was said today (although, judging by the rather glum look on most
people's faces, about the only intrigue in the air today was "is he
going to miss his flight or what?").
A couple of comments on what happened in the class:
--I mentioned that, to cut down on the amount of statistics we have to gather and
store, we (in Havasu) remember statistics not with respect to individual queries, but
rather with respect to *classes* of queries. I did, however, get away without
considering the question of how these query classes are generated.
There are two possible ways we tried (are trying):
1. We can make use of any naturally available hierarchies in the values that
an attribute can take. For example, if we are fielding
selection queries on CS bibliographies, then the conference names in the
queries--AAAI, SIGMOD etc.--may be leaves of a hierarchy of subject areas
{CS [DB (SIGMOD, VLDB ..)] [AI (AAAI, IJCAI ..)] ...}
We can then define query classes in terms of the interior nodes of this hierarchy
(e.g. AI queries, DB queries etc.). Given a query, we will then have to
map it into one of these classes, and then use the statistics for that class
(a small sketch of this follows after the next item).
2. Another approach, which is probably more intriguing given all that we did in this
course, is to automatically "cluster" queries into classes. Now the trick of course
is to define a "distance metric" between queries that we can use. The metric
will ultimately depend not on the queries themselves but on the overlap between
the results of running the queries (and since we don't really want to run all
possible selection queries, we need to consider effective statistical sampling
techniques). Nie and Ullas are each exploring a different type of distance
metric for clustering queries. One of them winds up considering queries as
vectors of bags (of the sort we discussed with respect to NBC on text): run the query,
get the representative tuples, and make a super tuple where each attribute value
is the bag of attribute values of all the original tuples, so that
{A 1 u} {B 2 v} become {(A,B), (1,2), (u,v)}; then use a normal bag similarity
metric (the kind you used in homework 2, clustering question) on the bags
of the queries to compute the similarity between the queries (again, see the
sketch below).
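For concreteness, here are two tiny Python sketches of the above. These are
purely illustrative--none of this is the actual Havasu code, and the toy
hierarchy, the function names, and the choice of a per-attribute bag-Jaccard
similarity are all stand-ins I made up for the example.

  # Approach 1 (sketch): map a query to a class using a value hierarchy.
  # The hierarchy below is a made-up toy; in practice it would come from
  # whatever natural hierarchy the attribute values already have.
  SUBJECT_HIERARCHY = {
      "DB": {"sigmod", "vldb"},
      "AI": {"aaai", "ijcai"},
  }

  def query_class(conference):
      """Return the interior node (query class) whose statistics we reuse."""
      for area, leaves in SUBJECT_HIERARCHY.items():
          if conference.lower() in leaves:
              return area            # e.g. "SIGMOD" -> "DB"
      return "CS"                    # fall back to the root class

  # Approach 2 (sketch): represent each query by a "super tuple" of bags
  # and compare queries with a bag similarity (here, multiset Jaccard
  # averaged over the attributes).
  from collections import Counter

  def super_tuple(result_tuples):
      """[('A',1,'u'), ('B',2,'v')] -> one bag (Counter) per attribute,
      roughly the {(A,B), (1,2), (u,v)} of the example above."""
      return [Counter(column) for column in zip(*result_tuples)]

  def bag_jaccard(b1, b2):
      union = sum((b1 | b2).values())
      return sum((b1 & b2).values()) / union if union else 0.0

  def query_similarity(results1, results2):
      """Average per-attribute bag similarity between the (sampled) results
      of two queries; assumes both have the same, nonempty, attributes."""
      s1, s2 = super_tuple(results1), super_tuple(results2)
      return sum(bag_jaccard(a, b) for a, b in zip(s1, s2)) / len(s1)

Once you have something like query_similarity, any standard clustering
method (such as the one from homework 2) can group the queries into classes.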
[[At a meta-level the foregoing can be interpreted as (a) hey look, he
is trying to teach us what his grad students are doing--no wonder we
can't absorb much, or (b) hey look, he is trying to teach us what his
PhD students are doing--and that doesn't seem all that hard given
what we already know--if they can do it, we can too.]]
=====================
Regarding the questions by Mark Chung about modeling and handling accuracy--
In general, the question of how one goes about modeling and measuring the quality of
the data is pretty much open.
Quality of the data can have many dimensions, and we don't even really know what all
those dimensions are. Here are some measures:
--Are the tuples exposed by the database actually sound (with respect to some
external objective reality)?
--How "fresh" are the tuples? (This is a sort of a twist on the above for
time-sensitive data.)
--Are the tuples "dense" (i.e., on average, how many null values (missing
attribute values) are there per tuple)? A tiny sketch of this one is below.
[You can think of more]
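As a concrete (and again made-up) illustration of the "density" measure, here
is one way you could compute a density score--the fraction of attribute values
that are actually filled in, i.e. the complement of the average number of nulls
asked about above--assuming tuples are represented as Python dicts with None
standing in for a missing value:

  def density(tuples):
      """Average fraction of attribute values per tuple that are non-null."""
      if not tuples:
          return 0.0
      per_tuple = [sum(v is not None for v in t.values()) / len(t)
                   for t in tuples]
      return sum(per_tuple) / len(tuples)

  # density([{"title": "X", "year": None},
  #          {"title": "Y", "year": 2002}])  -> 0.75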
The quality of the source itself can be characterized in terms of aggregate
measures of the quality of its tuples.
Interesting questions of course are:
--how to estimate these quality parameters
--how to propagate them (i.e., given the quality parameters for some tuples,
what will the quality parameters be for their join etc.?--a toy illustration follows)
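To make the propagation question concrete with a toy example: if you are willing
to pretend that tuple soundness values behave like independent probabilities (a
strong assumption, purely for illustration), then a joined tuple is sound only if
both of its constituent tuples are:

  def join_soundness(p_left, p_right):
      # soundness of the joined tuple, under an independence assumption
      return p_left * p_right

  # join_soundness(0.9, 0.8) -> 0.72

Whether such simple propagation rules are reasonable (and what to do when the
independence assumption does not hold) is part of what makes the question open.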
Here is the paper that I mentioned in the class in connection to quality stuff:
http://www.acm.org/sigs/sigmod/pods/proc01/online/p50.pdf
That is all for now..
Rao
[Dec 04, 2002]