Yaron Kanza The Rachel and Selim Benin School of Engineering and Computer Science The Hebrew University of Jerusalem Oblivious Querying of Data with Irregular Structure
Joint Work With Queries with Incomplete Answers Flexible Queries SQL4X Computing Full Disjunctions
The Semistructured Data Model Data is described as a rooted labeled directed graph Nodes represent objects Edges represent relationships between objects Atomic values are attached to atomic nodes
Irregular Data Data is incomplete - Missing values of attributes in objects
Data has structural variations - Relationships between objects are represented differently in different parts of the database
Data has ontology variations - Different labels are used to describe objects of the same type
Can Regular Expressions Help in Querying Irregular Data? In many cases, regular expressions can be used to query irregular data Yet, regular expressions are - Not efficient – it is difficult to evaluate regular expressions
- Not intuitive – it is difficult for a naïve user to formulate regular expressions
More on Using Regular Expressions When querying irregular data, the size of the regular expression could be exponential in the number of labels in the database - For n types of objects, there are n! possible hierarchies
- For an object with n attributes, there are 2n subsets of missing attributes
Queries with Incomplete Answers We have developed queries that deal with incomplete data in a novel way and return incomplete answers The queries return maximal answers rather than complete answers Different query semantics admit different levels of incompleteness
The queries are labeled rooted directed graphs Query nodes are variables Matchings are assignments of database objects to the query variables according to - the constraints specified in the query, and
- the semantics of the query
Constraints On Complete Matchings
The Reachability Constraint on Partial Matchings A query node v that is mapped to a database object o satisfies the reachability constraint if there is a path from the query root to v, such that all edge constraints along this path are satisfied
“And” Matchings A partial matching is an AND matching if - The root constraint is satisfied
- The reachability constraint is satisfied by every query node that is mapped to a database node
- If a query node is mapped to a database node, all the incoming edge constraints are satisfied
Weak Satisfaction of Edge Constraints
Weak Matchings A partial matching is a weak matching if - The root constraint is satisfied
- The reachability constraint is satisfied by every query node that is mapped to a database node
- Every edge constraint is weakly satisfied
“OR” Matchings A partial matching is an OR matching if - The root constraint is satisfied
- The reachability constraint is satisfied by every query node that is mapped to a database node
Increasing Level of Incompleteness A complete matching is an AND matching An AND matching is a weak matching A weak matching is an OR matching
Maximal Matchings A tuple t1 subsumes a tuple t2 if t1 is the result of replacing some null values in t2 by non-null values: A matching is maximal if no other matching subsumes it A query result consists of maximal matchings only
On the Complexity of Computing Queries with Incomplete Answers The size of the result can be exponential in the size of the input (database and query) - Note that the same is true when joining relations – the size of the result can be exponential in the size of the input (database and query)
Instead of using data complexity (where the runtime depends only on the size of the database), we use input-output complexity
Input-Output Complexity
The Motivation for Using I/O Complexity Measuring the time complexity with respect to the size of the input does not separate between the following two cases: - An algorithm that does an exponential amount of work simply because the size of the output is exponential in the size of the input
- An algorithm that does an exponential amount of work even when the query result is small
- Either the algorithm is naïve (e.g., it unnecessarily computes subsumed matchings) or the problem is hard
I/O Complexity of Query Evaluation (lower bounds are for non-emptiness)
Filter Constraints Constraints that filter the results (i.e., the maximal matchings) There are - Weak filter constraints (the constraint is satisfied if a variable in the constraint is null)
- Strong filter constraints (all variables must be non-null for satisfaction)
Existence constraint: !x is true if x is not null
I/O Complexity of Query Evaluation with Existence Constraints (lower bounds are for non-emptiness)
I/O Complexity of Query Evaluation with Weak Equality/Inequality Constraints (lower bounds are for non-emptiness)
Query Containment Query containments for queries with incomplete answers is defined differently from query containment for queries with complete answers Q1 Q2 if for all database D, - every matching of Q1 w.r.t. to D
- is subsumed by
- a matchings of Q2 w.r.t. to D
Query containment (query equivalence) is useful for the development of optimization techniques
Containment in AND Semantics Homomorphism between the query graphs is necessary and sufficient for containment
Containment in OR Semantics The following is a necessary and sufficient condition for query containment in OR semantics For every spanning tree T1 of the contained query, there a spanning tree T2 of the containing query, such that there is a homomorphism from T2 to T1 - is in ΠP2
- NP-Complete if the containee is a tree
- polynomial if the container is a tree
Containment in Weak Semantics Similar to containment in OR Semantics, with the following difference Instead of checking homomorphism between spanning trees, we check homomorphism between graph fragments - A graph fragment is a restriction of the query to a subset of the variables that includes the query root such that every node in the fragment is reachable from the root
Flexible Queries To deal with structural variations in the data, we have developed flexible queries
Example
Constraints On Rigid Matchings
A Semiflexible Matching The query root is mapped to the db root
A Semiflexible Matching The query root is mapped to the db root
The query root is mapped to the db root
An Example of a Flexible Query
Differences Between the Semiflexible and Flexible Semantics On a technical level, in flexible matchings - Query paths are not necessarily embedded in database paths
- SCC’s are not necessarily mapped to SCC’s
On a conceptual level, in the semiflexible semantics, nodes are “semantically related” if they are on the same path, and hence - Query paths are embedded in database paths
In the flexible semantics, this condition is relaxed: - Query edges are embedded in database paths
Increasing Level of Flexibility A rigid matching is a semiflexible matching A semiflexible matching is a flexible matching
Verifying that Mappings are Semiflexible Matchings Is a given mapping of query nodes to database nodes a semiflexible matching? - Not as simple as for rigid matchings (no local test, i.e., need to consider paths rather than edges)
In a dag query, the number of paths may be exponential - Yet, verifying is in polynomial time
In a cyclic query, the number of paths may be infinite - Yet, verifying is in exponential time
Verifying that a Mapping is a Semiflexible Matching
Input-Output Complexity of Query Evaluation for the Semiflexible Semantics - Polynomial for a dag query and a tree database (or simpler cases)
- Rather difficult to prove, even when the query is a tree, since there is no local test for verifying that mappings are semiflexible matchings
- Exponential lower bounds for other cases
I/O Complexity for SF Semantics (lower bounds are for non-emptiness)
Query Evaluation for the Flexible Semantics The database is replaced with a relationship graph which is a graph, such that - The nodes are the nodes of the database
- Two nodes are connected by an edge if there is a path between them in the database (the direction of the path is unimportant)
The query is evaluated under rigid semantics w.r.t. the relationship graph
I/O Complexity of Query Evaluation for the Flexible Semantics Results follow from a reduction to query evaluation under the rigid semantics Tree query - Input-Output complexity is polynomial
DAG query - Testing for non-emptiness is NP-Complete
Query Containment Q1 Q2 if for all database D, - the set of matchings of Q1 w.r.t. to D
- is contained in
- the set of matchings of Q2 w.r.t. to D
We assume that - Both queries have the same set of variables
Complexity of Query Containment Under the semiflexible semantics, Q1 Q2 iff the identity mapping from the variables of Q2 to the variables of Q1 is a semiflexible matching of Q2 w.r.t. Q1 Thus, containment is - in coNP when Q1 is a cyclic graph and Q2 is either a dag or a cyclic graph
- in polynomial time in all other cases
Under the flexible semantics, query containment is always in polynomial time
Database Equivalence D1 and D2 are equivalent if for all queries Q, - the set of matchings of Q w.r.t. to D1
- is equal to
- the set of matchings of Q w.r.t. to D2
Both databases must have the same set of objects and the same root
Complexity of Database Equivalence For the semiflexible semantics, deciding equivalence of databases is - in polynomial time if both databases are dags
- in coNP if one of the databases has cycles
For the flexible semantics, deciding equivalence of databases is polynomial in all cases
Database Transformation
Transforming a Database into a Tree Reasons for transforming a database into an equivalent tree database: - Evaluation of queries over a tree database is more efficient
- In a graphical user interface, it is easier to represent trees than DAGs or cyclic graphs
- Storing the data in a serial form (e.g., XML) requires no references
Transformation into a Tree There are algorithms for - Testing if a database can be transformed into an equivalent tree database, and
- Performing the transformation
For the semiflexible semantics - The algorithms are polynomial
For the flexible semantics
Implementing Flexible Queries Flexible queries were implemented in SQL4X In an SQL4X query, relations and XML documents are queried simultaneously A query result can be either a relation or an XML document
Combining the Paradigms In oblivious querying: - The user does not have to know where data is incomplete
- The user does not have to know the exact structure of the data
The paradigm of flexible queries and the paradigm of queries with incomplete answers should be combined
Flexible Queries with Incomplete Answers A flexible query w.r.t. a database is actually a rigid query w.r.t. the relationship graph Evaluating a query in AND-semantics (weak semantics, OR-Semantics) w.r.t. the relationship graph produces a flexible query that returns maximal answers rather than complete answers
Full Disjunction Intuitively, the full disjunction of a given set of relations is the join of these relations that does not discard dangling tuples Dangling tuples are padded with nulls Only maximal tuples are retained in the full disjunction (as in the case of QwIA)
In the Full Disjunction of a Given Set of Relations:
Motivation for Full Disjunctions Full disjunctions have been proposed by Galiando-Legaria as an alternative for outerjoins [SIGMOD’94] Rajaraman and Ullman suggested to use full disjunctions for information integration [PODS’96]
Computing Full Disjunctions for γ-acyclic Relation Schemas Rajaraman and Ullman have shown how to evaluate the full disjunction by a sequence of natural outerjoins when the relation schemas are γ-acyclic Hence, the full disjunction can be computed in polynomial time, under input-output complexity, when the relation schemas are γ-acyclic
Weak Semantics Generalizes Full Disjunctions Relations can be converted into a semistructured database The full disjunction can be expressed as the union of several queries that are evaluated under weak semantics
Generalizing Full Disjunctions In a full disjunction, tuples are joined according to equality constraints as in a natural join (or equi-join) We can generalize full disjunctions to support constraints that are not merely equality among attributes
Example
Another Way of Generalizing Full Disjunctions: Use OR-Semantics OR-semantics is used rather than weak semantics when tuples are joined This relaxes the requirement that every pair of tuples should be join consistent Instead, a tuple of the full disjunction is only required to be generated by database tuples that form a connected subgraph, but need not be pairwise join consistent
Conclusion Flexible and semiflexible queries facilitate easy and intuitive querying of semistructured databases - Querying the database even when the user is oblivious to the structure of the database
- Queries are insensitive to variations in the structure of the database
Conclusion (continued) Queries in AND semantics, OR semantics or weak semantics facilitate easy and intuitive querying of incomplete databases - Querying the database even when the user is oblivious to missing data
- Queries return maximal answers rather than complete answers
Conclusion (continued) The two paradigms of flexible queries and queries with maximal answers can be combined The combination of the paradigms can facilitate integration of information from heterogeneous sources
Conclusion (continued) Full disjunctions can be computed using queries in weak semantics Full disjunctions can be generalized so that relations are joined using constraints that are not merely equality constraints
Do'stlaringiz bilan baham: |