Yaron Kanza The Rachel and Selim Benin School of Engineering and Computer Science The Hebrew University of Jerusalem


Download 557 b.
Sana24.05.2018
Hajmi557 b.


Yaron Kanza The Rachel and Selim Benin School of Engineering and Computer Science The Hebrew University of Jerusalem

  • Oblivious Querying of Data with Irregular Structure


Joint Work With

  • Queries with Incomplete Answers

    • Werner Nutt, Shuky Sagiv
  • Flexible Queries

    • Shuky Sagiv
  • SQL4X

    • Sara Cohen, Shuky Sagiv
  • Computing Full Disjunctions

    • Shuky Sagiv






The Semistructured Data Model

  • Data is described as a rooted labeled directed graph

  • Nodes represent objects

  • Edges represent relationships between objects

  • Atomic values are attached to atomic nodes















Irregular Data

  • Data is incomplete

    • Missing values of attributes in objects
  • Data has structural variations

    • Relationships between objects are represented differently in different parts of the database
  • Data has ontology variations

    • Different labels are used to describe objects of the same type






Can Regular Expressions Help in Querying Irregular Data?

  • In many cases, regular expressions can be used to query irregular data

  • Yet, regular expressions are

    • Not efficient – it is difficult to evaluate regular expressions
    • Not intuitive – it is difficult for a naïve user to formulate regular expressions


More on Using Regular Expressions

  • When querying irregular data, the size of the regular expression could be exponential in the number of labels in the database

    • For n types of objects, there are n! possible hierarchies
    • For an object with n attributes, there are 2n subsets of missing attributes




Queries with Incomplete Answers

  • We have developed queries that deal with incomplete data in a novel way and return incomplete answers

  • The queries return maximal answers rather than complete answers

  • Different query semantics admit different levels of incompleteness





Queries and Matchings

  • The queries are labeled rooted directed graphs

  • Query nodes are variables

  • Matchings are assignments of database objects to the query variables according to

    • the constraints specified in the query, and
    • the semantics of the query


Constraints On Complete Matchings









The Reachability Constraint on Partial Matchings

  • A query node v that is mapped to a database object o satisfies the reachability constraint if there is a path from the query root to v, such that all edge constraints along this path are satisfied



“And” Matchings

  • A partial matching is an AND matching if

    • The root constraint is satisfied
    • The reachability constraint is satisfied by every query node that is mapped to a database node
    • If a query node is mapped to a database node, all the incoming edge constraints are satisfied






Weak Satisfaction of Edge Constraints



Weak Matchings

  • A partial matching is a weak matching if

    • The root constraint is satisfied
    • The reachability constraint is satisfied by every query node that is mapped to a database node
    • Every edge constraint is weakly satisfied








“OR” Matchings

  • A partial matching is an OR matching if

    • The root constraint is satisfied
    • The reachability constraint is satisfied by every query node that is mapped to a database node




Increasing Level of Incompleteness

  • A complete matching is an AND matching

  • An AND matching is a weak matching

  • A weak matching is an OR matching



Maximal Matchings

  • A tuple t1 subsumes a tuple t2 if t1 is the result of replacing some null values in t2 by non-null values:

  • A matching is maximal if no other matching subsumes it

  • A query result consists of maximal matchings only



On the Complexity of Computing Queries with Incomplete Answers

  • The size of the result can be exponential in the size of the input (database and query)

    • Note that the same is true when joining relations – the size of the result can be exponential in the size of the input (database and query)
  • Instead of using data complexity (where the runtime depends only on the size of the database), we use input-output complexity



Input-Output Complexity



The Motivation for Using I/O Complexity

  • Measuring the time complexity with respect to the size of the input does not separate between the following two cases:

    • An algorithm that does an exponential amount of work simply because the size of the output is exponential in the size of the input
    • An algorithm that does an exponential amount of work even when the query result is small
      • Either the algorithm is naïve (e.g., it unnecessarily computes subsumed matchings) or the problem is hard


I/O Complexity of Query Evaluation (lower bounds are for non-emptiness)



Filter Constraints

  • Constraints that filter the results (i.e., the maximal matchings)

  • There are

    • Weak filter constraints (the constraint is satisfied if a variable in the constraint is null)
    • Strong filter constraints (all variables must be non-null for satisfaction)
  • Existence constraint: !x is true if x is not null



I/O Complexity of Query Evaluation with Existence Constraints (lower bounds are for non-emptiness)



I/O Complexity of Query Evaluation with Weak Equality/Inequality Constraints (lower bounds are for non-emptiness)



Query Containment

  • Query containments for queries with incomplete answers is defined differently from query containment for queries with complete answers

  • Q1  Q2 if for all database D,

    • every matching of Q1 w.r.t. to D
    • is subsumed by
    • a matchings of Q2 w.r.t. to D
  • Query containment (query equivalence) is useful for the development of optimization techniques



Containment in AND Semantics

  • Homomorphism between the query graphs is necessary and sufficient for containment



Containment in OR Semantics

  • The following is a necessary and sufficient condition for query containment in OR semantics

  • For every spanning tree T1 of the contained query, there a spanning tree T2 of the containing query, such that there is a homomorphism from T2 to T1

    • is in ΠP2
    • NP-Complete if the containee is a tree
    • polynomial if the container is a tree


Containment in Weak Semantics

  • Similar to containment in OR Semantics, with the following difference

  • Instead of checking homomorphism between spanning trees, we check homomorphism between graph fragments

    • A graph fragment is a restriction of the query to a subset of the variables that includes the query root such that every node in the fragment is reachable from the root




Flexible Queries

  • To deal with structural variations in the data, we have developed flexible queries





Example





Constraints On Rigid Matchings





A Semiflexible Matching

  • The query root is mapped to the db root



A Semiflexible Matching

  • The query root is mapped to the db root









A Flexible Matching

  • The query root is mapped to the db root



An Example of a Flexible Query









Differences Between the Semiflexible and Flexible Semantics

  • On a technical level, in flexible matchings

    • Query paths are not necessarily embedded in database paths
    • SCC’s are not necessarily mapped to SCC’s
  • On a conceptual level, in the semiflexible semantics, nodes are “semantically related” if they are on the same path, and hence

    • Query paths are embedded in database paths
  • In the flexible semantics, this condition is relaxed:

    • Query edges are embedded in database paths


Increasing Level of Flexibility

  • A rigid matching is a semiflexible matching

  • A semiflexible matching is a flexible matching



Verifying that Mappings are Semiflexible Matchings

  • Is a given mapping of query nodes to database nodes a semiflexible matching?

    • Not as simple as for rigid matchings (no local test, i.e., need to consider paths rather than edges)
  • In a dag query, the number of paths may be exponential

    • Yet, verifying is in polynomial time
  • In a cyclic query, the number of paths may be infinite

    • Yet, verifying is in exponential time


Verifying that a Mapping is a Semiflexible Matching



Input-Output Complexity of Query Evaluation for the Semiflexible Semantics

  • Next slide summarizes results about the input-output complexity

    • Polynomial for a dag query and a tree database (or simpler cases)
      • Rather difficult to prove, even when the query is a tree, since there is no local test for verifying that mappings are semiflexible matchings
    • Exponential lower bounds for other cases


I/O Complexity for SF Semantics (lower bounds are for non-emptiness)



Query Evaluation for the Flexible Semantics

  • The database is replaced with a relationship graph which is a graph, such that

    • The nodes are the nodes of the database
    • Two nodes are connected by an edge if there is a path between them in the database (the direction of the path is unimportant)
  • The query is evaluated under rigid semantics w.r.t. the relationship graph



I/O Complexity of Query Evaluation for the Flexible Semantics

  • Results follow from a reduction to query evaluation under the rigid semantics

  • Tree query

    • Input-Output complexity is polynomial
  • DAG query

    • Testing for non-emptiness is NP-Complete


Query Containment

  • Q1  Q2 if for all database D,

    • the set of matchings of Q1 w.r.t. to D
    • is contained in
    • the set of matchings of Q2 w.r.t. to D
  • We assume that

    • Both queries have the same set of variables


Complexity of Query Containment

  • Under the semiflexible semantics, Q1  Q2 iff the identity mapping from the variables of Q2 to the variables of Q1 is a semiflexible matching of Q2 w.r.t. Q1

  • Thus, containment is

    • in coNP when Q1 is a cyclic graph and Q2 is either a dag or a cyclic graph
    • in polynomial time in all other cases
  • Under the flexible semantics, query containment is always in polynomial time



Database Equivalence

  • D1 and D2 are equivalent if for all queries Q,

    • the set of matchings of Q w.r.t. to D1
    • is equal to
    • the set of matchings of Q w.r.t. to D2
  • Both databases must have the same set of objects and the same root



Complexity of Database Equivalence

  • For the semiflexible semantics, deciding equivalence of databases is

    • in polynomial time if both databases are dags
    • in coNP if one of the databases has cycles
  • For the flexible semantics, deciding equivalence of databases is polynomial in all cases



Database Transformation



Transforming a Database into a Tree

  • Reasons for transforming a database into an equivalent tree database:

    • Evaluation of queries over a tree database is more efficient
    • In a graphical user interface, it is easier to represent trees than DAGs or cyclic graphs
    • Storing the data in a serial form (e.g., XML) requires no references


Transformation into a Tree

  • There are algorithms for

    • Testing if a database can be transformed into an equivalent tree database, and
    • Performing the transformation
  • For the semiflexible semantics

    • The algorithms are polynomial
  • For the flexible semantics



Implementing Flexible Queries

  • Flexible queries were implemented in SQL4X

  • In an SQL4X query, relations and XML documents are queried simultaneously

  • A query result can be either a relation or an XML document











Combining the Paradigms

  • In oblivious querying:

    • The user does not have to know where data is incomplete
    • The user does not have to know the exact structure of the data
  • The paradigm of flexible queries and the paradigm of queries with incomplete answers should be combined



Flexible Queries with Incomplete Answers

  • A flexible query w.r.t. a database is actually a rigid query w.r.t. the relationship graph

  • Evaluating a query in AND-semantics (weak semantics, OR-Semantics) w.r.t. the relationship graph produces a flexible query that returns maximal answers rather than complete answers









Full Disjunction

  • Intuitively, the full disjunction of a given set of relations is the join of these relations that does not discard dangling tuples

  • Dangling tuples are padded with nulls

  • Only maximal tuples are retained in the full disjunction (as in the case of QwIA)









In the Full Disjunction of a Given Set of Relations:



Motivation for Full Disjunctions

  • Full disjunctions have been proposed by Galiando-Legaria as an alternative for outerjoins [SIGMOD’94]

  • Rajaraman and Ullman suggested to use full disjunctions for information integration [PODS’96]



Computing Full Disjunctions for γ-acyclic Relation Schemas

  • Rajaraman and Ullman have shown how to evaluate the full disjunction by a sequence of natural outerjoins when the relation schemas are γ-acyclic

  • Hence, the full disjunction can be computed in polynomial time, under input-output complexity, when the relation schemas are γ-acyclic



Weak Semantics Generalizes Full Disjunctions

  • Relations can be converted into a semistructured database

  • The full disjunction can be expressed as the union of several queries that are evaluated under weak semantics



Generalizing Full Disjunctions

  • In a full disjunction, tuples are joined according to equality constraints as in a natural join (or equi-join)

  • We can generalize full disjunctions to support constraints that are not merely equality among attributes



Example



Another Way of Generalizing Full Disjunctions: Use OR-Semantics

  • OR-semantics is used rather than weak semantics when tuples are joined

  • This relaxes the requirement that every pair of tuples should be join consistent

  • Instead, a tuple of the full disjunction is only required to be generated by database tuples that form a connected subgraph, but need not be pairwise join consistent













Conclusion

  • Flexible and semiflexible queries facilitate easy and intuitive querying of semistructured databases

    • Querying the database even when the user is oblivious to the structure of the database
    • Queries are insensitive to variations in the structure of the database


Conclusion (continued)

  • Queries in AND semantics, OR semantics or weak semantics facilitate easy and intuitive querying of incomplete databases

    • Querying the database even when the user is oblivious to missing data
    • Queries return maximal answers rather than complete answers


Conclusion (continued)

  • The two paradigms of flexible queries and queries with maximal answers can be combined

  • The combination of the paradigms can facilitate integration of information from heterogeneous sources



Conclusion (continued)

  • Full disjunctions can be computed using queries in weak semantics

  • Full disjunctions can be generalized so that relations are joined using constraints that are not merely equality constraints






Do'stlaringiz bilan baham:


Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©fayllar.org 2019
ma'muriyatiga murojaat qiling