The Role of Agents in Distributed Data Mining: Issues and Benefits


Download 445 b.
Sana03.09.2018
Hajmi445 b.


The Role of Agents in Distributed Data Mining: Issues and Benefits

  • Josenildo Costa da Silva 1, Matthias Klusch 1, Stefano Lodi 2, Gianluca Moro 2, Claudio Sartori 2

  • 1Deduction and Multiagent Systems,

  • German Research Center for Artificial Intelligence,

  • Saarbruecken, Germany

  • 2Department of Electronics, Computer Science and Systems,

  • University of Bologna,

  • Bologna, Italy


Distributed Data Mining (DDM)

  • Data sets

    • Massive
    • Inherently distributed
  • Networks

    • Limited bandwidth
    • Limited computing resources at nodes
  • Privacy and security

    • Sensitive data
    • Share goals, not data


Centralized solution

  • Apply traditional DM algorithms to data retrieved from different sources and stored in a data warehouse

  • May be impractical or even impossible for some business settings

    • Autonomy of data sources
    • Data privacy
    • Scalability (~TB/d)


Agents and DDM

  • DDM exploits distributed processing and problem decomposability

  • Is there any real added value of using concepts from agent technology in DDM?

    • Few DDM algorithms use agents
    • Evidence that cooperation among distributed DM processes may allow effective mining even without centralized control
    • Autonomy, adaptivity, deliberative reasoning naturally fit into the DDM framework


State of the Art

  • BODHI

    • Mobile agent platform/Framework for collective DM on heterogeneous sites
  • PADMA

    • Clustering homogeneous sites
    • Agent based text classification/visualization
  • JAM

    • Metalearning, classifiers
  • Papyrus

    • Wide area DDM over clusters
    • Move data/models/results to minimize network load


Agents for DDM (pros)

  • Autonomy of data sources

  • Scalability of DM to massive distributed data

  • Multi-strategy DDM

  • Collaborative DM



Agents for DDM (against)

  • Need to enforce minimal privileges at a data source

    • Unsolicited access to sensitive data
    • Eavesdropping
    • Data tampering
    • Denial of service attacks


The Inference Problem

  • Work in statistical DB (mid 70’s)

  • Integration/aggregation at the summary level is inherent in DDM

    • Infer sensitive data even from partial integration to a certain extent and with some probability (inference problem)
    • Existing DDM systems are not capable of coping with the inference problem


Data Clustering

  • Popular problem

    • Statistics (cluster analysis)
    • Pattern Recognition
    • Data Mining
  • Decompose multivariate data set into groups of objects



DE-clustering

  • Clustering based on non-parametric density estimation

    • Construct an estimate of the probability density function from the data set
    • Objects “attracted” by a local maximum of the estimate belong to the same cluster


Kernel Density Estimation

  • The higher the number of data objects in the neighbourhood of x, the higher density at x

  • A data object exerts more influence on the value of the estimate at x than any data object farther from x than xi

  • The influence of data objects is radial



Formalizing Density Estimators

  • The density estimate at a space object x is proportional to a sum of weights

  • The sum consists of one weight for every data object

  • Weight is a monotonically decreasing function (kernel ) of the distance between x and xi scaled by a factor h (window width )



Kernel Functions

  • Uniform kernel



Kernel Functions

  • Triangular kernel



Kernel Functions

  • Epanechnikov’s kernel



Kernel Functions

  • Gaussian kernel



Example (1/2)

  • Uniform kernel, h=250



Example (2/2)

  • Gaussian kernel, h=250



Distributed Data Clustering (1/2)

  • Clustering algorithm A( )

  • Homogeneous distributed data clustering problem for A:

    • Data set S
    • Sites Lj
    • Lj stores data set Dj with


Distributed Data Clustering (2/2)

  • Problem: find clustering Cj in the data space of Lj such that:

    • Cj agree with A(S) (correctness requirement):
    • Time/communication costs are minimized (efficiency requirement)
    • The size of data transferred out of the data space of any Lj is minimized (privacy requirement)


Traditional (centralized) solution

  • Gather all local data sets into one centralized repository (e.g., a data warehouse)

  • Run A( ) on the centralized data set

  • Unsatisfied privacy requirement

  • Unsatisfied efficiency requirement for some A( )



Sampling

  • Goal: given some class of functions of type

  • represent every member as a sampling series

  • where:

    • is a collection of points of
    • is some set of suitable expansion functions


Example

  • The class of polynomials of degree 1

    • Sampling points
    • Expansion functions
  • Finite sum



Band-limited Functions

  • Function f of one real variable

  • Range of frequencies of a function f  support of the Fourier transform of f

  • Any function whose range of frequencies is confined to a bounded set B is called band-limited to B (the band-region)



Example: sinc function



Sampling Theorem

  • If f is band-limited with band-region

  • then



Sampling Theorem (scaled multidimensional version)

  • Let

    • where is the -th component of a vector
  • If f is band-limited to B then



Sampling Density Estimates (1/4)

  • Additivity of density estimates of a distributed data set



Sampling Density Estimates (2/4)



Sampling Density Estimates (3/4)

  • Truncation errors

    • The support of a kernel function is not bounded in general
  • Aliasing errors

    • The support of the Fourier transform of a kernel function is not bounded in general  kernel functions are not band-limited


Sampling Density Estimates (4/4)

  • The sampling series of a density estimate can only be approximated

  • Trade-off between the number of samples and accuracy

    • Define a minimal multidimensional rectangle outside which samples are negligible
    • Define a vector of sampling intervals such that the aliasing error is negligible


The KDEC scheme

    • Computes a local density estimate of its data Dj


The KDEC scheme



The KDEC scheme



The KDEC scheme



The KDEC scheme



The KDEC scheme



Properties of the approach

  • Communication complexity depends only on the number of samples

  • Data objects are never transmitted over the network

  • Local clusters are close to global clusters which can be obtained using DE-cluster

  • Time complexity does not exceed the time complexity of centralized DE-clustering



Window width and sampling frequency

  • Good estimates when h is not less than a small multiple of the smallest distance between objects

  • As , the number of samples rarely exceeds the number of data points



Complexity

  • Site j

    • Sampling: O(q(N) Sam)
    • DE-cluster: O(|Dj|q(Dj))
  • Helper

    • Summation of samples: O(Sam)
  • Communication

    • Time: O(Sam)
    • Volume: O(M Sam)


Complexity (centralized approach)

  • Site j

    • Transmission/Reception of data objects: O(|Dj|)
  • Helper

    • Global DE-clustering: O(N q(N))
  • Communication:

    • Time: O(N)
    • Volume: O(N)


Stationary agent-based KDEC

  • The helper engages site agents to agree on:

    • Kernel function
    • Window width
    • Sampling frequencies
    • Sampling region
  • The global sampled form of the estimate is computed in a single stage



Mobile agent-based KDEC

  • At site Ln the visiting agent:

    • Negotiates kernel function, window width, sampling frequencies, sampling region
    • Carries the sum of samples collected at Lm, min its data space
  • The global sampled form of the estimate is returned to the interested agents



A Hierarchical Scheme

  • Additivity allows to extend the scheme to trees of arbitrary arity

  • Local sampled density estimates are propagated upwards in partial sums, until the global sampled DE is computed at the root and returned to the leaves



Inference and Trustworthiness

  • Inference problem for kernel density estimates

    • Goal of inference attacks: exploit information contained in a density estimate to infer the data objects
  • Trustworthiness of helpers

    • Trustworthy helper no bit of information written to memory by a process for the Helper procedure is sent to a system peripheral by a different process


Inference Attacks on Kernel Density Estimates

  • Let

  • be extensionally equal to a density estimate:

  • For example, g is the reconstructed density estimate (sampling series)



Inference Attacks on Kernel Density Estimates

  • Simple strategy: Search the density estimate or its derivatives for discontinuities

  • Example: The kernel is the square pulse

    • For each pair of projections of objects on an axis there is a pair of projections of discontinuities on that axis having the same distance as the objects’ projections
    • If h is known then the objects can be inferred easily
  • If the kernel has discontinuous derivatives, then the same technique applies to the derivatives



Inference Attacks on Kernel Density Estimates

  • If g is not continuous at x  an object lies at



Inference Attacks on Kernel Density Estimates

  • If the kernel is infinitely differentiable the problem is more difficult

  • Select space objects and attempt to solve a nonlinear system of equations



Attack Scenarios

  • Single-site attack

    • One of the sites attempts to infer the data objects from the global density estimate
    • Unable to associate a specific data object to a specific site
  • Site coalition attack

    • A coalition computes the sum of the density estimates of all the other sites as difference
    • Special case: the coalition includes all sites but one  the attack potentially reveals the data objects at the site


Untrustworthy Helpers

  • Reputation binary random variable with probabilities p and 1-p

    • p is the probability that the helper is untrustworthy
    • If the agent community supports referrals about an agent's reputation as a helper, then an agent might know per agent probabilities


Conclusions

  • Agent technology in DDM

    • Preserves the autonomy of data sources and scalability of the DM step
    • Privacy protection (inherent in most DDM approaches) may be less effective
  • Agent-based distributed density estimate computation & clustering

    • Scalable
    • Implementation based on mobile agents
    • Could be vulnerable to inference attacks on density estimates perpetrated by coalitions


Work in progress (1)

  • Inference attacks on sampled density estimates by solving the corresponding system of NL equations



Work in progress (2)

  • Inference attacks on sampled density estimates by iteratively reducing the dimensionality of the system of NL equations

    • pointhunt algorithm
    • Proceeds iteratively by
      • selecting a point x close to the “border” of the density estimate
      • guessing an object such that the object is the only contributor to density estimate at x


Work in progress (3)

  • Ontology and protocol

  • Algorithms for deliberative participation based on trustworthiness referrals

  • Probabilities that a local density estimate may be learned by another agent

    • Probability that no other agent learns the density estimate
    • Probability that at least k other agents learn the density estimate
    • Probability that exactly k other agents learn the density estimate


Future Work

  • Algorithms for the negotiation of parameters

  • Formalization of errors

    • Bounds on aliasing errors
    • Clustering errors (e.g., using an index of partition difference)






Do'stlaringiz bilan baham:


Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©fayllar.org 2019
ma'muriyatiga murojaat qiling