The Role of Agents in Distributed Data Mining: Issues and Benefits
Josenildo Costa da Silva 1, Matthias Klusch 1, Stefano Lodi 2, Gianluca Moro 2, Claudio Sartori 2
1 Deduction and Multiagent Systems, German Research Center for Artificial Intelligence, Saarbruecken, Germany
2 Department of Electronics, Computer Science and Systems, University of Bologna, Bologna, Italy
Distributed Data Mining (DDM)
Data sets
- Massive
- Inherently distributed
Networks
- Limited bandwidth
- Limited computing resources at nodes
Privacy and security
- Sensitive data
- Share goals, not data
Centralized solution
Apply traditional DM algorithms to data retrieved from different sources and stored in a data warehouse
May be impractical or even impossible in some business settings
- Autonomy of data sources
- Data privacy
- Scalability (~TB/day)
Agents and DDM
DDM exploits distributed processing and problem decomposability
Is there any real added value in using concepts from agent technology in DDM?
- Few DDM algorithms use agents
- Evidence that cooperation among distributed DM processes may allow effective mining even without centralized control
- Autonomy, adaptivity, and deliberative reasoning fit naturally into the DDM framework
State of the Art
BODHI
- Mobile agent platform/framework for collective DM on heterogeneous sites
PADMA
- Clustering of homogeneous sites
- Agent-based text classification/visualization
JAM
- Meta-learning, classifiers
Papyrus
- Wide-area DDM over clusters
- Moves data/models/results to minimize network load
Agents for DDM (pros)
- Autonomy of data sources
- Scalability of DM to massive distributed data
- Multi-strategy DDM
- Collaborative DM
Agents for DDM (cons)
Need to enforce minimal privileges at a data source
- Unsolicited access to sensitive data
- Eavesdropping
- Data tampering
- Denial-of-service attacks
The Inference Problem
Studied for statistical databases since the mid-1970s
Integration/aggregation at the summary level is inherent in DDM
- Sensitive data can be inferred even from partial integration, to a certain extent and with some probability (the inference problem)
- Existing DDM systems are not capable of coping with the inference problem
Data Clustering
A popular problem in
- Statistics (cluster analysis)
- Pattern recognition
- Data mining
Goal: decompose a multivariate data set into groups of objects
DEclustering
Clustering based on nonparametric density estimation
- Construct an estimate of the probability density function from the data set
- Objects "attracted" by the same local maximum of the estimate belong to the same cluster
Kernel Density Estimation
- The higher the number of data objects in the neighbourhood of x, the higher the density at x
- A data object xi exerts more influence on the value of the estimate at x than any data object farther from x than xi
- The influence of data objects is radial
Formalizing Density Estimators
The density estimate at a space point x is proportional to a sum of weights, one weight for every data object xi
Each weight is a monotonically decreasing function (the kernel K) of the distance between x and xi, scaled by a factor h (the window width):
  phi[S](x) = sum_i K( d(x, xi) / h )
(normalization constants can be dropped, since only the shape of the estimate matters for clustering)
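As an illustration, a minimal one-dimensional sketch of such a density estimator, assuming a Gaussian kernel and dropping normalization constants (only the shape of the estimate matters for clustering):

```python
import numpy as np

def kernel(u):
    """Gaussian kernel; normalization omitted, only the shape matters here."""
    return np.exp(-0.5 * u ** 2)

def density_estimate(x, data, h):
    """Kernel density estimate at x: one weight per data object,
    decreasing with the distance between x and the object, scaled by h."""
    return sum(kernel(abs(x - xi) / h) for xi in data)

data = [1.0, 1.2, 0.9, 5.0, 5.1]   # two well-separated groups
h = 0.5
# The estimate is higher inside a dense group than in the gap between groups.
print(density_estimate(1.0, data, h) > density_estimate(3.0, data, h))  # True
```

Objects hill-climbing on this estimate toward the maxima near 1 and 5 would form two clusters, which is the DEclustering idea from the previous slide.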
Kernel Functions
Example (1/2)
Example (2/2)
Distributed Data Clustering (1/2)
Clustering algorithm A( )
Homogeneous distributed data clustering problem for A:
- Data set S
- Sites Lj, j = 1, ..., M
- Lj stores data set Dj with S = D1 ∪ ... ∪ DM
Distributed Data Clustering (2/2)
Problem: find a clustering Cj in the data space of Lj such that:
- Cj agrees with A(S) (correctness requirement)
- Time/communication costs are minimized (efficiency requirement)
- The size of data transferred out of the data space of any Lj is minimized (privacy requirement)
Traditional (centralized) solution
- Gather all local data sets into one centralized repository (e.g., a data warehouse)
- Run A( ) on the centralized data set
- Privacy requirement unsatisfied
- Efficiency requirement unsatisfied for some A( )
Sampling
Represent every member f of a class of functions as a sampling series
  f(x) = sum_n f(xn) psi_n(x)
where:
- { xn } is a collection of points of the domain (the sampling points)
- { psi_n } is some set of suitable expansion functions
Example
The class of polynomials of degree 1
- Sampling points: any two distinct points x0, x1
- Expansion functions: the Lagrange basis psi0(x) = (x - x1)/(x0 - x1), psi1(x) = (x - x0)/(x1 - x0)
The sampling series is a finite sum: f(x) = f(x0) psi0(x) + f(x1) psi1(x)
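A hedged sketch of such a finite sampling series for degree-1 polynomials; the sampling points 0 and 1 and the Lagrange expansion functions are illustrative choices, not fixed by the slides:

```python
# Reconstructing a degree-1 polynomial from two samples (a finite sampling
# series).  x0, x1 are illustrative sampling points.
x0, x1 = 0.0, 1.0

def psi0(x):  # expansion function attached to the sample at x0
    return (x - x1) / (x0 - x1)

def psi1(x):  # expansion function attached to the sample at x1
    return (x - x0) / (x1 - x0)

def f(x):     # any polynomial of degree 1
    return 3.0 * x - 2.0

def reconstruct(x):
    # Finite sampling series: f(x0)*psi0(x) + f(x1)*psi1(x)
    return f(x0) * psi0(x) + f(x1) * psi1(x)

print(abs(reconstruct(7.5) - f(7.5)) < 1e-12)  # True: exact for degree <= 1
```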
Bandlimited Functions
Let f be a function of one real variable
The range of frequencies of f is the support of the Fourier transform of f
Any function whose range of frequencies is confined to a bounded set B is called bandlimited to B (the band region)
Example: the sinc function, sinc(x) = sin(pi x)/(pi x), is bandlimited to [-1/2, 1/2]
Sampling Theorem
If f is bandlimited with band region B ⊆ [-w, w], then
  f(x) = sum_{n in Z} f(n / (2w)) sinc(2w x - n)
Sampling Theorem (scaled multidimensional version)
Let
- tau = (tau_1, ..., tau_d) be a vector of sampling periods
- B = [-1/(2 tau_1), 1/(2 tau_1)] × ... × [-1/(2 tau_d), 1/(2 tau_d)]
- x_j denote the j-th component of a vector x
If f is bandlimited to B then
  f(x) = sum_{n in Z^d} f(n_1 tau_1, ..., n_d tau_d) prod_{j=1..d} sinc(x_j / tau_j - n_j)
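The one-dimensional theorem can be checked numerically. The sketch below (our own example) takes f to be a finite combination of shifted sinc pulses, which is bandlimited to [-1/2, 1/2], samples it at the integers, and reconstructs it with a truncated cardinal series; note that np.sinc(x) computes sin(pi x)/(pi x):

```python
import numpy as np

# f is bandlimited to [-1/2, 1/2]: a finite combination of shifted sinc pulses.
def f(x):
    return 2.0 * np.sinc(x - 1.0) + 3.0 * np.sinc(x + 2.0)

def reconstruct(x, n_max=50):
    # Cardinal (sampling) series f(x) = sum_n f(n) sinc(x - n),
    # truncated to |n| <= n_max.
    n = np.arange(-n_max, n_max + 1)
    return float(np.sum(f(n) * np.sinc(x - n)))

x = 0.3
print(abs(reconstruct(x) - f(x)) < 1e-9)  # True
```

For this f the integer samples are essentially 2 at n = 1 and 3 at n = -2 and zero elsewhere, so the truncated series is exact up to floating-point noise; for general bandlimited functions the truncation introduces the errors discussed on the next slides.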
Sampling Density Estimates (1/4)
Additivity of density estimates of a distributed data set:
  phi[S](x) = phi[D1](x) + ... + phi[DM](x)
Sampling Density Estimates (2/4)
Sampling is linear, so summing the sampled local estimates yields the sampled global estimate
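Additivity is immediate for unnormalized kernel estimates, since the sum over the global data set splits into sums over the local data sets; a minimal numerical check (our own example):

```python
import numpy as np

def estimate(x, data, h):
    """Unnormalized Gaussian kernel density estimate (normalization affects
    neither additivity nor the location of the maxima)."""
    u = (x - np.asarray(data)) / h
    return float(np.exp(-0.5 * u ** 2).sum())

h = 0.4
site1 = [1.0, 1.3, 0.8]       # data held at site L1
site2 = [4.9, 5.2]            # data held at site L2
global_data = site1 + site2   # S = D1 U D2 (never materialized in KDEC)

x = 2.7
local_sum = estimate(x, site1, h) + estimate(x, site2, h)
print(abs(local_sum - estimate(x, global_data, h)) < 1e-12)  # True
```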
Sampling Density Estimates (3/4)
Truncation errors
- The support of a kernel function is not bounded in general
Aliasing errors
- The support of the Fourier transform of a kernel function is not bounded in general: kernel functions are not bandlimited
Sampling Density Estimates (4/4)
The sampling series of a density estimate can only be approximated
Trade-off between the number of samples and accuracy:
- Define a minimal multidimensional rectangle outside which samples are negligible
- Define a vector of sampling intervals such that the aliasing error is negligible
The KDEC scheme
Every site Lj:
- Computes a local density estimate of its data Dj
- Samples the local estimate and sends the samples to a helper
The helper:
- Sums the local samples into the global sampled form of the estimate and returns it to the sites
Every site then DEclusters its local data using the global estimate
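A minimal sketch of the message flow, assuming a shared one-dimensional sampling grid and Gaussian kernels; the SiteAgent/Helper names are hypothetical, and negotiation, reconstruction of the continuous estimate, and the actual DEclustering step are omitted:

```python
import numpy as np

def local_estimate(xs, data, h):
    # Unnormalized Gaussian kernel density estimate, evaluated on the grid xs.
    u = (xs[:, None] - np.asarray(data)[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1)

class SiteAgent:
    def __init__(self, data):
        self.data = data                          # never leaves the site
    def samples(self, grid, h):
        return local_estimate(grid, self.data, h)  # only samples are sent

class Helper:
    def global_samples(self, all_samples):
        # Summation of samples: additivity of density estimates.
        return np.sum(all_samples, axis=0)

# Parameters agreed upon in advance (negotiated in the real scheme).
grid = np.linspace(0.0, 7.0, 71)   # sampling region and intervals
h = 0.4                            # window width

sites = [SiteAgent([1.0, 1.2, 0.9, 1.05]), SiteAgent([5.0, 5.1, 4.8])]
helper = Helper()
global_est = helper.global_samples([s.samples(grid, h) for s in sites])

# Each site can now cluster locally against the global estimate.
print(grid[np.argmax(global_est)])  # density peak near 1.0
```

The point of the sketch is that only `Sam` grid samples per site cross the network, never the data objects themselves, which is where the efficiency and privacy properties on the next slide come from.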
Properties of the approach
- Communication complexity depends only on the number of samples
- Data objects are never transmitted over the network
- Local clusters are close to the global clusters which can be obtained using DEcluster
- Time complexity does not exceed the time complexity of centralized DEclustering
- Good estimates when h is not less than a small multiple of the smallest distance between objects
- In that case, the number of samples rarely exceeds the number of data points
Complexity
Site j
- Sampling: O(q(N) Sam)
- DEcluster: O(|Dj| q(|Dj|))
Helper
- Summation of samples: O(Sam)
Communication
- Time: O(Sam)
- Volume: O(M Sam)
(Sam: number of samples; M: number of sites; N = |S|)
Complexity (centralized approach)
Site j
- Transmission/reception of data objects: O(|Dj|)
Helper
- Global DEclustering: O(N q(N))
Communication
- Time and volume: O(N), since all data objects are transferred
Stationary agent-based KDEC
The helper engages site agents to agree on:
- Kernel function
- Window width
- Sampling frequencies
- Sampling region
The global sampled form of the estimate is computed in a single stage
Mobile agent-based KDEC
At site Ln the visiting agent:
- Negotiates kernel function, window width, sampling frequencies, sampling region
- Carries the sum of the samples collected at the sites Lm, m < n, in its data space
The global sampled form of the estimate is returned to the interested agents
A Hierarchical Scheme
- Additivity allows extending the scheme to trees of arbitrary arity
- Local sampled density estimates are propagated upwards in partial sums, until the global sampled DE is computed at the root and returned to the leaves
Inference and Trustworthiness
Inference problem for kernel density estimates
- Goal of inference attacks: exploit information contained in a density estimate to infer the data objects
Trustworthiness of helpers
- A helper is trustworthy if no bit of information written to memory by a process for the Helper procedure is sent to a system peripheral by a different process
Inference Attacks on Kernel Density Estimates
Let g be extensionally equal to a density estimate: g(x) = phi[S](x) for every x
For example, g is the reconstructed density estimate (sampling series)
Inference Attacks on Kernel Density Estimates
Simple strategy: search the density estimate or its derivatives for discontinuities
Example: the kernel is the square pulse
- For each pair of projections of objects on an axis there is a pair of projections of discontinuities on that axis having the same distance as the objects' projections
- If h is known, then the objects can be inferred easily
If the kernel has discontinuous derivatives, then the same technique applies to the derivatives
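A one-dimensional sketch of this attack (our own example): with the square-pulse kernel the estimate jumps exactly at xi - h and xi + h, so for non-overlapping pulses the midpoints of consecutive jump pairs recover the objects:

```python
import numpy as np

# Square-pulse kernel estimate in 1D: each object xi contributes 1 on [xi-h, xi+h].
def estimate(xs, data, h):
    return sum((np.abs(xs - xi) <= h).astype(float) for xi in data)

h = 0.5
secret = [1.0, 4.0]                      # objects an attacker tries to infer
xs = np.linspace(0.0, 6.0, 601)          # fine grid over the data space
g = estimate(xs, secret, h)

# Discontinuities of g sit at xi - h and xi + h; locate them as grid cells
# where the estimate jumps.
jump_idx = np.nonzero(np.diff(g))[0]
jumps = (xs[jump_idx] + xs[jump_idx + 1]) / 2.0

# With non-overlapping pulses, consecutive jump pairs bracket one object each,
# and the midpoint of each pair recovers it (here h need not even be known).
inferred = [float((jumps[2 * k] + jumps[2 * k + 1]) / 2.0)
            for k in range(len(jumps) // 2)]
print(inferred)   # approximately [1.0, 4.0]
```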
Inference Attacks on Kernel Density Estimates
If g is not continuous at x, then (for the square-pulse kernel) an object lies at x - h or at x + h
Inference Attacks on Kernel Density Estimates
Select space points x1, ..., xk and attempt to solve a nonlinear system of equations g(xj) = sum_i K(d(xj, xi)/h) for the unknown objects xi
Attack Scenarios
Single-site attack
- One of the sites attempts to infer the data objects from the global density estimate
- Unable to associate a specific data object with a specific site
Site-coalition attack
- A coalition computes, as a difference, the sum of the density estimates of all the other sites
- Special case: if the coalition includes all sites but one, the attack potentially reveals the data objects at the remaining site
Untrustworthy Helpers
Reputation: a binary random variable with probabilities p and 1 - p
- p is the probability that the helper is untrustworthy
- If the agent community supports referrals about an agent's reputation as a helper, then an agent might know per-agent probabilities
Conclusions
Agent technology in DDM
- Preserves the autonomy of data sources and the scalability of the DM step
- Privacy protection (inherent in most DDM approaches) may be less effective
Agent-based distributed density estimate computation & clustering
- Scalable
- Implementation based on mobile agents
- Could be vulnerable to inference attacks on density estimates perpetrated by coalitions
Work in progress (1)
Inference attacks on sampled density estimates by solving the corresponding system of nonlinear equations
Work in progress (2)
Inference attacks on sampled density estimates by iteratively reducing the dimensionality of the system of nonlinear equations ("point-hunt" algorithm)
- Proceeds iteratively by
  - selecting a point x close to the "border" of the density estimate
  - guessing an object such that the object is the only contributor to the density estimate at x
Work in progress (3)
Ontology and protocol
Algorithms for deliberative participation based on trustworthiness referrals
Probabilities that a local density estimate may be learned by another agent:
- Probability that no other agent learns the density estimate
- Probability that at least k other agents learn the density estimate
- Probability that exactly k other agents learn the density estimate
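Under the simplifying assumption (ours, not stated in the slides) that each of n other agents learns the local estimate independently with the same probability p, the three quantities above are binomial probabilities:

```python
from math import comb

def p_exactly(n, k, p):
    # P(exactly k of n other agents learn the local estimate)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def p_none(n, p):
    return p_exactly(n, 0, p)          # (1 - p)^n

def p_at_least(n, k, p):
    return sum(p_exactly(n, j, p) for j in range(k, n + 1))

n, p = 5, 0.1
print(round(p_none(n, p), 4))          # 0.5905
print(round(p_at_least(n, 2, p), 4))   # 0.0815
```

Per-agent probabilities learned through referrals would replace the single p with one value per agent, turning the binomial into a sum over subsets (Poisson-binomial).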
Future Work
Algorithms for the negotiation of parameters
Formalization of errors
- Bounds on aliasing errors
- Clustering errors (e.g., using an index of partition difference)