In recent years, rapid developments in information technology have increased the capability to store and record personal data about customers and individuals. This has led to concerns that personal data may be misused for a variety of purposes. To alleviate these concerns, a number of techniques have recently been proposed for performing data mining tasks in a privacy-preserving way. The field of privacy has thus seen rapid advances in recent years, as developments in the data mining environment have led to increased concerns about privacy. In this thesis, we develop efficient, effective and realistic methods in the privacy-preserving data mining field, focusing on three core techniques, namely access control, data anonymization and statistical disclosure control.
In Part I, this thesis presents a model for privacy-preserving access control that is based on a variety of purposes. Conditional purpose is applied along with allowed purpose and prohibited purpose in the model, which allows users to use some data for certain purposes under conditions. The structure of the conditional purpose-based access control (CPBAC) model is defined and investigated through a practical paradigm with access purpose and intended purpose. An algorithm is developed to compute compliance between access purposes and intended purposes. With this model, more information can be extracted from data providers while privacy is still assured, which maximizes the usability of consumers’ data. The model extends traditional access control models to cover privacy preservation in the data mining environment. At its core is a new structure for managing collected data in an effective and trustworthy way. This structure helps enterprises to circulate clear privacy promises and to collect and manage user preferences and consent.
Finally, we integrate this model with the conventional, well-known role-based access control (RBAC) model, as RBAC is still the most popular approach towards access
implement more nuanced management of usage information and other personal information, ultimately allowing (legitimate) use of information.
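The compliance computation between access purposes and intended purposes described in Part I can be illustrated with a minimal sketch. It assumes a flat set of purposes with no purpose hierarchy, and it assumes that a prohibited purpose takes precedence over a conditional one, which in turn takes precedence over an allowed one. The names `IntendedPurpose` and `check_compliance` are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntendedPurpose:
    """Purposes a data provider attaches to a data item (hypothetical layout)."""
    allowed: frozenset = frozenset()
    conditional: frozenset = frozenset()
    prohibited: frozenset = frozenset()


def check_compliance(access_purpose: str, intended: IntendedPurpose) -> str:
    """Return 'allowed', 'conditional' or 'denied' for one access purpose.

    Assumption: prohibited overrides conditional, conditional overrides
    allowed, and a purpose that is not listed at all is denied.
    """
    if access_purpose in intended.prohibited:
        return "denied"
    if access_purpose in intended.conditional:
        return "conditional"
    if access_purpose in intended.allowed:
        return "allowed"
    return "denied"


# Illustrative preferences for a single attribute (made-up purpose names).
email_purpose = IntendedPurpose(
    allowed=frozenset({"billing"}),
    conditional=frozenset({"analysis"}),
    prohibited=frozenset({"marketing"}),
)
```

Under these assumptions, a request with access purpose "analysis" would be granted only with conditions attached (for example, on a generalized form of the data), while "marketing" would be refused.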
In Part II, this thesis presents a systematic clustering-based k-anonymization technique that minimizes information loss while assuring data quality. The proposed technique adopts a systematic scheme to group similar data together and then anonymizes each group individually. The structure of the systematic clustering problem is defined and investigated through paradigm and properties. An algorithm for the proposed problem is developed, and its time complexity is shown to be O(n^2/k), where n is the total number of records containing individuals and their private information. Experimental results show that the proposed method attains a reasonable advantage with respect to both information loss and execution time. An approach is also presented to illustrate the usability of the algorithm for incremental datasets. Finally, we extend the systematic clustering approach to the l-diversity model, which requires that every group of indistinguishable records contain at least l distinct sensitive attribute values. The whole procedure consists of two steps, namely a clustering step for k-anonymization and an l-diverse step.
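The two conditions named in Part II can be stated as simple checks on a grouping of records. The sketch below only verifies the conditions on groups that are already formed; it does not reproduce the systematic clustering algorithm of the thesis. The record layout (dictionaries with a `disease` field as the sensitive attribute) and the function names are assumptions made for illustration.

```python
def is_k_anonymous(groups, k):
    """True if every group of indistinguishable records has at least k records."""
    return all(len(group) >= k for group in groups)


def is_l_diverse(groups, l, sensitive="disease"):
    """True if every group holds at least l distinct sensitive attribute values."""
    return all(len({rec[sensitive] for rec in group}) >= l for group in groups)


# Two illustrative groups whose quasi-identifiers are already generalized.
groups = [
    [
        {"age": "20-29", "zip": "43**", "disease": "flu"},
        {"age": "20-29", "zip": "43**", "disease": "asthma"},
    ],
    [
        {"age": "30-39", "zip": "45**", "disease": "flu"},
        {"age": "30-39", "zip": "45**", "disease": "flu"},
        {"age": "30-39", "zip": "45**", "disease": "diabetes"},
    ],
]
```

With this data, both groups satisfy k = 2 and l = 2, but the first group has only two records, so the grouping fails for k = 3.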
In Part III, this thesis presents two heuristic algorithms for microdata protection in Statistical Disclosure Control (SDC). The first heuristic microaggregation algorithm partitions the microdata into clusters of at least k records in a systematic way and then replaces the records in each cluster with the centroid of the cluster; we refer to this as systematic microaggregation for SDC. The structure of the systematic microaggregation problem is defined and investigated, and an algorithm for the proposed problem is developed. Experimental results show that systematic microaggregation attains a reasonable advantage over the most popular heuristic algorithm, Maximum Distance to Average Vector (MDAV), with respect to both information loss and execution time. Finally, it is shown that systematic microaggregation is highly scalable.
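The replacement step of microaggregation, in which each record is replaced by the centroid of its cluster, can be sketched as follows. The partition into clusters is taken as given here; how the thesis builds it systematically is not reproduced. The function names, the use of the mean as centroid, and the sum of squared errors as an information-loss measure are assumptions made for this illustration.

```python
def centroid(rows):
    """Component-wise mean of a non-empty list of numeric records."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def microaggregate(records, clusters, k):
    """Replace each record by the centroid of its cluster.

    `clusters` is a list of lists of record indices that is assumed to
    partition `records`; every cluster must contain at least k records.
    """
    if any(len(cluster) < k for cluster in clusters):
        raise ValueError("every cluster must contain at least k records")
    masked = [None] * len(records)
    for cluster in clusters:
        center = centroid([records[i] for i in cluster])
        for i in cluster:
            masked[i] = list(center)
    return masked


def sse(original, masked):
    """Sum of squared differences between original and masked records."""
    return sum(
        (a - b) ** 2
        for orig_row, masked_row in zip(original, masked)
        for a, b in zip(orig_row, masked_row)
    )


records = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [12.0, 14.0]]
masked = microaggregate(records, clusters=[[0, 1], [2, 3]], k=2)
```

A partition that keeps similar records together yields a lower squared error, which is why the choice of clustering heuristic drives the information loss reported in the experiments.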
The second heuristic algorithm, called pairwise-systematic (P-S) microaggregation, easily captures extreme values in the dataset and works by forming two distant groups at a time, each together with its corresponding similar records, in a systematic way. Extensive experimental studies are conducted to show the efficiency and effectiveness of the algorithm. The performance of the P-S algorithm is compared against the most recent microaggregation methods. Experimental results show that the P-S algorithm incurs significantly less information loss than the latest microaggregation methods in all of the test situations. Finally, we propose a new microaggregation method in which the median is taken as the centroid. Using a statistical test, the new method guarantees that the microaggregated data and the original data are similar.