In recent years, advances in information technology have greatly increased the capability to store and record personal data about customers and individuals. This has led to concerns that personal data may be misused for a variety of purposes. To alleviate these concerns, a number of techniques have recently been proposed for performing data mining tasks in a privacy-preserving way, and the field of privacy-preserving data mining has advanced rapidly as a result. In this thesis, we develop efficient, effective and realistic methods for privacy-preserving data mining, focusing on three core techniques: access control, data anonymization and statistical disclosure control. In Part I, this thesis presents a model for privacy-preserving access control that is based on a variety of purposes. Conditional purpose is applied in the model along with allowed purpose and prohibited purpose, which allows users to use some data for certain purposes under conditions. The structure of the conditional purpose-based access control (CPBAC) model is defined and investigated through a practical paradigm with access purpose and intended purpose. An algorithm is developed to compute compliance between access purposes and intended purposes. With this model, more information can be extracted from data providers while privacy is still assured, which maximizes the usability of consumers’ data. The model extends traditional access control models to cover privacy preservation in the data mining environment. At its core is a new structure for managing collected data in an effective and trustworthy way. This structure helps enterprises to communicate clear privacy promises and to collect and manage user preferences and consent.
Finally, we integrate this model with the conventional, well-known role-based access control (RBAC) model, as RBAC is still the most popular approach to access control for database security and is available in many DBMSs. Applying these mechanisms allows web sites to publish a privacy policy and to implement more nuanced management of usage information and other personal information, which ultimately allows the (legitimate) use of information. In Part II, this thesis presents a systematic clustering-based k-anonymization technique that minimizes information loss while assuring data quality. The proposed technique groups similar data together and then anonymizes each group individually. The structure of the systematic clustering problem is defined and investigated through paradigm and properties. An algorithm for the problem is developed, and its time complexity is shown to be O(n^2/k), where n is the total number of records containing individuals and their private information. Experimental results show that the proposed method compares favorably with respect to both information loss and execution time. An approach is also presented to show how the algorithm can be used with incremental datasets. Finally, we extend the systematic clustering approach to the l-diversity model, which requires that every group of indistinguishable records contain at least l distinct sensitive attribute values. The whole procedure consists of two steps: a clustering step for k-anonymization and an l-diversity step. In Part III, this thesis presents two heuristic algorithms for microdata protection in Statistical Disclosure Control (SDC). The first heuristic microaggregation algorithm, which we refer to as systematic microaggregation for SDC, works by partitioning the microdata into clusters of at least k records in a systematic way and then replacing the records in each cluster with the centroid of the cluster.
The structure of the systematic microaggregation problem is defined and investigated, and an algorithm for the problem is developed. Experimental results show that systematic microaggregation attains a reasonable advantage over the most popular heuristic algorithm, Maximum Distance to Average Vector (MDAV), with respect to both information loss and execution time. It is also shown that systematic microaggregation is highly scalable. The second heuristic algorithm, called pairwise-systematic (P-S) microaggregation, easily captures extreme values in the dataset and works by forming two distant groups simultaneously, each with its corresponding similar records, in a systematic way. Extensive experimental studies are conducted to show the efficiency and the effectiveness of the algorithm. The performance of the P-S algorithm is compared against the most recent microaggregation methods. Experimental results show that the P-S algorithm incurs significantly less information loss than the latest microaggregation methods in all of the test situations. Finally, we propose a new microaggregation method in which the median is used as the centroid. The new method uses a statistical test to guarantee that the microaggregated data and the original data are similar.
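
To make the general microaggregation idea above concrete, the following is a minimal sketch and not the systematic or P-S algorithm of this thesis. It assumes univariate data, forms groups of at least k records simply by sorting, and replaces each record with its group centroid, using either the mean or the median. The function names `microaggregate` and `information_loss`, and the squared-error ratio used as a loss measure, are illustrative assumptions.

```python
from statistics import mean, median


def microaggregate(values, k, centroid=mean):
    """Illustrative univariate microaggregation (hypothetical helper).

    Sorts the records, cuts the sorted order into consecutive groups of
    k records (the last group absorbs any remainder, so every group has
    at least k records), and replaces each record by its group centroid.
    """
    if k < 1 or len(values) < k:
        raise ValueError("need k >= 1 and at least k records")
    # Indices of the records in ascending order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_groups = len(values) // k
    protected = [None] * len(values)
    for g in range(n_groups):
        start = g * k
        # The final group takes all remaining records.
        end = (g + 1) * k if g < n_groups - 1 else len(values)
        members = order[start:end]
        c = centroid([values[i] for i in members])
        for i in members:
            protected[i] = c
    return protected


def information_loss(original, protected):
    """Assumed loss measure: squared error introduced by the replacement,
    divided by the total squared deviation of the original data."""
    grand_mean = mean(original)
    sse = sum((x - y) ** 2 for x, y in zip(original, protected))
    sst = sum((x - grand_mean) ** 2 for x in original)
    return sse / sst if sst else 0.0


if __name__ == "__main__":
    data = [1, 2, 3, 10, 11, 12, 13]
    by_mean = microaggregate(data, k=3)
    by_median = microaggregate(data, k=3, centroid=median)
    print(by_mean, information_loss(data, by_mean))
    print(by_median, information_loss(data, by_median))
```

Passing `centroid=median` corresponds to the variant in which the median is used as the centroid; larger values of k give stronger protection at the cost of a higher loss value.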