Resampling methods for generating continuous multivariate synthetic data for disclosure control
Article
Article Title | Resampling methods for generating continuous multivariate synthetic data for disclosure control |
---|---|
ERA Journal ID | 213192 |
Article Category | Article |
Authors | Khan, Atikur R. (Author) and Kabir, Enamul (Author) |
Journal Title | Journal of Data, Information and Management |
Journal Citation | 3, pp. 225-235 |
Number of Pages | 11 |
Year | 2021 |
Publisher | Springer |
Place of Publication | Germany |
ISSN | 2524-6356; 2524-6364 |
Digital Object Identifier (DOI) | https://doi.org/10.1007/s42488-021-00054-2 |
Web Address (URL) | https://link.springer.com/article/10.1007%2Fs42488-021-00054-2 |
Abstract | Sharing microdata within or outside an organization may lead to the disclosure of sensitive information about an individual. Data stewarding organizations often disseminate synthetic data to reduce the likelihood of such disclosure. Synthetic data can be generated from posterior predictive distributions; however, finding a distribution in multidimensional space is not straightforward. If a distribution function is correctly estimated, synthetic data generated from the estimated distribution will hold all statistical properties of the original data. In practice, distribution functions are unknown, and estimating a distribution function under some assumptions may result in a synthetic data set that does not hold the statistical properties of the original data. This paper develops synthetic data generating methods based on resampling from singular vectors and eigenvalues, without requiring estimation of a posterior predictive distribution function for the data matrix. The methods developed in this paper have been implemented to generate continuous multivariate synthetic data, and their performance is studied by comparing disclosure risk and information loss measures. A rectangular cuboid is also constructed from the lower quartiles of the information loss and disclosure risk measures, and selecting synthetic data from within this cuboid is found to further reduce the disclosure risk and information loss of these methods. |
Keywords | synthetic data; disclosure control |
ANZSRC Field of Research 2020 | 460599. Data management and data science not elsewhere classified |
Public Notes | Files associated with this item cannot be displayed due to copyright restrictions. |
Byline Affiliations | North South University, Bangladesh; School of Sciences |
Institution of Origin | University of Southern Queensland |
https://research.usq.edu.au/item/q6qx1/resampling-methods-for-generating-continuous-multivariate-synthetic-data-for-disclosure-control
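
The abstract describes resampling from singular vectors and selecting synthetic data sets from a region bounded by lower quartiles, but this record does not give the algorithms themselves. The following is only a minimal sketch of one plausible reading, not the authors' method: it assumes the data matrix is centred, decomposed with an SVD, and that the entries of each left singular vector are bootstrapped independently before the matrix is rebuilt. The function names `svd_resample` and `in_lower_quartile_box`, and the idea that each candidate data set is scored by a row of "lower is better" measures, are hypothetical and introduced here purely for illustration.

```python
import numpy as np


def svd_resample(X, seed=None):
    """Hypothetical sketch: build a synthetic matrix by resampling singular vectors.

    Assumption (not taken from the paper): each left singular vector is
    bootstrapped entry-wise, independently of the others, and the matrix is
    rebuilt with the original singular values and right singular vectors.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    center = X.mean(axis=0)

    # Thin SVD of the centred data: X - center = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(X - center, full_matrices=False)
    n, k = U.shape

    # Draw n entries with replacement from each column of U.
    U_star = np.column_stack(
        [rng.choice(U[:, j], size=n, replace=True) for j in range(k)]
    )

    # Rebuild the data matrix and restore the column means.
    return (U_star * s) @ Vt + center


def in_lower_quartile_box(measures):
    """Hypothetical sketch: flag candidates inside the lower-quartile box.

    `measures` has one row per candidate synthetic data set and one column
    per information-loss or disclosure-risk measure (lower is better).
    A candidate is kept only if every measure is at or below that
    measure's first quartile across all candidates.
    """
    measures = np.asarray(measures, dtype=float)
    q1 = np.quantile(measures, 0.25, axis=0)
    return np.all(measures <= q1, axis=1)
```

In this sketch, one would generate many candidates with different seeds, score each with whichever information-loss and disclosure-risk measures are in use, and keep those flagged by `in_lower_quartile_box`. With three measures the retained region is a rectangular cuboid, which matches the abstract's wording, though the paper's exact construction may differ.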