The MIT Laboratory for Information and Decision Systems has issued an interesting news release arguing that artificial data can give the same results as real data without compromising privacy.
I have worked in the data and big data arena for many years, so creating data to test software I have developed or systems I have built is nothing new. I have a large number of scripts that create test data, and I am not alone in this: most developers have scripts of their own for the same purpose. In addition, a number of companies, Grid Tools for example, provide test data as either a free or a paid-for service.
Test data exists simply to test the system. It shows where the cracks are, for example whether the fields in the database are set to the correct width and type, and it can be used to verify that a given set of inputs produces the expected result. Overall, the purpose of test data is to test functionality.
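To make the idea concrete, here is a minimal sketch of the kind of script I mean. The schema, column names and widths are invented for illustration and do not come from any real system; the script generates random rows and then checks each one for the wrong type or an over-long string.

```python
import random
import string

# Hypothetical schema: column name -> (Python type, maximum width for strings)
SCHEMA = {
    "customer_id": (int, None),
    "surname": (str, 20),
    "postcode": (str, 8),
}

def make_test_row(rng, overflow=False):
    """Build one random row; with overflow=True, strings exceed their width by one."""
    row = {}
    for column, (col_type, width) in SCHEMA.items():
        if col_type is int:
            row[column] = rng.randint(1, 10**6)
        else:
            length = width + 1 if overflow else rng.randint(1, width)
            row[column] = "".join(rng.choices(string.ascii_uppercase, k=length))
    return row

def violations(row):
    """Return the columns whose value has the wrong type or is too wide."""
    bad = []
    for column, (col_type, width) in SCHEMA.items():
        value = row[column]
        if not isinstance(value, col_type):
            bad.append(column)
        elif width is not None and len(value) > width:
            bad.append(column)
    return bad

rng = random.Random(42)  # fixed seed so the test data is repeatable
good_rows = [make_test_row(rng) for _ in range(100)]
bad_row = make_test_row(rng, overflow=True)

print(sum(1 for r in good_rows if violations(r)), violations(bad_row))
```

Data like this is good at finding cracks in the plumbing, but the values are pure noise, so there is nothing in them for a data scientist to model.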
That is where this new tool from MIT really stands out from the crowd, as it appears to do much more. Not only can the synthetic data test the functionality of a system but, according to MIT, data scientists can also use it to build models and get real meaning from the dataset without exposing potentially revealing private information.
The approach is to model the existing database using machine learning. The model captures correlations between multiple fields so that the synthetic data is more realistic. The structure of the source database is also modelled, so that the relationships between data elements are the same in the real and the synthetic data. The tests carried out by the team showed no real difference in performance between real and synthetic data. However, the press release does not discuss the accuracy of the synthetic data against the real data, and this is something I’d like to see explored further.
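The press release does not give implementation details, so the following is not MIT's method. It is only a toy sketch of the general idea, under my own assumptions: a made-up two-column table (age and income), summarised by means, standard deviations and a single correlation, from which new rows are drawn using a bivariate normal distribution.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def stdev(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

def fit(xs, ys):
    """Summarise two columns by their means, spreads and correlation."""
    return mean(xs), stdev(xs), mean(ys), stdev(ys), pearson(xs, ys)

def sample(model, n, rng):
    """Draw n synthetic (x, y) pairs from a bivariate normal with the fitted summary."""
    mx, sx, my, sy, r = model
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mixing z1 into y is what carries the correlation across.
        y = my + sy * (r * z1 + math.sqrt(1 - r * r) * z2)
        rows.append((x, y))
    return rows

rng = random.Random(0)
# Made-up "real" table: age, and an income that rises with age plus noise.
real_age = [rng.uniform(20, 70) for _ in range(2000)]
real_income = [1000 * a + rng.gauss(0, 10000) for a in real_age]

model = fit(real_age, real_income)
synthetic = sample(model, 5000, rng)
syn_age = [x for x, _ in synthetic]
syn_income = [y for _, y in synthetic]

# The two correlations should be close, yet no synthetic row is a real row.
print(round(pearson(real_age, real_income), 2), round(pearson(syn_age, syn_income), 2))
```

A real database has many more fields, categorical values and links between tables, which is presumably where the machine learning comes in. The sketch only shows why a model that preserves correlations can produce rows that behave like the originals without copying any of them.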
The standard approach to building meaningful models with sensitive data is to use anonymisation techniques to remove sensitive or identifying elements. Once the data has been anonymised, what remains can be used to create a model. How the standard approach and the MIT approach compare in the real world is still to be tested, so watch this space.
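For comparison, a very simplified sketch of the standard approach might look like the following. The field names and the record are invented, and real anonymisation involves far more than this: the sketch only drops direct identifiers and coarsens two fields that could identify someone in combination.

```python
# Hypothetical set of fields that identify a person directly
DIRECT_IDENTIFIERS = {"name", "email"}

def anonymise(record):
    """Drop direct identifiers and coarsen age and postcode in one record."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    decade = (cleaned["age"] // 10) * 10
    cleaned["age"] = f"{decade}-{decade + 9}"               # exact age -> ten-year band
    cleaned["postcode"] = cleaned["postcode"].split()[0]    # keep the outward part only
    return cleaned

# Made-up record for illustration
record = {
    "name": "A. Example",
    "email": "a@example.com",
    "age": 47,
    "postcode": "AB1 2CD",
    "spend": 120.5,
}

print(anonymise(record))
```

Unlike the synthetic approach, every row that comes out of this still corresponds to a real person, just with less detail attached.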
Blog post by Richard Skeggs (Business Data Development Manager at the Business and Local Government Data Research Centre). Please email us if you have any questions about the contents of this post.
Published 19 April 2017