Which data do you use for software tests in a complex landscape when you have to test the behavior of an application cluster? When do you prefer anonymized data, and when synthetic data?
Synthetic data are our preferred choice for software development and isolated tests. However, with more complex test cases we run into the limits of synthetic data. First of all, they are idealized and error-free, not organically grown like data in a production environment. Only production data reflect the full spectrum of the system landscape's evolution, with all its mistakes and inconsistencies. Secondly, it is difficult to generate a sufficient amount of consistent test data across dozens of applications, which more complex test cases require. Here you would use anonymized data. But anonymized data are not truly anonymous. And the data volume is growing exponentially, which makes it more and more difficult to run the anonymization process in a timely and economical manner.
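To make the tension concrete, here is a minimal sketch (all names and the salt are hypothetical, not from any specific tool we use) of deterministic pseudonymization: hashing a real identifier gives stable pseudonyms, so references stay consistent across applications, but the same determinism lets anyone with a candidate list re-identify records, which is one reason anonymized data are not anonymous.

```python
import hashlib

SALT = b"example-salt"  # hypothetical shared secret for illustration

def pseudonymize(value: str) -> str:
    """Map a real identifier to a stable pseudonym.

    The same input always yields the same output, which keeps foreign-key
    references consistent across applications after anonymization.
    """
    digest = hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()
    return "cust_" + digest[:12]

# Consistency: two systems pseudonymizing the same customer agree,
# so cross-application test cases still join correctly.
a = pseudonymize("alice@example.com")
b = pseudonymize("alice@example.com")
print(a == b)  # True

# Linkage attack: with the salt and a list of plausible identifiers,
# records can be re-identified by brute force.
candidates = ["bob@example.com", "alice@example.com"]
matches = [c for c in candidates if pseudonymize(c) == a]
print(matches)
```

Techniques such as random token mapping avoid this particular attack but then require a lookup table shared by every application, which is exactly what becomes hard to manage at scale.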
What are your experiences? In which test scenarios do you use which kind of test data? Where do you see the limits of testing with synthetic data today? Do you know of any studies on this topic?