Real Risks of Fake Data

Caute_cautim · 08-21-2024

Hi All

Responsible data development is at the core of Responsible AI (RAI). If a training dataset is poorly constructed (for example, with under-represented groups or skewed data), the resulting model will be biased. In AI development, using real data has privacy, ethical, and IP implications, to name a few. On the other hand, synthetic (AI-generated) data is not a panacea, however much it has been hailed as one. It leads to other kinds of downstream issues that need to be taken into account.

This paper explores two key risks of using synthetic data in AI model development:
1. Diversity-washing (synthetic data can give the appearance of diversity)
2. Consent circumvention (consent stops being a “procedural hook” that limits downstream harms from AI model use; this, along with data source obfuscation, complicates enforcement)

The paper focuses on facial recognition technology (FRT), highlighting the risks of using synthetic data and the trade-offs between utility, fidelity, and privacy. Participatory governance models, together with data lineage and transparency, are crucial to mitigating these risks.
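
To make the diversity-washing and data lineage points a bit more concrete, here is a minimal Python sketch. It is an illustration only, not taken from the paper: the record fields `group` and `source`, the helper `group_shares`, and the toy numbers are all hypothetical. The idea is that if every record carries a lineage tag saying whether it is real or synthetic, you can compare group representation in the full dataset against the real-only subset.

```python
from collections import Counter


def group_shares(records, only_source=None):
    """Return each group's share of the records.

    If only_source is given, only records with that lineage tag are counted.
    """
    selected = [
        r for r in records
        if only_source is None or r["source"] == only_source
    ]
    counts = Counter(r["group"] for r in selected)
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()} if total else {}


# Hypothetical toy dataset: group B is rare in the real data and has been
# "topped up" with synthetic samples.
records = (
    [{"group": "A", "source": "real"}] * 90
    + [{"group": "B", "source": "real"}] * 10
    + [{"group": "B", "source": "synthetic"}] * 80
)

overall = group_shares(records)                       # looks balanced
real_only = group_shares(records, only_source="real")  # shows the real skew

print("overall:", overall)
print("real only:", real_only)
```

In this toy case the full dataset looks evenly split between the two groups, while the real-only view shows group B at only a tenth of the records. Without the `source` tag that gap would be invisible, which is one reason lineage and transparency matter here.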

https://media.licdn.com/dms/document/media/D4E1FAQEQEHlQ7cTbAA/feedshare-document-pdf-analyzed/0/172...

Regards

Caute_Cautim