Shanghai’s police database is for sale in what could be China’s biggest ever data breach. China is home to 1.4 billion people, which means the data breach could potentially affect more than 70% of the population.

Unknown hackers claim to have stolen the data of nearly one billion Chinese residents after breaching a Shanghai police database. They are selling more than 33 terabytes of stolen data from the database for 10 bitcoin (around US$200,000). The database includes names, addresses, birthplaces, national IDs and phone numbers, as well as criminal case information. The hackers claimed the database was hosted in the cloud and accessible to anyone without restrictions.

What is a Data Leak?

A data leak is an unauthorized transfer of data from within a company to a third party. It can happen in many ways, such as through emails or file transfers, or through unauthorized access to cloud storage or to physical devices such as laptops and USB keys.

In a nutshell, we can group data leaks into two categories:

• External: Hackers target a company’s IT infrastructure and steal its data.

• Internal: These leaks can be accidental or deliberate.

o Accidental data leaks occur when an employee or a partner makes a security mistake, for example by emailing confidential information to the wrong recipient. Accidental data leaks also include losing a company laptop or using weak passwords.

o Internal threats are employees or partners who purposely leak information, either as revenge or simply to sell sensitive data.

Most data leaks are internal, whether accidental or deliberate. They open a Pandora’s box of risks for companies, such as insurance premium increases, lawsuits, regulatory fines and media embarrassment.

What is the real reason for this Data Leak?

The hackers behind this massive data leak claimed the database was hosted in the cloud and accessible to anyone without restrictions. Cybersecurity experts say this breach could be the largest in the country’s history, and that it is not unusual to find databases left open to public access. Unsecured PII (Personally Identifiable Information), exposed through leaks, breaches, or some form of incompetence, is an increasingly common problem for companies and governments around the world.

If this very sensitive database was stored in the cloud without proper security measures, it is probably because real data was shared with project team members, such as:

• IT Partners.

• Developers/testers for analytics projects.

• Data scientists training new AI/ML models (AI: Artificial Intelligence, ML: Machine Learning).

How can your company avoid such a Data Leak?

To reduce the risk of data leaks, companies usually follow data security best practices such as zero trust, security by design, having employees and partners sign non-disclosure agreements, securing endpoint devices, and monitoring data access.

CloudTDMS.com, a No-Code Cloud solution specialised in Test Data Management, gives every company this recommendation: “protect companies by NEVER using real data for on-going project’s development or testing phases & environments”.

The forgotten root cause of many Data Leaks: Production/Real Data (all or subsets) being shared with IT partners, developers & testers for ongoing projects

With the rise of remote working and of collaboration with onshore and offshore partners, data breaches are happening more frequently than ever before, so no company can assume it won’t happen to them. One piece of advice from CloudTDMS.com is often forgotten in day-to-day data projects: “Never share real/production data or provide access to production databases to project’s team members”. This holds even if, for example, the company’s CEO urgently requests an important dashboard, or a data project team urgently needs real data to train a new AI/ML model or during dev/test/QA phases. Production/real data is the easiest option for every member of a data project team, BUT it is the most dangerous option and not the only solution!

Some companies may think that encrypting production data solves the problem, but the devil is in the details: it is impossible to encrypt all systems and data at all times, data is decrypted at some point during workflow processing, it is impossible to prove that encryption is working all the time and, finally, encryption gives a false sense of security.

Join the new era of Synthetic/Realistic Data! It is becoming the new fuel for all data-related projects. In other words, any company can create synthetic/realistic test data without taking it from production.

Gartner predicts that by 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated. Forrester recommends synthetic data to accelerate the development of new AI solutions, improve the accuracy of AI models, and protect sensitive data. It is currently being used in autonomous vehicles, financial services, insurance and pharmaceutical firms, and computer vision vendors.

How could companies make IT projects secure and fully compliant with security policies and regulations such as GDPR?

To fuel any IT project, you will need to create synthetic/realistic data in order to address the main challenges, such as:

• Regulatory compliance: Over 55% of companies are not fully compliant with data privacy policies because development and test teams have access to all or a subset of production data. The impact can be extremely high: GDPR penalties range from €10 million or 2% of annual revenue, whichever is greater, up to €20 million or 4%, depending on the severity of the breach.

• Test data generation: Up to 45% of development/test time is spent manually generating test data. This severely impacts the productivity of the teams as well as the "Time to Market".

• Data discovery/profiling: Up to 85% of data is still profiled manually, resulting in incomplete and inconsistent data.

• Automation: Over 70% of test data is still created manually in IT projects.
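To make the two GDPR penalty tiers mentioned above concrete, here is a minimal Python sketch of the arithmetic. The function name `max_gdpr_fine`, its parameters and the example revenue figures are illustrative assumptions; the sketch only computes the upper limit of a fine (the greater of the fixed amount and the percentage), not what a regulator would actually impose.

```python
# Illustrative sketch: upper limits of GDPR fines for the two tiers.
# Function and parameter names are hypothetical.

# tier -> (fixed amount in euros, percentage of annual revenue)
TIERS = {
    "lower": (10_000_000, 2),
    "upper": (20_000_000, 4),
}


def max_gdpr_fine(annual_revenue_eur: float, tier: str = "upper") -> float:
    """Return the maximum possible fine for a tier: the greater of the
    fixed amount and the percentage of annual revenue."""
    fixed_amount, percent = TIERS[tier]
    return max(fixed_amount, annual_revenue_eur * percent / 100)


if __name__ == "__main__":
    for revenue in (100_000_000, 1_000_000_000):
        print(revenue, max_gdpr_fine(revenue, "lower"), max_gdpr_fine(revenue, "upper"))
```

For a company with €100 million in annual revenue, the fixed amounts (€10 million and €20 million) are the binding limits. At €1 billion, the percentages take over (€20 million and €40 million).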

So, to make your IT projects secure and fully compliant with regulations, you can generate synthetic data either with open-source Python libraries such as Faker or with a No-Code Cloud Solution that offers a free plan, such as CloudTDMS.com.

First, Faker is a powerful Python library that generates fake data and is very simple to use. This in-house development approach works, but generating data with Faker requires coding skills. Another drawback of this approach is that it is not configurable: for instance, the script needs to be modified completely for each new object/table/file.

On the other hand, the CloudTDMS.com solution is a No-Code platform with all the functionality required for test data management, such as:

• Generating data that is synthetic/realistic just like production data,

• Data modelling with many accelerators to speed up this important step in any data project,

• Built-in integrations with databases & cloud solutions such as AWS-S3, AWS-Redshift, Google-Drive, DropBox, MySQL, Oracle, Salesforce or ServiceNow,

• Management of data repositories (Data Foundation),

• Data discovery & profiling, allowing collaboration between team members (data architects, data scientists, developers & testers),

• Data Protection and Masking.

CloudTDMS.com offers a free plan called the "Starter plan".

Don’t Become a Headline!

With a synthetic data approach, you can fuel any data project and protect your company, for free. You can achieve this either through in-house development with open-source tools such as the Python Faker library or by using a No-Code Cloud Solution such as CloudTDMS.com.

In a nutshell, one piece of advice: fake it until you make it!