Microsoft, a renowned tech giant, recently experienced a major data leak that has raised questions about the security of cloud storage services such as AWS and Azure. The incident underscores the inherent risks of misconfigured cloud storage and of testing applications with real data.
Let’s delve into the details of the incident, explore the potential risks of unsecured cloud storage, and understand how monitoring can help prevent such leaks.
The Incident: A Brief Overview
Microsoft’s AI research team inadvertently exposed 38 TB of the company’s private data. The cause was a misconfigured Shared Access Signature (SAS) token, a feature provided by Microsoft Azure for sharing data stored in its cloud services.
The SAS token was published in a GitHub repository containing open-source code and AI models for image recognition. However, the token was scoped far too broadly, granting public access to the entire Azure storage account rather than just the files intended for sharing.
The exposed data included sensitive information such as passwords to Microsoft services, secret keys, and over 30,000 internal Microsoft Teams messages.
Is 38 TB of internal data exposed in a breach a lot?
Yes, 38 TB (terabytes) of internal data being exposed in a breach is a significant amount. To put this into perspective:
- Text Data: A single terabyte can store about 312,500,000 pages of plain text. 38 TB would be approximately 11,875,000,000 pages of plain text.
- Documents: If we consider an average size for a document (like a Word file) to be around 2 MB, then a terabyte can store around 500,000 such documents. 38 TB would be about 19,000,000 documents.
- High-Resolution Images: High-resolution photos from modern cameras can range from 5 to 25 MB in size. Even at the larger end of that range, 38 TB could hold more than 1.5 million high-resolution photos.
- Videos: A one-hour HD video might be about 1-2 GB in size. Thus, 38 TB could store roughly 19,000 to 38,000 hours of HD video content (see the quick calculation sketched below).
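For readers who want to check these figures, here is a quick back-of-the-envelope calculation in Python. The per-item sizes are the rough assumptions used above, not measured values.

```python
# Back-of-the-envelope conversion of 38 TB into everyday units.
# The per-item sizes are rough assumptions, not measured values.
TB = 10**12  # decimal terabyte in bytes
leak_bytes = 38 * TB

assumptions = {
    "plain-text pages (~3.2 KB each)": 3_200,
    "Word documents (~2 MB each)": 2 * 10**6,
    "high-res photos (~25 MB each)": 25 * 10**6,
    "hours of HD video (~2 GB each)": 2 * 10**9,
}

for label, size_bytes in assumptions.items():
    print(f"{leak_bytes / size_bytes:,.0f} {label}")
```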
In the context of a data breach:
- Volume: The sheer volume of 38 TB implies that the stolen data could include a combination of sensitive documents, emails, databases, media files, software, and more.
- Data Sensitivity: The impact of such a breach would depend on the nature of the data stolen. For instance, if the stolen data includes personal, financial, or proprietary information, the consequences could be dire.
- Implications: Beyond the immediate data loss, the organization could face reputational damage, legal repercussions, regulatory fines, and potential financial losses. For individuals whose data might be part of the breach, there could be implications related to privacy invasion, identity theft, and personal security.
- Detection and Recovery: Handling and analyzing such a large volume of data for post-breach investigations would be resource-intensive and time-consuming. Recovery efforts, both in terms of technical solutions and public relations, would be substantial.
In summary, 38 TB of exposed data is indeed a lot, and it points to a significant and serious breach that would entail considerable challenges and implications for any affected organization.
The Root Cause: Misconfigured SAS Tokens
Microsoft’s Azure provides SAS tokens as a means for users to grant specific access to specific files and resources in their storage accounts. These tokens can be configured to provide specific permissions, define the duration of access, and even limit access to specific IP addresses.
However, in this case, the SAS token was configured to provide full access to the entire storage account, instead of just the intended files. This misconfiguration made all the data stored in the account publicly accessible.
Moreover, SAS tokens do not expire by default, making them a potential security risk if not managed properly. It is recommended to use short-lived SAS tokens and apply the principle of least privilege wherever possible.
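To make the contrast concrete, here is a minimal sketch using the azure-storage-blob Python SDK that mints a read-only SAS scoped to a single blob and valid for one hour, rather than an account-level token with full permissions and no expiry. The account, container, and blob names are hypothetical, and a real deployment would keep the signing key in a secret store (or, better, use an Azure AD user delegation key) instead of hard-coding it.

```python
from datetime import datetime, timedelta, timezone

from azure.storage.blob import BlobSasPermissions, generate_blob_sas

# Hypothetical names, for illustration only.
ACCOUNT_NAME = "examplestorageacct"
ACCOUNT_KEY = "<signing-key-loaded-from-a-secret-store>"  # never hard-code in practice
CONTAINER = "public-models"
BLOB = "image-recognition/model-v1.onnx"

# Least privilege: one blob, read-only, expires in one hour.
sas_token = generate_blob_sas(
    account_name=ACCOUNT_NAME,
    container_name=CONTAINER,
    blob_name=BLOB,
    account_key=ACCOUNT_KEY,
    permission=BlobSasPermissions(read=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)

share_url = f"https://{ACCOUNT_NAME}.blob.core.windows.net/{CONTAINER}/{BLOB}?{sas_token}"
print(share_url)
```

A further advantage of a user delegation SAS, or a SAS backed by a stored access policy, is that access can be revoked centrally; an ad-hoc token signed with the account key remains valid until it expires or the key itself is rotated.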
The Risks of Insecure Cloud Storage
This incident highlights the inherent risk associated with using cloud storage services such as AWS, Azure, and Google Cloud. Misconfigured storage buckets have become one of the most common sources of data leaks.
While these services offer robust security features, they also require a certain level of technical know-how to configure effectively. Default deployments often lack restrictive access controls, and unless administrators explicitly enable them, data stored in these services may be left exposed.
Moreover, the use of real data for testing applications can also lead to data leaks, especially if the data isn’t adequately protected or if the testing environment isn’t properly isolated.
The Solution: Monitoring and Governance
One of the effective ways to prevent such data leaks is through diligent monitoring and governance. Security teams need to have visibility into the creation and usage of SAS tokens and other similar mechanisms. They also need to be able to monitor and govern the permissions granted through these tokens. For instance, Azure provides monitoring and logging capabilities that can be used to track the usage of SAS tokens. These logs can be analyzed to detect any anomalous activities or potential security risks.
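As a rough illustration, assuming the storage account’s diagnostic logs are routed to a Log Analytics workspace (so the StorageBlobLogs table is populated), a script like the following could summarize SAS-authenticated blob requests using the azure-monitor-query Python package. The workspace ID is a placeholder, and the exact log schema may vary with configuration.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

# Placeholder: the Log Analytics workspace that receives the storage
# account's diagnostic (resource) logs.
WORKSPACE_ID = "<log-analytics-workspace-id>"

# KQL: count SAS-authenticated blob requests by caller IP and operation,
# so unexpected readers stand out.
QUERY = """
StorageBlobLogs
| where AuthenticationType == "SAS"
| summarize requests = count() by CallerIpAddress, OperationName
| order by requests desc
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(days=1))

for table in response.tables:
    for row in table.rows:
        print(row)
```

Queries like this can also be turned into scheduled alert rules so that an unusual spike in SAS traffic is flagged automatically rather than discovered during a manual review.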
Furthermore, security teams should work closely with data science and research teams to ensure proper guardrails are defined and followed whenever data is shared.
Code-sharing platforms like GitHub are commonly used by developers to share and collaborate on code, but they can also pose a security risk if not used properly. In the case of the Microsoft data leak, the SAS token was published in a public GitHub repository, and anyone who came across that repository could access the Azure storage account linked to the token. It is therefore crucial to monitor the use of code-sharing platforms and ensure that sensitive data or credentials are not inadvertently committed to them.
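Dedicated secret-scanning tools (GitHub secret scanning, gitleaks, truffleHog, and similar) are the right long-term answer here, but even a crude check before pushing code can catch obvious mistakes. The sketch below simply searches a working copy for query strings that look like SAS tokens; it is illustrative only and will produce both false positives and false negatives.

```python
import re
import sys
from pathlib import Path

# Rough heuristic: Azure SAS query strings carry an "sv=" (service version)
# parameter and a "sig=" (signature) parameter. Real secret scanners are far
# more thorough; this is only an illustrative pre-push check.
SAS_PATTERN = re.compile(r"sv=\d{4}-\d{2}-\d{2}[^\s\"']*sig=[A-Za-z0-9%+/=]+")


def scan(repo_root: str) -> int:
    """Return the number of suspected SAS tokens found under repo_root."""
    hits = 0
    for path in Path(repo_root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        for match in SAS_PATTERN.finditer(text):
            hits += 1
            print(f"{path}: possible SAS token -> {match.group(0)[:40]}...")
    return hits


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    sys.exit(1 if scan(target) else 0)
```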
Preventing Leaks: Best Practices
In light of the incident, here are some best practices to prevent such data leaks:
- Least Privilege Access: Always follow the principle of least privilege when granting access to resources. Ensure that users have only the permissions they need and nothing more.
- Regular Audits: Conduct regular audits of your cloud storage accounts to detect any misconfigurations or security risks (a small audit sketch follows this list).
- Use Short-Lived Tokens: Use short-lived SAS tokens or similar mechanisms to provide temporary access to resources.
- Monitoring and Logging: Utilize the monitoring and logging capabilities provided by your cloud service provider to track the usage of SAS tokens and other similar mechanisms. You can also use Kaduu’s Code & Cloud Storage Monitoring Solution.
- Education and Training: Educate your employees about the risks associated with sharing sensitive data or credentials on public platforms like GitHub. Provide training on how to use these platforms securely.
- Separate Storage for Public Data: If you need to share data publicly, consider creating a separate storage account for this purpose. This can help prevent inadvertent exposure of private data.
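As an example of what a lightweight audit could look like, the sketch below lists the containers in a storage account and flags any that allow anonymous access. The account URL is hypothetical, and the caller is assumed to hold a suitable reader role on the account.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Hypothetical storage account; sign in with an identity that has a
# data-plane reader role on the account.
ACCOUNT_URL = "https://examplestorageacct.blob.core.windows.net"

service = BlobServiceClient(account_url=ACCOUNT_URL, credential=DefaultAzureCredential())

# public_access is None for private containers, or "blob"/"container"
# when anonymous access has been enabled.
for container in service.list_containers():
    props = service.get_container_client(container.name).get_container_properties()
    if props.public_access:
        print(f"WARNING: container '{container.name}' allows anonymous "
              f"'{props.public_access}' access")
    else:
        print(f"ok: container '{container.name}' is private")
```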
Why Does AI Need So Much Data for Training in the First Place?
- Generalization: A fundamental goal of AI is to make predictions or decisions on new, previously unseen data. Training on a large and diverse dataset ensures that the model has seen a variety of scenarios, enabling it to generalize better to new data.
- Model Complexity: Modern deep learning models can have millions or even billions of parameters. To optimize these parameters without overfitting, substantial amounts of data are required. Overfitting is when a model learns the training data too well, including its noise and outliers, which can lead to poor performance on new data.
- Feature Learning: Unlike traditional machine learning models where features are hand-engineered, deep learning models learn features directly from the data. This process requires more data to effectively learn important and intricate features.
- Reducing Bias: If trained only on small, non-representative datasets, models can become biased or unfair. A broader dataset can help in capturing a wide array of scenarios, reducing the chance of unintentional biases.
What Are the Other Risks of Exposing Training Data in Cloud Storage?
- Privacy Concerns: If the training data contains personal or sensitive information, unauthorized access to it can lead to serious privacy breaches.
- Intellectual Property: The data might be proprietary, giving a business or research group a competitive advantage. Unauthorized access can lead to the theft of this intellectual property.
- Data Tampering: If attackers gain access to the data, they might tamper with it, leading to the training of compromised or malicious AI models.
- Regulatory Violations: Exposing certain types of data, especially personally identifiable information (PII), can lead to violations of regulations like the General Data Protection Regulation (GDPR) or the California Consumer Privacy Act (CCPA).
- Misinterpretation: People without the proper context might misinterpret the raw training data, leading to misunderstandings or incorrect conclusions.
Is this the first time real client data has been used for testing purposes?
There have been instances where test databases or environments for applications contained sensitive data, which was inadvertently exposed. While these mistakes don’t always make global headlines like massive data breaches do, they are significant concerns for affected individuals and the organizations responsible for the oversight. Here are a few examples:
- Accenture: The consulting giant left multiple Amazon Web Services (AWS) S3 storage buckets unsecured. Among the exposed data were internal access keys and credentials, customer data, and data stored for testing purposes. Some of the buckets appeared to be for “inner app” use, suggesting they were test or demo environments.
- U.S. Army Intelligence and Security Command: A researcher discovered an unsecured AWS S3 bucket that contained highly sensitive data related to the Army’s Distributed Common Ground System. Some of the data seemed to be related to a virtual machine, likely used for training or development testing.
- Elasticsearch Instances: Security researchers have found multiple cases where companies inadvertently left Elasticsearch databases used for testing or development exposed without any authentication. This kind of error has led to the exposure of personal information of millions of users across various incidents.
- Various MongoDB Instances: There have been numerous reports over the years of unsecured MongoDB databases being exposed on the internet. In many cases, these databases were test or development instances that organizations failed to properly secure, leading to data leaks.
- Application Logs and Stack Traces: While not a “database” per se, it’s not uncommon for application logs or verbose error messages (which might be more prevalent in test or development versions of software) to inadvertently expose sensitive information, including credentials, internal IP addresses, or even user data.
These incidents underscore the importance of ensuring that test and development environments are secured, even if they’re intended for internal use only. It’s essential that any sensitive data used in these environments be sanitized or obfuscated to prevent potential exposure.
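What “sanitized or obfuscated” can mean in practice is sketched below: sensitive fields are replaced with stable pseudonyms before records are copied into a test environment. The field names are hypothetical, and real projects would normally rely on a dedicated data-masking tool or library.

```python
import hashlib

# Minimal illustration of sanitizing records before they reach a test or
# development environment. Field names are hypothetical.

def pseudonymize(value: str, salt: str = "rotate-this-salt") -> str:
    """Replace a sensitive value with a stable, non-reversible token."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]


def sanitize_record(record: dict) -> dict:
    clean = dict(record)
    # Stable tokens preserve referential integrity while removing PII.
    clean["email"] = pseudonymize(record["email"]) + "@example.invalid"
    clean["name"] = "user_" + pseudonymize(record["name"])
    clean.pop("password_hash", None)  # never copy credentials into test data
    return clean


prod_row = {"name": "Jane Doe", "email": "jane@example.com", "password_hash": "..."}
print(sanitize_record(prod_row))
```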
Conclusion
The Microsoft data leak incident serves as a stark reminder of the risks associated with using cloud storage services and testing applications with real data. It underscores the importance of proper configuration, diligent monitoring, and good security practices in preventing such leaks.
As the adoption of cloud services and AI models grows, so does the need for robust security measures. By following best practices and leveraging the security features provided by cloud service providers, organizations can significantly reduce the risk of data leaks and ensure the security of their data.