Maximizing the ROI of Large Language Models for the Large Enterprise
Emerging large language model (LLM) applications present a clear opportunity for the large enterprise. However, the risks involved in rapid and unchecked deployment of LLMs have already proven to be severe. Just last week, OpenAI accidentally exposed the titles of some of its users’ ChatGPT conversations without obtaining proper consent, an incident now being examined as a potential breach of the GDPR. While that evaluation is still ongoing, data privacy fines under the GDPR and related regulations can be severe for large enterprises: Amazon was fined $877 million in 2021 for GDPR violations, and Didi, a Chinese ride-hailing company, was fined a whopping $1.2 billion in 2022 for Chinese cybersecurity violations. Beyond compliance risks, little prevents employees at large enterprises from accidentally sharing company secrets or confidential information with 3rd-party LLM services.
If large enterprises are to successfully deploy LLMs at scale and maximize ROI on their LLM investments, they must develop an implementation strategy that carefully addresses the risks tied to LLMs.
In this article, you will learn about:
- Several critical risk factors that enterprises should be aware of when leveraging LLM services.
- The risks involved in leaking highly confidential or private enterprise data and the severity of potential compliance violations.
- A path towards developing privacy-preserving and regulation-compliant LLM services that mitigates these risks.
Key privacy and compliance challenges when leveraging LLMs for the large enterprise:
Challenge 1: Employees querying 3rd-party APIs and LLM services can unknowingly share sensitive PII and company secrets with these services.
While employees are typically bound by NDAs, a large enterprise with 10,000+ employees can often struggle to ensure 100% workforce adherence to confidentiality agreements. The rampant public use of 3rd-party LLM services presents an immediate risk that employees may accidentally share confidential company information or regulated PII with 3rd-party LLM APIs hosted on servers or VPCs belonging to 3rd parties. For example, consider an engineer who wants to summarize the body of a confidential patent application to generate an abstract. This engineer could simply paste the confidential text into a query to a 3rd-party LLM API. The engineer’s employer can only hope that this 3rd-party service has implemented rigorous data deletion and data controls to ensure this confidential data is properly secured and removed upon request. For now, such hopes may only be wishful thinking: in the recent OpenAI incident, Twitter users claimed their exposed chat titles included personal and “highly sensitive” information.
Our Solution of Choice: To ensure sensitive data is retained and properly managed according to your company’s data governance standards, it is highly advisable that large enterprises invest in deploying their own local LLM services entirely within their VPC or on on-prem servers. These locally deployed services let employees query LLMs to augment their workflows in an environment fully contained within the enterprise, mitigating the risk of uncontrolled proliferation of sensitive and regulated enterprise data to 3rd-party services. To entice employees to use these locally deployed services, enterprises will need to develop LLMs with sufficiently competitive performance, which means fine-tuning models for targeted tasks on large volumes of task-specific data (see Challenge 2 below).
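One way this containment can be enforced in practice is at a thin internal gateway that refuses to route employee queries anywhere but the in-VPC endpoint. The sketch below illustrates the idea; the hostname, config keys, and internal-domain convention are all hypothetical placeholders, not a real API.

```python
# Minimal sketch of an internal LLM gateway policy: employee queries are
# only ever routed to an endpoint inside the enterprise VPC, never to an
# external 3rd-party API. All names here are hypothetical placeholders.

INTERNAL_LLM_ENDPOINT = "https://llm.internal.example-corp.com/v1/completions"

def resolve_llm_endpoint(config: dict) -> str:
    """Return the endpoint that LLM queries should be sent to.

    Fails loudly if a misconfiguration would route traffic outside the
    VPC, rather than silently leaking data to an external service.
    """
    endpoint = config.get("endpoint", INTERNAL_LLM_ENDPOINT)
    if ".internal." not in endpoint:
        raise ValueError(f"Refusing to route LLM traffic externally: {endpoint}")
    return endpoint
```

The design choice worth noting is that the default is the internal endpoint and external hosts are rejected outright, so the safe path is also the path of least resistance for employees.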
Challenge 2: Training and fine-tuning performant LLMs for enterprise applications requires the collection of massive (and potentially sensitive) datasets.
Today’s state-of-the-art LLMs are trained on 500+ GB of data from books, webtexts, and articles, then fine-tuned on additional data. For large enterprises to develop their own LLM services optimized for specific workloads (e.g. the creation of bespoke company reports or the summarization of routine documents), they will have to collect and aggregate large volumes of training data relevant to their applications. Large enterprises have the opportunity to leverage large volumes of data generated in-house within their organization, or potentially to capture data from the partners and clients they work with. However, large enterprises commonly struggle to aggregate data across siloed datasets even within their own organization. For example, a global enterprise may struggle to centrally aggregate data generated by branches located in different jurisdictions. Likewise, large financial institutions commonly face steep barriers to sharing data between internal teams (e.g. between wealth management, commercial, and retail banking business units). This unstructured text data may include company secrets or customer PII that needs to be carefully tracked and monitored for GDPR compliance.
Our Solution of Choice: Federated learning methodologies can be highly effective in addressing these data collection challenges. With federated learning, large enterprises can train and fine-tune LLMs across a series of siloed datasets (or data held by 3rd-party partners or customers) without ever needing to collect raw training data on a central machine learning server. Federated learning has already been deployed at scale by Apple and Google, and it is rapidly becoming an essential tool for large enterprise AI initiatives, with 80% of global enterprises projected to investigate federated learning by 2024.
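To make the mechanics concrete, here is a minimal sketch of one round of federated averaging (FedAvg), the canonical federated learning algorithm. Toy lists of floats stand in for real model parameters, and the silo gradients and dataset sizes are invented for illustration; in a real deployment each silo's gradient would come from training on its private data, which never leaves the silo.

```python
# Minimal sketch of one federated averaging (FedAvg) round: each silo
# updates the model locally on its private data and shares only the
# resulting weights; raw training data is never transmitted.

def local_update(weights, silo_gradient, lr=0.1):
    """One local training step inside a silo. The gradient is computed
    on that silo's private data, which stays within the silo."""
    return [w - lr * g for w, g in zip(weights, silo_gradient)]

def federated_average(silo_weights, silo_sizes):
    """Server-side aggregation: average the silos' updated weights,
    weighted by each silo's dataset size."""
    total = sum(silo_sizes)
    dim = len(silo_weights[0])
    return [
        sum(w[i] * n for w, n in zip(silo_weights, silo_sizes)) / total
        for i in range(dim)
    ]

# One federated round across three hypothetical silos:
global_weights = [0.0, 0.0]
silo_grads = [[1.0, 2.0], [3.0, 0.0], [1.0, 1.0]]
updates = [local_update(global_weights, g) for g in silo_grads]
global_weights = federated_average(updates, silo_sizes=[100, 50, 50])
```

Note that the central server only ever sees the three weight vectors in `updates`, never the documents behind them, which is what lets jurisdictionally separated branches contribute to one model.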
Challenge 3: LLM models by themselves present serious data leakage issues.
It has long been established in academic circles that deep neural network models trained on sensitive datasets carry severe data leakage risks. For example, simple methods can be used to reconstruct sensitive training data directly from a trained LLM, even when the attacker has zero access to the training dataset itself.
Some of the more prominent data leakage vulnerabilities pertaining to LLMs are described below:
Unintended Data Memorization: Consider the case where a generative LLM routinely outputs the exact text of one of its training documents. A 2021 paper described a statistical technique that identifies which LLM outputs have a high likelihood of being exact memorized samples from the training dataset. The authors extracted 604 unique memorized training examples simply by performing a series of random queries against a GPT-2 API, including 46 examples of PII (people’s names, phone numbers, and contact information).
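The core of that filtering idea is perplexity: text the model generates with abnormally low perplexity (i.e. the model is near-certain of every token) is a strong candidate for verbatim memorized training data. A toy version of the idea, with invented per-token log-probabilities standing in for a real LLM's scores, might look like:

```python
import math

# Toy sketch of a perplexity-based memorization filter: generated samples
# the model scores with unusually low perplexity are flagged as likely
# memorized training text. The log-probabilities below are invented
# stand-ins for scores a real LLM would produce.

def perplexity(token_logprobs):
    """Perplexity = exp(-mean per-token log-probability)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def flag_memorized(samples, threshold=1.5):
    """Return generated samples whose perplexity falls below the
    threshold -- candidates for verbatim memorized training data."""
    return [text for text, lps in samples if perplexity(lps) < threshold]

# Hypothetical per-token log-probs for two generated samples:
samples = [
    ("John Doe, 555-0199", [-0.05, -0.02, -0.01]),  # near-certain tokens
    ("the weather is nice", [-1.2, -0.9, -1.4]),    # ordinary fluent text
]
```

Here the PII-like sample would be flagged (perplexity near 1) while the ordinary sentence would not; the published attack refines this with comparisons against reference models to reduce false positives.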
Model Inversion Attacks: Similar to unintended memorization, model inversion is a powerful technique for reconstructing training data when an attacker has access only to the LLM model file. A team recently announced a model inversion attack on a GPT-2 model that successfully reconstructed spans of the text used to train it.
Membership Inference Attacks: In a membership inference attack, the attacker asks the question, “was this specific data point used in the training dataset?” Whereas the previous two attacks randomly generate data points with a high likelihood of being in the original training dataset, a membership inference attack can query whether a specific target phrase was included in the training data. This makes membership inference particularly powerful, and a major liability when training LLMs on sensitive datasets. Concerningly, a recent paper found that common forms of LLMs (called masked language models) “are extremely susceptible to likelihood ratio membership inference attacks”.
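The likelihood-ratio test mentioned in that quote compares how confidently the target model scores a candidate phrase against a reference model trained on similar public data. A minimal sketch of the decision rule, with stand-in loss values in place of real model evaluations (the threshold is an arbitrary illustrative choice), might look like:

```python
# Toy sketch of a likelihood-ratio membership inference test. Losses are
# per-token negative log-likelihoods; if the target model's loss on a
# phrase is much lower than a reference model's (i.e. the target is far
# more confident), the phrase was plausibly in the training set. The loss
# values used here are invented stand-ins, not real model outputs.

def likelihood_ratio(target_loss, reference_loss):
    """Positive when the target model assigns higher likelihood to the
    phrase than the reference model does."""
    return reference_loss - target_loss

def infer_membership(target_loss, reference_loss, threshold=1.0):
    """Flag the phrase as a suspected training-set member when the
    likelihood ratio exceeds the (illustrative) threshold."""
    return likelihood_ratio(target_loss, reference_loss) > threshold
```

The reference model is what makes the attack precise: it calibrates away phrases that every model finds easy, leaving only the ones the target model is suspiciously confident about.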
Our Solution of Choice: Differential privacy is a well-studied technique that reduces the effectiveness of these data leakage attacks by adding noise and limiting weight updates (“clipping gradients”) during the model training process. By adding noise to the training process, enterprises gain an element of “plausible deniability”: the outputs of the above three attacks are generated by a noised process, so an attacker cannot claim with confidence that a reconstructed output belongs to the original training data. Moreover, differential privacy substantially degrades the performance of inversion and membership inference attacks and has been shown to mitigate memorization vulnerabilities.
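The two mechanisms named above, gradient clipping and noise addition, are the heart of the DP-SGD training recipe. A minimal sketch of one aggregation step follows; gradients are plain lists of floats, and the clipping norm and noise multiplier are illustrative defaults rather than recommended privacy parameters.

```python
import math
import random

# Minimal sketch of the DP-SGD aggregation step: clip each per-example
# gradient to a maximum L2 norm, sum the clipped gradients, then add
# Gaussian noise before the model update. The clipping bounds any single
# example's influence, and the noise masks what remains.

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def clip_gradient(grad, max_norm=1.0):
    """Scale the gradient down so its L2 norm is at most max_norm."""
    norm = l2_norm(grad)
    if norm > max_norm:
        return [x * max_norm / norm for x in grad]
    return grad

def dp_aggregate(per_example_grads, max_norm=1.0, noise_multiplier=1.0):
    """Clip each per-example gradient, sum them, and add Gaussian noise
    with standard deviation noise_multiplier * max_norm."""
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm
    return [s + random.gauss(0.0, sigma) for s in summed]
```

Because clipping caps any one example's contribution and the noise scale is tied to that cap, no single training record can dominate the update, which is precisely the “plausible deniability” described above.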
How large enterprises can develop a robust strategy for leveraging LLMs.
The specific implementation strategy to pursue depends strongly on the targeted LLM use case (e.g. using LLMs for summarization, clause generation, or entity recognition each requires a unique strategy for safeguarding company data). However, as LLMs pervade a growing number of applications, enterprises should consider adopting several of the above solutions to mitigate and control the described risks.
DynamoFL provides large enterprises with the tools to rapidly implement the above solutions to drive more immediate ROI on their LLM projects (see below) while ensuring regulatory compliance and safeguarding user privacy.
We are strong proponents of the positive and transformative impact LLMs can have on society, and stand ready to assist enterprises in deploying these capabilities at scale, while complying to data governance and regulatory standards.
About the Author:
Vaikkunth Mugunthan, Ph.D. is the CEO and Cofounder of DynamoFL. Vaik completed his Ph.D. in Computer Science at MIT, where he developed techniques for privacy-preserving machine learning. At DynamoFL, Vaik leads a team of machine learning Ph.D.s and privacy experts from MIT, Harvard, and Berkeley with the goal of building the infrastructure for regulation-compliant AI.