update model license

harm-devries · harm-devries · commit 0b1c03ebff7d · 2022-12-22T09:19:04.000+01:00
diff --git a/content/en/docs/pages/model-license.md b/content/en/docs/pages/model-license.md
@@ -1,5 +1,5 @@
 ---
-title: "Model License - FAQ"
+title: "The Model License"
 description: ""
 lead: ""
 date: 2020-11-12T15:22:20+01:00
@@ -82,4 +82,15 @@ Yes! Please have a look at the RAIL Initiative [FAQ](https://www.licenses.ai/faq
 
 Besides the opt out tool and process we developed for the community (“[Am I in the Stack?](https://huggingface.co/spaces/bigcode/in-the-stack)”), we are currently working on tools enabling users to identify whether the code generated by the model belongs to a code repository, in order for the user to be able to inspect licensing requirements and comply with them when using and distributing the code.
 
+## What if the data used to train the model includes malicious code that was not previously detected in the dataset?
+
+It is important that researchers and developers can trust the source of the data used to train a model. Please only use models and datasets from sources that you trust. The model card should specify which dataset(s), or subsets thereof, were used. Developers should review the data card of the training dataset to understand the limitations and social impact of using it. At the time of writing, 142 files in The Stack had been detected (using ClamAV) and marked as harmful, and developers are cautioned about using these files. Since new malware signatures for data contained in the dataset may only become known after the dataset was released, users should consider running their own scan of the dataset with up-to-date anti-virus or anti-malware software before using the model. Users could also consider using a tool other than ClamAV to provide a second layer of scanning. If new malware is detected, please report these findings to [contact@bigcode-project.org](mailto:contact@bigcode-project.org).
+Furthermore, developers should scan code generated by the model for malware before it is used, since malware could affect users' own machines, and before it is distributed, since disseminating malware could be considered a criminal activity. Users are responsible for the inputs (prompts) and outputs (generated text/code) that result from using the model or an application powered by the model. Developers should include this kind of code review and risk assessment throughout their software development lifecycle. Developers are ultimately responsible for adhering to secure coding practices when building downstream applications or working with generated code. There are likely additional considerations not covered in this FAQ, and users should consult a security expert to review the unique aspects of their own projects.
+Lastly, developers should inform end-users of their applications that there is a risk of malware, and should take the necessary precautions, such as using up-to-date anti-virus or anti-malware software. End-users of code generated by a developer's downstream application should also be aware of their responsibility to scan that code before using it and/or before publishing their own downstream applications.
+
+## What if the model hallucinates sensitive data in its output?
+A large language model can hallucinate and output sensitive data such as Personally Identifiable Information (PII). This is an unintended output, but it is a real risk that should be managed deliberately and with care. While the dataset may already have been screened for PII, some PII may not have been completely or accurately detected, annotated, or anonymized. As with malware, users of the model are responsible for adhering to laws and best practices pertaining to the handling and processing of PII in their own use of the model.
+
+Contact details:
+BigCode: [contact@bigcode-project.org](mailto:contact@bigcode-project.org)