We are releasing [StarCoder](https://huggingface.co/bigcode/starcoder) and [StarCoderBase](https://huggingface.co/bigcode/starcoderbase), which are licensed under the [BigCode OpenRAIL-M license agreement](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement), as we initially stated [here](https://www.bigcode-project.org/docs/about/ip/) and in our membership form.
## What is an OpenRAIL license agreement?
Open Responsible AI Licenses (OpenRAIL) are licenses designed to permit free and open access, re-use, and downstream distribution of modifications of AI artifacts, for research, commercial or non-commercial purposes, as long as the use restrictions present in the license agreement always apply (including to modified material). For more information, please access the RAIL Initiative [post](https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses).
## Why Open?
The term “Open” indicates that the license agreement enables royalty-free access, downstream use, and re-distribution of the licensed material, as well as distribution of any modifications of it.
You can use any type of legal agreement you want to share the model or modifications of it, as long as you include:

- the license agreement paragraph related to the responsible use of the model (i.e., paragraph 4) or a similar provision that accomplishes the same purpose, and
- the use restrictions (i.e., Attachment A) or a similar set of restrictions that accomplish the same purpose.

See FAQ below: “Do use restrictions apply to modifications of the model?”.
## Responsible?
Responsible AI licensing is a mechanism that is part of - and interacts with - a broader system of AI governance instruments and processes, such as Model Cards and regulations.
The BigCode OpenRAIL-M license agreement is designed to promote responsible downstream use and sharing of the model by including a set of use restrictions specifying purposes for which the model cannot be used. In the case of the BigCode OpenRAIL-M, the restrictions are mainly inspired by BigScience’s approach to the licensing of LLMs, and also include specific use cases where we believe code generation models could be used as a harmful instrument, due to either the intent of the user or the technical limitations of the model.
## What does the “M” stand for?
“M” stands for “Model”, which is the artifact being licensed under this license agreement and subject to use restrictions. See the naming conventions for RAIL Licenses [here](https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses).
## Is this an Open Source license?
This is not an open source license according to the [Open Source Initiative](https://opensource.org/osd) definition, because it has some restrictions on the use of the model.
## What are we sharing?
We are sharing a Large Language Model for code generation -StarCoder- developed under the BigCode project. Any source code relevant to the BigCode ML models is licensed under the Apache 2.0 license.
Restriction (c) forbids the use of the Model to generate and/or disseminate malware.
## When defining “Output”, what do you mean by the “results of operating the Model”?
The results of operating the model can be understood - but not limited to - as the data generated by the model in a specific format. For example, the code generated by the model given a specific prompt, or, the code suggestion generated by the model when embedded in a co-pilot UI.
## Can you give examples of “Modifications of the Model”?
Modifications to the Model include - but are not limited to:
- Quantization, fine-tuning, prompt-tuning, using adapters, and other related methods for updating a model. This includes the use of intermediate data representations via distillation methods.
## What’s the scope of “Explanatory Documentation”?
Explanatory Documentation refers to any type of documentation or technical specifications accompanying the Model or Modifications to the Model. For example, model cards are one of the most -if not the most- used documentation format when distributing a model. Model cards describe general and key characteristics of the model, such as performance, technical limitations, or data sources. Tools such as model cards (or data cards) are essential to foster AI documentation along the value chain and access to basic core information of ML artifacts, such as models or datasets.
Some users may not be familiar with model cards; thus other formats/ways to document the model key characteristics are also valid. For example, Meta has its [AI system cards](https://ai.facebook.com/tools/system-cards/), while Microsoft has its [Transparency Notes](https://learn.microsoft.com/en-us/legal/cognitive-services/openai/transparency-note?tabs=text). Another type of technical specification that you would normally attach to your model when distributing and/or commercializing it also works (papers, tech specs, etc.).
## Can you give other examples of the “Use” of the Model?
“Use” of the Model is anything that has to do with operating the model. Besides the example given in the agreement, other examples include an end-use application such as a chatbot or a code co-pilot UI.
Explanatory Documentation of the same or better quality means information that …
For example, a finetuned version of StarCoder (i.e. a modification of the model) will have to include in its model card or documentation the same sections and accuracy of information as in the StarCoder original model card, and in addition, document the modifications made to the model.
## Can the output generated by the Model be subject to 3rd party rights?
Yes. The Output You generate using the Model may be subject to third party rights or licenses, including open source licenses. This is why we developed the [StarCoder: Dataset Search](https://huggingface.co/spaces/bigcode/search), in order for users to check whether the code generated by the model belongs to an existing open source repository, and if so, check the license and comply with it.
## Do I have to disclaim that the outputted code snippets were generated by a model?
According to restriction (f), you cannot use the model to generate code or distribute code without intelligibly disclaiming that the code was generated by the model. What does this mean in practice? If you finetune a BigCode model, embed it into an app designed to be a code assistant, and plan to distribute the app as a SaaS, you must make it clear to users of your app that the code it generates comes from a model. It might sound obvious to you, but not to others. It is important to be transparent with your users.
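One lightweight way to satisfy this in an application is to tag every model-generated snippet with a provenance notice before showing or storing it. A minimal sketch follows; the helper name and the disclaimer wording are illustrative assumptions, not something prescribed by the license:

```python
# Hypothetical compliance helper: prepend an intelligible provenance
# disclaimer to a model-generated snippet before it is displayed,
# stored, or distributed. The exact wording is your choice; the
# license only requires that the disclaimer be intelligible.
DISCLAIMER = "# This code was generated by a machine learning model (StarCoder)."

def tag_generated_code(snippet: str) -> str:
    """Return the snippet with a model-provenance disclaimer prepended."""
    return f"{DISCLAIMER}\n{snippet}"

print(tag_generated_code("def add(a, b):\n    return a + b"))
```

In a SaaS code assistant, the notice could equally be displayed once in the UI rather than embedded in each snippet, as long as it remains clear to users which code is model-generated.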
Yes, you must accept the license agreement before you can access the model.
Community feedback is essential for us. We are grateful to receive it and answer any questions related to the license agreement. Drafting use restrictions that everyone agrees on and understands for ML models is a challenge. There is a balance to be struck between restrictions being too generic, on the one hand, and restrictions being too narrow and not covering potentially harmful scenarios, on the other hand. We are open to your comments.
## What other RAILs are out there?
Since the [release](https://bigscience.huggingface.co/blog/the-bigscience-rail-license) of the BLOOM RAIL [license](https://huggingface.co/spaces/bigscience/license) in May 2022, there has been a proliferation of RAILs based on either the structure and content of the latter or just the aim of promoting responsible use of ML models, as OpenAI does with its [Usage Policies](https://openai.com/policies/usage-policies) or Meta does with its licenses for [OPT-175B](https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/MODEL_LICENSE.md), [BB3](https://github.com/facebookresearch/ParlAI/blob/main/parlai/zoo/bb3/MODEL_LICENSE.md), [SEER](https://github.com/facebookresearch/vissl/blob/main/projects/SEER/MODEL_LICENSE.md), [LLaMA](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform). Other RAIL licenses based on the BLOOM RAIL have been released, such as [BigScience OpenRAIL-M](https://www.licenses.ai/blog/2022/8/26/bigscience-open-rail-m-license), [CreativeML OpenRAIL-M](https://huggingface.co/spaces/CompVis/stable-diffusion-license), [SIL RAIL-M 1.0](https://huggingface.co/spaces/sil-ai/model-license), and [OpenRAIL++M](https://www.ykilcher.com/license).
## Is there more information?
Yes! Please have a look at the [RAIL Initiative](https://www.licenses.ai/) content.
Co-authors of the BigCode OpenRAIL-M license agreement:
Carlos Muñoz Ferrandis, Jennifer Robinson, Luis Villa, Danish Contractor, Jenny Lee, Thierry Paul, Sean Hughes, Leandro von Werra, Harm de Vries, Huu Nguyen, Yacine Jernite, David Lansky, Chris Akiki.
Yes. The BigCode OpenRAIL-M (i.e. the document itself) is licensed under a CC-BY …
## What if the model outputs code snippets belonging to an already existing repository under a permissive license? What about providing attribution and license notice?
Besides the opt out tool and process we developed for the community (“[Am I in the Stack?](https://huggingface.co/spaces/bigcode/in-the-stack)”), we developed [StarCoder: Dataset Search](https://huggingface.co/spaces/bigcode/search), a tool enabling users to identify whether the code generated by the model belongs to a code repository, in order for the user to be able to inspect licensing requirements and comply with them when using and distributing the code.
## What if the data used to train the model includes malicious code that was not previously detected in the dataset?
It is important that researchers and developers can trust the source of data used to train a model. Please, only use models and datasets from sources that you trust. The model cards should specify which dataset(s) or subsets of it were used. The [data card](https://huggingface.co/datasets/bigcode/the-stack) of the dataset used to train the models should be reviewed so that the developer understands the limitations and social impact of using the dataset - at the time of writing, 142 parquet files in The Stack had been detected (using ClamAV) and marked as harmful, and developers are cautioned about using these files. Since new malware signatures for data contained in the dataset may only become known after the dataset was released, users should consider doing their own scan of the dataset using the most current anti-virus malware software before using the model. Users could also consider using a tool other than ClamAV to provide a second layer of scanning. If new malware is detected, please report these findings to contact@bigcode-project.org.
Furthermore, code generated using the model should be scanned by the developer to check for malware before it is used (as the malware could impact the users' own machines) and/or distributed, as the dissemination of malware could be considered a criminal activity, and users are responsible for the inputs (prompts) and outputs (generated text/code) from using the model or an application powered by the model.