
Commit fa759ec (parent 9bf9a9c)

1 file changed: content/en/docs/pages/bigcode-openrail.md (11 additions, 10 deletions)

We are releasing [StarCoder](https://huggingface.co/bigcode/starcoder) and [StarCoderBase](https://huggingface.co/bigcode/starcoderbase), which are licensed under the [BigCode OpenRAIL-M license agreement](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement), as we initially stated [here](https://www.bigcode-project.org/docs/about/ip/) and in our membership form.

## What is an OpenRAIL license agreement?
Open Responsible AI Licenses (OpenRAIL) are licenses designed to permit free and open access, re-use, and downstream distribution of modifications of AI artifacts, for research, commercial or non-commercial purposes, as long as the use restrictions present in the license agreement always apply (including to modified material). For more information, please access the RAIL Initiative [post](https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses).

## Why Open?
The term “Open” indicates that the license agreement enables royalty-free access, downstream use, and re-distribution of the licensed material, as well as distribution of any modifications of it.

You can use any type of legal agreement you want to share the model or modifications of it, as long as you include:

- the license agreement paragraph related to the responsible use of the model (i.e., paragraph 4) or a similar provision that accomplishes the same purpose, and
- the use restrictions (i.e., Attachment A) or a similar set of restrictions that accomplish the same purpose.

See FAQ below: “Do use restrictions apply to modifications of the model?”.

## Responsible?
Responsible AI licensing is a mechanism that is part of, and interacts with, a broader system of AI governance instruments and processes, such as Model Cards and regulations.

The BigCode OpenRAIL-M license agreement is designed to promote responsible downstream use and sharing of the model by including a set of use restrictions specifying purposes for which the model cannot be used. In the case of the BigCode OpenRAIL-M, the restrictions are mainly inspired by BigScience’s approach to the licensing of LLMs, and also include specific use cases where we believe code generation models could be used as a harmful instrument due to either the intent of the user or the technical limitations of the model.

## What does the “M” stand for?
“M” stands for “Model”, which is the artifact being licensed under this license agreement and subject to use restrictions. See the naming conventions for RAIL Licenses [here](https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses).

## Is this an Open Source license?
This is not an open source license according to the [Open Source Initiative](https://opensource.org/osd) definition, because it has some restrictions on the use of the model.

## What are we sharing?
We are sharing a Large Language Model for code generation, StarCoder, developed under the BigCode project. Any source code relevant to the BigCode ML models is licensed under the Apache 2.0 license.

Restriction (c) forbidding the use of the Model to generate and/or disseminate malware …
## When defining “Output”, what do you mean by the “results of operating the Model”?
The results of operating the model can be understood as (but are not limited to) the data generated by the model in a specific format; for example, the code generated by the model given a specific prompt, or the code suggestions generated by the model when embedded in a co-pilot UI.
## Can you give examples of “Modifications of the Model”?
Modifications to the Model include, but are not limited to:
- Quantization, fine-tuning, prompt-tuning, using adaptors, and other related methods for updating a model. This includes the use of intermediate data representations via distillation methods.
## What’s the scope of “Explanatory Documentation”?
Explanatory Documentation refers to any type of documentation or technical specifications accompanying the Model or Modifications to the Model. For example, model cards are one of the most used, if not the most used, documentation formats when distributing a model. Model cards describe general and key characteristics of the model, such as performance, technical limitations, or data sources. Tools such as model cards (or data cards) are essential to foster AI documentation along the value chain and access to basic core information of ML artifacts, such as models or datasets.

Some users may not be familiar with model cards; thus other formats/ways to document the model's key characteristics are also valid. For example, Meta has its [AI system cards](https://ai.facebook.com/tools/system-cards/), while Microsoft has its [Transparency Notes](https://learn.microsoft.com/en-us/legal/cognitive-services/openai/transparency-note?tabs=text). Another type of technical specification that you would normally attach to your model when distributing and/or commercializing it also works (papers, tech specs, etc.).

## Can you give other examples of the “Use” of the Model?
“Use” of the Model is anything that has to do with operating the model. Besides the example given in the agreement, other examples include an end-use application such as a chatbot or a code co-pilot UI.

Explanatory Documentation of the same or better quality means information that, …

For example, a finetuned version of StarCoder (i.e. a modification of the model) will have to include in its model card or documentation the same sections and accuracy of information as in the original StarCoder model card and, in addition, document the modifications made to the model.

## Can the output generated by the Model be subject to 3rd party rights?
Yes. The Output You generate using the Model may be subject to third party rights or licenses, including open source licenses. This is why we developed the [StarCoder: Dataset Search](https://huggingface.co/spaces/bigcode/search), in order for users to check whether the code generated by the model belongs to an existing open source repository, and if so, check the license and comply with it.

## Do I have to disclaim that the outputted code snippets were generated by a model?
According to restriction (f), you cannot use the model to generate code or distribute code without intelligibly disclaiming that the code was generated by the model. What does this mean in practice? If you finetune a BigCode model, embed it into an app designed to be a code assistant, and plan to distribute the app as a SaaS, you shall make it clear to users of your app that the code it generates comes from a model. It might sound obvious to you, but not to others. It is important to be transparent with your users.
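
As a minimal sketch of this requirement, and assuming names of our own invention (`DISCLOSURE` and `with_disclosure` are illustrative, not prescribed by the license agreement), a code-assistant backend could attach an intelligible disclaimer to every snippet it returns:

```python
# Hypothetical sketch of one way to satisfy restriction (f): prepend an
# intelligible disclaimer to every model-generated snippet. The names
# below are illustrative, not part of the license agreement.

DISCLOSURE = "# NOTE: The following code was generated by a machine learning model.\n"

def with_disclosure(generated_code: str) -> str:
    """Return the generated snippet with a model-generation disclaimer prepended."""
    return DISCLOSURE + generated_code

print(with_disclosure("def add(a, b):\n    return a + b\n"))
```

How the disclaimer is surfaced (a comment in the snippet, a banner in the UI, a note in the documentation) is up to the distributor; the point is that users can intelligibly tell that the code is model-generated.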

Yes, you must accept the license agreement before you can access the model.

Community feedback is essential for us. We are grateful to receive it and happy to answer any questions related to the license agreement. Drafting use restrictions that everyone agrees on and understands for ML models is a challenge. There is a balance to be struck between restrictions being too generic, on the one hand, and restrictions being too narrow and not covering potentially harmful scenarios, on the other. We are open to your comments.

## What other RAILs are out there?
Since the [release](https://bigscience.huggingface.co/blog/the-bigscience-rail-license) of the BLOOM RAIL [license](https://huggingface.co/spaces/bigscience/license) in May 2022, there has been a proliferation of RAILs based on either the structure and content of the latter or just the aim of promoting responsible use of ML models, as OpenAI does with its [Usage Policies](https://openai.com/policies/usage-policies) or Meta does with its licenses for [OPT-175B](https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/MODEL_LICENSE.md), [BB3](https://github.com/facebookresearch/ParlAI/blob/main/parlai/zoo/bb3/MODEL_LICENSE.md), [SEER](https://github.com/facebookresearch/vissl/blob/main/projects/SEER/MODEL_LICENSE.md), [LLaMA](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform). Other RAIL licenses based on the BLOOM RAIL have been released, such as [BigScience OpenRAIL-M](https://www.licenses.ai/blog/2022/8/26/bigscience-open-rail-m-license), [CreativeML OpenRAIL-M](https://huggingface.co/spaces/CompVis/stable-diffusion-license), [SIL RAIL-M 1.0](https://huggingface.co/spaces/sil-ai/model-license), and [OpenRAIL++M](https://www.ykilcher.com/license).

## Is there more information?
Yes! Please have a look at the [RAIL Initiative](https://www.licenses.ai/) content.
Co-authors of the BigCode OpenRAIL-M license agreement:
Carlos Muñoz Ferrandis, Jennifer Robinson, Luis Villa, Danish Contractor, Jenny Lee, Thierry Paul, Sean Hughes, Leandro von Werra, Harm de Vries, Huu Nguyen, Yacine Jernite, David Lansky, Chris Akiki.

Yes. The BigCode OpenRAIL-M (i.e. the document itself) is licensed under a CC-BY …

## What if the model outputs code snippets belonging to an already existing repository under a permissive license? What about providing attribution and license notice?

Besides the opt out tool and process we developed for the community (“[Am I in the Stack?](https://huggingface.co/spaces/bigcode/in-the-stack)”), we developed [StarCoder: Dataset Search](https://huggingface.co/spaces/bigcode/search), a tool enabling users to identify whether the code generated by the model belongs to a code repository, in order for the user to be able to inspect licensing requirements and comply with them when using and distributing the code.

## What if the data used to train the model includes malicious code that was not previously detected in the dataset?
It is important that researchers and developers can trust the source of data used to train a model. Please only use models and datasets from sources that you trust. The model cards should specify which dataset(s) or subsets of it were used. The [data card](https://huggingface.co/datasets/bigcode/the-stack) of the dataset used to train the models should be reviewed so that the developer understands the limitations and social impact of using the dataset. At the time of writing, 142 parquet files in The Stack had been detected (using ClamAV) and marked as harmful, and developers are cautioned against using these files. Since new malware signatures for data contained in the dataset may only become known after the dataset was released, users should consider doing their own scan of the dataset using the most current anti-virus software before using the model. Users could also consider using a tool other than ClamAV to provide a second layer of scanning. If new malware is detected, please report these findings to contact@bigcode-project.org.
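
As an illustration of such a second layer, the sketch below compares dataset file digests against a denylist of flagged files. It complements, rather than replaces, a real anti-virus scan; `KNOWN_BAD` and the helper names are hypothetical, not a real signature set or a BigCode tool:

```python
# Hypothetical second-layer check alongside an anti-virus scan such as
# ClamAV: flag dataset files whose sha256 digest appears in a denylist.
# KNOWN_BAD is illustrative; it is not a real signature set.
import hashlib
from pathlib import Path

KNOWN_BAD: set[str] = set()  # e.g. published digests of flagged parquet files

def sha256_of(path: Path) -> str:
    """Stream a file through sha256 and return its hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def flagged_files(dataset_dir: Path) -> list[Path]:
    """Return dataset files whose digest is on the denylist."""
    return [p for p in sorted(dataset_dir.rglob("*.parquet"))
            if sha256_of(p) in KNOWN_BAD]
```

A hash denylist only catches files already known to be bad, which is exactly why the paragraph above also recommends re-scanning with up-to-date anti-virus signatures.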

Furthermore, code generated using the model should be scanned by the developer for malware before it is used (as malware could impact the user's own machine) and/or distributed, since disseminating malware could be considered a criminal activity. Users are responsible for the inputs (prompts) and outputs (generated text/code) from using the model or an application powered by the model.