+Researchers and developers need to be able to trust the source of the data used to train a model. Please use only models and datasets from sources that you trust. The model card should specify which dataset(s), or which subsets of them, were used. Developers should review the data card of the training dataset so that they understand the limitations and social impact of using it. At the time of writing, 142 files in The Stack had been detected (using ClamAV) and marked as harmful, and developers are cautioned about using these files. Because new malware signatures for data in the dataset may only become known after the dataset's release, users should consider running their own scan of the dataset with up-to-date antivirus software before using the model. Users could also consider using a tool other than ClamAV as a second layer of scanning. If new malware is detected, please report the findings to [contact@bigcode-project.org](mailto:contact@bigcode-project.org).
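A re-scan of a locally downloaded copy of the dataset could be sketched roughly as follows. This is a minimal illustration, not the project's own tooling: it assumes the ClamAV `clamscan` command-line tool is installed, that its signature database has been refreshed (for example with `freshclam`), and that flagged files are reported as `<path>: <signature> FOUND`. The function names `parse_clamscan_output` and `scan_directory` and the `data_dir` argument are hypothetical.

```python
import shutil
import subprocess


def parse_clamscan_output(output: str) -> dict[str, str]:
    """Map each flagged file path to the signature name clamscan reported."""
    flagged = {}
    for line in output.splitlines():
        line = line.strip()
        if not line.endswith(" FOUND"):
            continue  # skip "OK" lines, warnings, and blank lines
        # Assumed line shape: "<path>: <signature> FOUND".
        # Split on the last ": " so paths containing colons still work.
        path, _, signature = line[: -len(" FOUND")].rpartition(": ")
        if path:
            flagged[path] = signature
    return flagged


def scan_directory(data_dir: str) -> dict[str, str]:
    """Recursively scan data_dir with clamscan and return the flagged files."""
    if shutil.which("clamscan") is None:
        raise RuntimeError("clamscan was not found on PATH")
    result = subprocess.run(
        ["clamscan", "--recursive", "--infected", "--no-summary", data_dir],
        capture_output=True,
        text=True,
        check=False,  # exit code 1 means "virus found", not a failure
    )
    # clamscan exit codes: 0 = clean, 1 = virus found, 2 = error occurred.
    if result.returncode not in (0, 1):
        raise RuntimeError(f"clamscan failed: {result.stderr.strip()}")
    return parse_clamscan_output(result.stdout)
```

Calling `scan_directory("path/to/dataset")` would return a dictionary of flagged paths and signature names, which can be compared against the files already marked as harmful before anything new is reported. Running a second scanner over the same directory would follow the same pattern with a different command and parser.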