Commit 3a261da
fix: improve regex for detecting language codes in NllbTokenizer
- Updated the regex pattern from /^[a-z]{3}_[A-Z]{3}$/ to /^[a-z]{3}_[a-zA-Z]{3,4}$/ to accommodate additional language code formats such as `eng_Latn` used by some models like `Xenova/nllb-200-distilled-600M`.
- This change ensures better compatibility without significant penalties for false positives.
- Thanks to @Thorry84 for the suggestion.

1 parent dd2481f commit 3a261da
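The effect of the pattern change described in the commit message can be checked with a small sketch. The two patterns are copied from the message and transcribed to Python's `re` syntax. The helper name `is_language_code` and the sample tokens are hypothetical, chosen only for illustration, and are not taken from the repository.

```python
import re

# Patterns quoted in the commit message, transcribed to Python's `re` syntax.
OLD_PATTERN = re.compile(r"^[a-z]{3}_[A-Z]{3}$")
NEW_PATTERN = re.compile(r"^[a-z]{3}_[a-zA-Z]{3,4}$")


def is_language_code(token: str, pattern: re.Pattern = NEW_PATTERN) -> bool:
    """Return True if the whole token matches the given pattern.

    Hypothetical helper for illustration; not the tokenizer's actual method.
    """
    return pattern.fullmatch(token) is not None


# Hypothetical sample tokens, chosen to show where the two patterns differ.
samples = ["eng_Latn", "fra_Latn", "abc_XYZ", "abc_defg", "eng_Latin", "hello"]

for token in samples:
    old = is_language_code(token, OLD_PATTERN)
    new = is_language_code(token, NEW_PATTERN)
    print(f"{token:<10} old={old!s:<5} new={new}")
```

In this sketch `eng_Latn` is rejected by the old pattern and accepted by the new one, which is the case the commit targets. A token such as `abc_defg` is also accepted by the new pattern even though it is not a real language code, which is the kind of false positive the message refers to.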
3 files changed
Lines changed: 5 additions & 4 deletions
File tree
- examples/pipelines
- src
  - Normalizers
  - PretrainedTokenizers
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 11 | 11 | |
| 12 | 12 | |
| 13 | 13 | |
| 14 | | - |
| | 14 | + |
| | 15 | + |
| 15 | 16 | |
| 16 | 17 | |
| 17 | 18 | |
| 18 | | - |
| | 19 | + |
| | 20 | + |
| 19 | 21 | |
| 20 | 22 | |
| 21 | 23 | |

| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 2 | 2 | |
| 3 | 3 | |
| 4 | 4 | |
| 5 | | - |
| 6 | 5 | |
| 7 | 6 | |
| 8 | 7 | |

| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 10 | 10 | |
| 11 | 11 | |
| 12 | 12 | |
| 13 | | - |
| | 13 | + |
| 14 | 14 | |
| 15 | 15 | |
| 16 | 16 | |