Commit 9916046

Added a block to make morphology in Czech UD more consistent.
1 parent d724abe commit 9916046

1 file changed

Lines changed: 21 additions & 0 deletions

udapi/block/ud/cs/fixmorpho.py

@@ -0,0 +1,21 @@
+"""
+A Czech-specific block to fix lemmas, UPOS and morphological features in UD.
+It should increase consistency across the Czech treebanks. It focuses on
+individual closed-class verbs (such as the auxiliary "být") or on entire classes
+of words (e.g. whether or not nouns should have the Polarity feature). It was
+created as part of the Hičkok project (while importing nineteenth-century Czech
+data) but it should be applicable to any other Czech treebank.
+"""
+from udapi.core.block import Block
+import re
+
+class FixMorpho(Block):
+
+    def process_node(self, node):
+        # In Czech UD, "být" is always tagged as AUX and never as VERB, regardless
+        # of the fact that it can participate in purely existential constructions
+        # where it no longer acts as a copula. Czech tagsets typically do not
+        # distinguish AUX from VERB, which means that converted data may have to
+        # be fixed.
+        if node.upos == 'VERB' and node.lemma == 'být':
+            node.upos = 'AUX'
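
The effect of the rule can be illustrated without installing udapi. The following is only a minimal sketch: the `Node` dataclass and the `fix_byt` helper are hypothetical stand-ins introduced for illustration (they are not part of udapi), and the sample tokens are made up. It applies the same condition as `process_node` to three tokens and shows that only a VERB-tagged "být" is retagged.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a udapi node; holds only the fields the rule reads."""
    form: str
    lemma: str
    upos: str


def fix_byt(node):
    """Retag "být" from VERB to AUX, using the same condition as process_node."""
    if node.upos == 'VERB' and node.lemma == 'být':
        node.upos = 'AUX'


# Made-up sample tokens, not taken from any treebank.
nodes = [
    Node('je', 'být', 'VERB'),   # "být" tagged VERB by a converted tagset: gets fixed
    Node('byl', 'být', 'AUX'),   # already AUX: left unchanged
    Node('má', 'mít', 'VERB'),   # different lemma: left unchanged
]
for node in nodes:
    fix_byt(node)

result = [(n.form, n.upos) for n in nodes]
```

Because the check requires both the VERB tag and the lemma "být", other verbs and already-correct tokens pass through untouched, so the block should be safe to run repeatedly on the same data.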

0 commit comments
