inconsistent output from get_tokenized_data #181

@andrea-fasoli

Describe the bug

The function that prepares data for DQ (get_tokenized_data in fms_mo/utils/calib_data.py) should be able to provide processed data from different datasets: wiki, c4, ptb, etc.
While the wiki option is functional and well integrated with DQ, the objects returned for other datasets have inconsistent types, which can cause DQ and/or eval to fail.

Expected behavior

Consistent return type of get_tokenized_data across all selected datasets.
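
One way to pin down this expectation is a small structural check that compares what each dataset option returns against the wiki result. The sketch below is only an illustration: `describe_structure` and `find_inconsistent_datasets` are hypothetical helpers, not part of fms_mo, and `loader` stands in for a wrapper around get_tokenized_data, whose actual arguments and return types are not shown in this issue.

```python
from typing import Any, Callable, Dict, Iterable


def describe_structure(obj: Any) -> Any:
    """Describe the container/leaf types of obj, ignoring the actual values."""
    if isinstance(obj, dict):
        return {"dict": {key: describe_structure(val) for key, val in obj.items()}}
    if isinstance(obj, tuple):
        # Tuples are treated as fixed-size records: every position is described.
        return {"tuple": [describe_structure(item) for item in obj]}
    if isinstance(obj, list):
        # Lists are treated as homogeneous: only the first element is inspected.
        inner = describe_structure(obj[0]) if obj else None
        return {"list": inner}
    # Leaf value (int, str, tensor, ...): record only its type name.
    return type(obj).__name__


def find_inconsistent_datasets(
    loader: Callable[[str], Any],
    dataset_names: Iterable[str],
    reference: str = "wiki",
) -> Dict[str, Any]:
    """Return {dataset_name: structure} for datasets that differ from the reference.

    `loader` is a hypothetical callable mapping a dataset name to whatever
    the data-preparation function returns for that dataset.
    """
    expected = describe_structure(loader(reference))
    mismatches = {}
    for name in dataset_names:
        actual = describe_structure(loader(name))
        if actual != expected:
            mismatches[name] = actual
    return mismatches
```

An empty result from `find_inconsistent_datasets` would mean every listed dataset returns the same structure as the reference. Using wiki as the reference follows the report that it is the option known to work with DQ.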
