Automatic deduplication works well (#25). However, when duplicates are found and removed, the datastore table and the resource file are no longer in sync.
Smarter dedup can be handled three ways. When dupes are found:
- Stop the DP+ job and show the dupe error in the Datastore tab.
- Replace the resource file with the dedupped CSV.
- Take advantage of `qsv dedup`'s `--dupes-output` option and create two new resources - RESOURCENAME_dupes.csv and RESOURCENAME_dedupped.csv - which are pushed to the Datastore. The original resource with dupes is NOT pushed. The Data Publisher can then use the CKAN interface to manage which resource to keep (e.g. delete the original and the _dupes resources, then rename the _dedupped resource to remove the _dedupped suffix).
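As a rough illustration of the third option, here is a minimal Python sketch of how the two output files could be derived from a resource file and passed to `qsv dedup`. The function names `dedup_plan` and `run_dedup` are hypothetical, and the sketch assumes `qsv` is on the PATH and that the deduplicated rows are written with `--output`. Creating the CKAN resources and pushing them to the Datastore is not shown.

```python
import subprocess
from pathlib import Path


def dedup_plan(resource_path):
    """Build the output paths and the qsv command for one resource file.

    Hypothetical helper: the naming follows the RESOURCENAME_dupes.csv /
    RESOURCENAME_dedupped.csv convention proposed above.
    """
    src = Path(resource_path)
    dedupped = src.with_name(f"{src.stem}_dedupped.csv")
    dupes = src.with_name(f"{src.stem}_dupes.csv")
    cmd = [
        "qsv", "dedup", str(src),
        "--dupes-output", str(dupes),   # removed duplicate rows go here
        "--output", str(dedupped),      # assumed flag for the deduplicated rows
    ]
    return dedupped, dupes, cmd


def run_dedup(resource_path):
    """Run qsv dedup and return the two files to publish as new resources."""
    dedupped, dupes, cmd = dedup_plan(resource_path)
    subprocess.run(cmd, check=True)  # raises CalledProcessError if qsv fails
    return dedupped, dupes
```

Keeping the path and command construction separate from the `subprocess` call makes the naming convention easy to check without having `qsv` installed.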