|
511 | 511 |
|
512 | 512 | ```make wreck_the_warc``` |
513 | 513 |
|
| 514 | +``` |
| 515 | +we will break and then fix this warc |
| 516 | +cp whirlwind.warc.gz testing.warc.gz |
| 517 | +rm -f testing.warc |
| 518 | +gunzip testing.warc.gz |
| 519 | +
|
| 520 | +iterate over this uncompressed warc: works |
| 521 | +python ./warcio-iterator.py testing.warc |
| 522 | + WARC-Type: warcinfo |
| 523 | + WARC-Type: request |
| 524 | + WARC-Target-URI https://an.wikipedia.org/wiki/Escopete |
| 525 | + WARC-Type: response |
| 526 | + WARC-Target-URI https://an.wikipedia.org/wiki/Escopete |
| 527 | + WARC-Type: metadata |
| 528 | + WARC-Target-URI https://an.wikipedia.org/wiki/Escopete |
| 529 | +
|
| 530 | +compress it the wrong way |
| 531 | +gzip testing.warc |
| 532 | +
|
| 533 | +iterating over this compressed warc fails |
| 534 | +python ./warcio-iterator.py testing.warc.gz || /usr/bin/true |
| 535 | + WARC-Type: warcinfo |
| 536 | +Traceback (most recent call last): |
| 537 | + File "/home/ccgreg/github/whirlwind-python/./warcio-iterator.py", line 9, in <module> |
| 538 | + for record in ArchiveIterator(stream): |
| 539 | + File "/home/ccgreg/venv/whirlwind/lib/python3.10/site-packages/warcio/archiveiterator.py", line 112, in _iterate_records |
| 540 | + self._raise_invalid_gzip_err() |
| 541 | + File "/home/ccgreg/venv/whirlwind/lib/python3.10/site-packages/warcio/archiveiterator.py", line 153, in _raise_invalid_gzip_err |
| 542 | + raise ArchiveLoadFailed(msg) |
| 543 | +warcio.exceptions.ArchiveLoadFailed: |
| 544 | + ERROR: non-chunked gzip file detected, gzip block continues |
| 545 | + beyond single record. |
| 546 | +
|
| 547 | + This file is probably not a multi-member gzip but a single gzip file. |
| 548 | +
|
| 549 | + To allow seek, a gzipped WARC must have each record compressed into |
| 550 | + a single gzip member and concatenated together. |
| 551 | +
|
| 552 | + This file is likely still valid and can be fixed by running: |
| 553 | +
|
| 554 | + warcio recompress <path/to/file> <path/to/new_file> |
| 555 | +
|
| 556 | +
|
| 557 | +
|
| 558 | +now let's do it the right way |
| 559 | +gunzip testing.warc.gz |
| 560 | +warcio recompress testing.warc testing.warc.gz |
| 561 | +4 records read and recompressed to file: testing.warc.gz |
| 562 | +No Errors Found! |
| 563 | +
|
| 564 | +and now iterating works |
| 565 | +python ./warcio-iterator.py testing.warc.gz |
| 566 | + WARC-Type: warcinfo |
| 567 | + WARC-Type: request |
| 568 | + WARC-Target-URI https://an.wikipedia.org/wiki/Escopete |
| 569 | + WARC-Type: response |
| 570 | + WARC-Target-URI https://an.wikipedia.org/wiki/Escopete |
| 571 | + WARC-Type: metadata |
| 572 | + WARC-Target-URI https://an.wikipedia.org/wiki/Escopete |
| 573 | +``` |
| 574 | + |
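To see why the plain `gzip` run breaks iteration while `warcio recompress` fixes it, here is a minimal sketch using only the Python standard library. It does not use warcio, and it is not how warcio is implemented. The byte strings are stand-ins for real WARC records, and the helper names (`compress_whole`, `compress_per_record`, `split_members`) are made up for this illustration. It compresses the same records both ways, then counts how many separate gzip members a reader can find.

```python
import gzip
import zlib

# Stand-in "records"; real WARC records carry headers and a payload.
records = [b"record one\n", b"record two\n", b"record three\n"]


def compress_whole(recs):
    """One gzip member spanning every record (what plain `gzip file` produces)."""
    return gzip.compress(b"".join(recs))


def compress_per_record(recs):
    """One gzip member per record, concatenated back to back."""
    return b"".join(gzip.compress(r) for r in recs)


def split_members(data):
    """Decompress each gzip member separately and return the payloads."""
    members = []
    while data:
        d = zlib.decompressobj(31)  # wbits=31: expect a gzip header and trailer
        members.append(d.decompress(data) + d.flush())
        data = d.unused_data  # bytes left over after this member ended
    return members


wrong = split_members(compress_whole(records))
right = split_members(compress_per_record(records))
print(len(wrong), len(right))
```

Both files decompress to the same bytes, but only the per-record version has a member boundary at every record, which is what lets a reader seek to one record and decompress it alone.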
514 | 575 | ## Coda |
515 | 576 |
|
516 | 577 | You have now finished this whirlwind tutorial. If anything |
|