Commit a7b2837

Author: Greg Lindahl

doc: add wreck the warc output

1 parent c10f363 commit a7b2837

1 file changed: README.md (61 additions & 0 deletions)

@@ -511,6 +511,67 @@ Run

```make wreck_the_warc```

+```
+we will break and then fix this warc
+cp whirlwind.warc.gz testing.warc.gz
+rm -f testing.warc
+gunzip testing.warc.gz
+
+iterate over this uncompressed warc: works
+python ./warcio-iterator.py testing.warc
+WARC-Type: warcinfo
+WARC-Type: request
+WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
+WARC-Type: response
+WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
+WARC-Type: metadata
+WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
+
+compress it the wrong way
+gzip testing.warc
+
+iterating over this compressed warc fails
+python ./warcio-iterator.py testing.warc.gz || /usr/bin/true
+WARC-Type: warcinfo
+Traceback (most recent call last):
+  File "/home/ccgreg/github/whirlwind-python/./warcio-iterator.py", line 9, in <module>
+    for record in ArchiveIterator(stream):
+  File "/home/ccgreg/venv/whirlwind/lib/python3.10/site-packages/warcio/archiveiterator.py", line 112, in _iterate_records
+    self._raise_invalid_gzip_err()
+  File "/home/ccgreg/venv/whirlwind/lib/python3.10/site-packages/warcio/archiveiterator.py", line 153, in _raise_invalid_gzip_err
+    raise ArchiveLoadFailed(msg)
+warcio.exceptions.ArchiveLoadFailed:
+ERROR: non-chunked gzip file detected, gzip block continues
+beyond single record.
+
+This file is probably not a multi-member gzip but a single gzip file.
+
+To allow seek, a gzipped WARC must have each record compressed into
+a single gzip member and concatenated together.
+
+This file is likely still valid and can be fixed by running:
+
+warcio recompress <path/to/file> <path/to/new_file>
+
+
+
+now let's do it the right way
+gunzip testing.warc.gz
+warcio recompress testing.warc testing.warc.gz
+4 records read and recompressed to file: testing.warc.gz
+No Errors Found!
+
+and now iterating works
+python ./warcio-iterator.py testing.warc.gz
+WARC-Type: warcinfo
+WARC-Type: request
+WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
+WARC-Type: response
+WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
+WARC-Type: metadata
+WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
+```
+
## Coda

You have now finished this whirlwind tutorial. If anything
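
The error message in the added output says that a gzipped WARC needs each record compressed into its own gzip member, with the members concatenated. The following is a minimal sketch of that difference using only the Python standard library. It is not how `warcio recompress` is implemented; the `records` payloads are hypothetical stand-ins for real WARC records, and `count_gzip_members` is a helper invented here for illustration.

```python
import gzip
import zlib

# Hypothetical stand-in payloads; real WARC records would go here.
records = [b"record one\n", b"record two\n", b"record three\n"]


def count_gzip_members(data: bytes) -> int:
    """Count gzip members by decompressing one member at a time."""
    members = 0
    while data:
        # wbits = 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        decomp.decompress(data)
        if not decomp.eof:
            raise ValueError("truncated gzip member")
        members += 1
        # Bytes after the end of this member belong to the next one.
        data = decomp.unused_data
    return members


# The "wrong way": one gzip member wrapping the whole file, as plain `gzip` does.
single_member = gzip.compress(b"".join(records))

# The "right way": each record is its own gzip member, concatenated together.
members = [gzip.compress(r) for r in records]
per_record = b"".join(members)

# Both layouts decompress to the same bytes.
assert gzip.decompress(single_member) == gzip.decompress(per_record)

print(count_gzip_members(single_member))  # 1
print(count_gzip_members(per_record))     # 3

# With per-record members, a known offset and length are enough to read
# a single record without decompressing anything before it.
offset = len(members[0])
length = len(members[1])
assert gzip.decompress(per_record[offset:offset + length]) == records[1]
```

The last assertion is the reason the layout matters: an index that stores a byte offset and length per record only works if a gzip member starts at that offset.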
