
Commit 29983d1

wumpus, Jason Grey, and Greg Lindahl authored
bug: fixes and doc:stuff
* added links for tools, and fix reference to warcio-iterator.py
* added link for SURT
* edited to finish thorugh on cdx fields, then talk about concept of zipnum/index servers
* git ignore collinfo.json
* fix collinfo.json spelling
* bug: fix for macos
* doc: tweaks

Co-authored-by: Jason Grey <jason.grey@warecorp.com>
Co-authored-by: Greg Lindahl <greg@commomncrawl.org>
1 parent 5e02863 commit 29983d1

4 files changed

Lines changed: 25 additions & 22 deletions


.gitignore

Lines changed: 1 addition & 0 deletions
@@ -5,3 +5,4 @@ TEST*.gz
 extraction.*
 testing.*
 whirlwind.parquet
+collinfo.json

Makefile

Lines changed: 1 addition & 1 deletion
@@ -85,7 +85,7 @@ wreck_the_warc:
 gzip testing.warc
 @echo
 @echo iterating over this compressed warc fails
-python ./warcio-iterator.py testing.warc.gz || /bin/true
+python ./warcio-iterator.py testing.warc.gz || /usr/bin/true
 @echo
 @echo "now let's do it the right way"
 gunzip testing.warc.gz
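
The README describes `warcio-iterator.py` as a very simple example of iterating over all the records in a warc file. The script itself isn't shown in this diff, so the following is only a sketch of what such an iterator typically looks like with warcio's `ArchiveIterator`:

```python
# sketch only -- not necessarily the repository's actual warcio-iterator.py
import sys

from warcio.archiveiterator import ArchiveIterator

with open(sys.argv[1], 'rb') as stream:
    for record in ArchiveIterator(stream):
        # print the record type, plus the WARC-Target-URI when the record has one
        uri = record.rec_headers.get_header('WARC-Target-URI')
        print(record.rec_type, uri or '')
```

Run against the deliberately mis-compressed `testing.warc.gz` above, an iterator like this fails with an error, which is why the Makefile tolerates the failure with `|| /usr/bin/true`.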

README.md

Lines changed: 22 additions & 21 deletions
@@ -10,7 +10,7 @@ columnar index.
 
 The object of this whirlwind tour is to show you how a single webpage
 appears in all of these different places. It uses python-based tools
-such as warcio, cdxj-indexer, cdx_toolkit, and duckdb.
+such as [warcio](https://github.com/webrecorder/warcio), [cdxj-indexer](https://github.com/webrecorder/cdxj-indexer), [cdx_toolkit](https://github.com/cocrawler/cdx_toolkit), and [duckdb](https://duckdb.org/).
 
 That webpage is [https://an.wikipedia.org/wiki/Escopete](https://an.wikipedia.org/wiki/Escopete),
 which we crawled on the date 2024-05-18T01:58:10Z.
@@ -93,7 +93,7 @@ The output has 3 sections, one each for the warc, wet, and wat. It prints
 the record types, you've seen these before. And for the record types
 that have an Target-URI as part of their warc headers, it prints that URI.
 
-Take a look at the program `warc-iterator.py`. It's a very simple example
+Take a look at the program `warcio-iterator.py`. It's a very simple example
 of how to iterate over all of the records in a warc file.
 
 ## Index warc, wet, and wat
@@ -111,27 +111,26 @@ indexed -- we're guessing that you won't ever want to random-access
 the request or metadata. wet and wat have the conversion and metadata
 records indexed.
 
+(Note: CCF doesn't publish a wet or wat index, just warc.)
+
 For each of these records, there's one text line in the index -- yes,
 it's a flat file! It starts with a string like
 `org,wikipedia,an)/wiki/escopete 20240518015810` followed by a json
 blob.
 
 The starting string is the primary key of the index. The first
-thing is a SURT (Sort-friendly URI Reordering Transform). The big integer
+thing is a [SURT](http://crawler.archive.org/articles/user_manual/glossary.html#surt)
+(Sort-friendly URI Reordering Transform). The big integer
 is a date, in ISO-8601 format with the delimiters removed.
 
-What is the purpose of this funky format? It's done this way because these
-flat files -- 300 gigabytes per crawl -- can be sorted on the primary key
-using the standard Linux `sort` utility. Then it's stored in a funky
-format called a zipnum, and that's how our cdx index server works.
-
-The json blob has enough information to extract individual records -- it
-says which warc the record is in, and the offset and length of the record.
-We'll use that in the next section.
+What is the purpose of this funky format? It's done this way because
+these flat files (300 gigabytes total per crawl) can be sorted on the
+primary key using any out-of-core sort utility -- like the standard
+Linux `sort`, or one of the Hadoop-based out-of-core sort functions.
 
-Wayback machines (such as the Internet Archive's wayback at
-https://web.archive.org/) are often powered by a cdx index
-server.
+The json blob has enough information to extract individual records --
+it says which warc file the record is in, and the offset and length of
+the record. We'll use that in the next section.
 
 ## Extract the raw content from local warc, wet, wat
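
Given the index line format described above (a SURT and timestamp as the primary key, then a json blob naming the warc file, offset, and length), here is a hedged sketch of pulling those pieces apart; the blob keys and values are invented for illustration, not copied from the real index:

```python
import json

# one cdxj index line; the json field values here are made up for this example
line = ('org,wikipedia,an)/wiki/escopete 20240518015810 '
        '{"filename": "example.warc.gz", "offset": "12345", "length": "678"}')

surt, timestamp, blob = line.split(' ', 2)
fields = json.loads(blob)
print(surt, timestamp, fields['filename'], fields['offset'], fields['length'])
```

Because the primary key sorts lexicographically, a plain `sort` over these lines groups every capture of a URL together, ordered by timestamp.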

@@ -170,14 +169,16 @@ TEST-000000.extracted.warc.gz, with this one record plus a warcinfo
 record explaining what this warc is. The Makefile target also runs
 cdxj-indexer on this new warc, and iterates over it.
 
-If you dig into the code of cdx_toolkit's code, you'll find that it is
-using the offset and length of the warc record, returned by the cdx
-index query, to make a http byte range request to S3 to download this
-single record.
+If you dig into cdx_toolkit's code, you'll find that it is using the
+offset and length of the warc record, returned by the cdx index query,
+to make a http byte range request to S3 to download this single warc
+record.
 
-It is only downloading the warc record, because our actual cdx index
-only has the response records in it. You cannot random access the
-wet and wat records due to lack of an index.
+It is only downloading the response warc record, because our
+cdx index only has the response records indexed. The public cannot
+random access the wet and wat records due to lack of an index. We do
+have these indexes internally though, if you need them. This may
+change in future if we make the wat/wet indexes public.
 
 ## The columnar index
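
To tie the cdx_toolkit paragraph above to concrete code: a sketch of the kind of http byte range request it makes, assuming the warc is served from the public https://data.commoncrawl.org/ endpoint; the filename, offset, and length below are placeholders, since the real values come back from the cdx index query:

```python
import io

import requests
from warcio.archiveiterator import ArchiveIterator

# placeholder values -- a real cdx index query returns these for each capture
filename = 'crawl-data/CC-MAIN-2024-22/segments/example/warc/example.warc.gz'
offset, length = 1234567, 8901

# fetch only the bytes of this one gzipped warc record
resp = requests.get('https://data.commoncrawl.org/' + filename,
                    headers={'Range': f'bytes={offset}-{offset + length - 1}'})
resp.raise_for_status()

# each indexed record is its own gzip member, so warcio can iterate it directly
for record in ArchiveIterator(io.BytesIO(resp.content)):
    print(record.rec_type, record.rec_headers.get_header('WARC-Target-URI'))
```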

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -5,3 +5,4 @@ duckdb
 pyarrow
 pandas
 polars
+cdxj-indexer
