* added links for tools, and fix reference to warcio-iterator.py
* added link for SURT
* edited to finish through on cdx fields, then talk about concept of zipnum/index servers
* git ignore collinfo.json
* fix collinfo.json spelling
* bug: fix for macos
* doc: tweaks
Co-authored-by: Jason Grey <jason.grey@warecorp.com>
Co-authored-by: Greg Lindahl <greg@commomncrawl.org>
README.md: 22 additions, 21 deletions
@@ -10,7 +10,7 @@ columnar index.
 
 The object of this whirlwind tour is to show you how a single webpage
 appears in all of these different places. It uses python-based tools
-such as warcio, cdxj-indexer, cdx_toolkit, and duckdb.
+such as [warcio](https://github.com/webrecorder/warcio), [cdxj-indexer](https://github.com/webrecorder/cdxj-indexer), [cdx_toolkit](https://github.com/cocrawler/cdx_toolkit), and [duckdb](https://duckdb.org/).
 
 That webpage is [https://an.wikipedia.org/wiki/Escopete](https://an.wikipedia.org/wiki/Escopete),
 which we crawled on the date 2024-05-18T01:58:10Z.
@@ -93,7 +93,7 @@ The output has 3 sections, one each for the warc, wet, and wat. It prints
 the record types, you've seen these before. And for the record types
 that have an Target-URI as part of their warc headers, it prints that URI.
 
-Take a look at the program `warc-iterator.py`. It's a very simple example
+Take a look at the program `warcio-iterator.py`. It's a very simple example
 of how to iterate over all of the records in a warc file.
 
 ## Index warc, wet, and wat
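The iteration that `warcio-iterator.py` performs (presumably via warcio's `ArchiveIterator`) can also be illustrated with a stdlib-only sketch of the underlying record structure. This is an illustration, not the repo's script: it only handles *uncompressed* warcs, while real crawl warcs are gzipped per-record, which warcio handles for you.

```python
import io

def iter_warc_records(stream):
    """Yield (headers_dict, payload_bytes) for each record in an
    uncompressed WARC stream. A record is a version line, header
    lines, a blank line, then Content-Length bytes of payload."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # skip the blank lines separating records
        assert version.strip().startswith(b'WARC/'), version
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # blank line ends the header block
            name, _, value = line.decode('utf-8').partition(':')
            headers[name.strip()] = value.strip()
        payload = stream.read(int(headers['Content-Length']))
        yield headers, payload

# A tiny synthetic record, just to exercise the parser:
raw = (b'WARC/1.0\r\n'
       b'WARC-Type: response\r\n'
       b'WARC-Target-URI: https://an.wikipedia.org/wiki/Escopete\r\n'
       b'Content-Length: 5\r\n'
       b'\r\n'
       b'hello\r\n\r\n')

for headers, payload in iter_warc_records(io.BytesIO(raw)):
    print(headers['WARC-Type'], headers['WARC-Target-URI'])
    # response https://an.wikipedia.org/wiki/Escopete
```

In practice, `warcio.archiveiterator.ArchiveIterator` is the right tool; the point here is just how little structure separates one record from the next.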
@@ -111,27 +111,26 @@ indexed -- we're guessing that you won't ever want to random-access
 the request or metadata. wet and wat have the conversion and metadata
 records indexed.
 
+(Note: CCF doesn't publish a wet or wat index, just warc.)
+
 For each of these records, there's one text line in the index -- yes,
 it's a flat file! It starts with a string like
 `org,wikipedia,an)/wiki/escopete 20240518015810` followed by a json
 blob.
 
 The starting string is the primary key of the index. The first
-thing is a SURT (Sort-friendly URI Reordering Transform). The big integer
+thing is a [SURT](http://crawler.archive.org/articles/user_manual/glossary.html#surt)
+(Sort-friendly URI Reordering Transform). The big integer
 is a date, in ISO-8601 format with the delimiters removed.
 
-What is the purpose of this funky format? It's done this way because these
-flat files -- 300 gigabytes per crawl -- can be sorted on the primary key
-using the standard Linux `sort` utility. Then it's stored in a funky
-format called a zipnum, and that's how our cdx index server works.
-
-The json blob has enough information to extract individual records -- it
-says which warc the record is in, and the offset and length of the record.
-We'll use that in the next section.
+What is the purpose of this funky format? It's done this way because
+these flat files (300 gigabytes total per crawl) can be sorted on the
+primary key using any out-of-core sort utility -- like the standard
+Linux `sort`, or one of the Hadoop-based out-of-core sort functions.
 
-Wayback machines (such as the Internet Archive's wayback at
-https://web.archive.org/) are often powered by a cdx index
-server.
+The json blob has enough information to extract individual records --
+it says which warc file the record is in, and the offset and length of
+the record. We'll use that in the next section.
 
 ## Extract the raw content from local warc, wet, wat
 
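The index line format described in that hunk can be unpacked with the standard library. The SURT below is a toy approximation (real SURT canonicalization has many more rules: ports, www-stripping, query sorting, and so on), and the json values are invented for illustration, though filename/offset/length are the fields the text is referring to:

```python
import json
from datetime import datetime
from urllib.parse import urlsplit

def toy_surt(url):
    # Very simplified SURT: reverse the host labels, join with commas,
    # close with ')', and lowercase the path.
    parts = urlsplit(url)
    host = ','.join(reversed(parts.hostname.split('.')))
    return host + ')' + parts.path.lower()

print(toy_surt('https://an.wikipedia.org/wiki/Escopete'))
# org,wikipedia,an)/wiki/escopete

# One index line: SURT key, timestamp, json blob (values invented here)
line = ('org,wikipedia,an)/wiki/escopete 20240518015810 '
        '{"filename": "example.warc.gz", "offset": "1234", "length": "5678"}')

surt_key, timestamp, blob = line.split(' ', 2)
fields = json.loads(blob)

# the "big integer" is an ISO-8601 datetime with the delimiters removed
when = datetime.strptime(timestamp, '%Y%m%d%H%M%S')
print(when.isoformat())
# 2024-05-18T01:58:10
```

Because the SURT sorts hostnames label-by-label and the timestamp sorts lexically, plain byte-order sorting of these lines groups all captures of a page together in time order.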
@@ -170,14 +169,16 @@ TEST-000000.extracted.warc.gz, with this one record plus a warcinfo
 record explaining what this warc is. The Makefile target also runs
 cdxj-indexer on this new warc, and iterates over it.
 
-If you dig into the code of cdx_toolkit's code, you'll find that it is
-using the offset and length of the warc record, returned by the cdx
-index query, to make a http byte range request to S3 to download this
-single record.
+If you dig into cdx_toolkit's code, you'll find that it is using the
+offset and length of the warc record, returned by the cdx index query,
+to make a http byte range request to S3 to download this single warc
+record.
 
-It is only downloading the warc record, because our actual cdx index
-only has the response records in it. You cannot random access the
-wet and wat records due to lack of an index.
+It is only downloading the response warc record, because our
+cdx index only has the response records indexed. The public cannot
+random access the wet and wat records due to lack of an index. We do
+have these indexes internally though, if you need them. This may
+change in future if we make the wat/wet indexes public.
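The byte-range trick described in that hunk can be sketched with the standard library. This is not cdx_toolkit's actual code, just the shape of the request it makes; the URL, offset, and length below are made-up values:

```python
from urllib.request import Request

def range_request(url, offset, length):
    """Build a GET limited to one warc record. HTTP byte ranges are
    inclusive on both ends, hence the -1."""
    end = offset + length - 1
    return Request(url, headers={'Range': f'bytes={offset}-{end}'})

req = range_request('https://data.example.org/crawl/example.warc.gz',
                    offset=1234, length=5678)
print(req.get_header('Range'))
# bytes=1234-6911
```

Since each warc record is its own gzip member, the bytes returned by such a request can be decompressed on their own, without fetching the rest of the (multi-gigabyte) warc file.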