Commit 85857c6

feat: filter records using jwarc, remove custom code
1 parent bcc8549 commit 85857c6

3 files changed

Lines changed: 4 additions & 390 deletions

README.md

Lines changed: 4 additions & 4 deletions
@@ -92,7 +92,7 @@ Now that we've looked at the uncompressed versions of these files to understand
 
 The [JWarc](https://github.com/iipc/jwarc) Java library lets us read and write WARC files both programmatically and via a CLI.
 
-You should download the [JWarc](https://github.com/iipc/jwarc)'s JAR using `make get_jwarc` which should download the JAR in the root directory.
+You should download the [JWarc](https://github.com/iipc/jwarc)'s JAR using `make jwarc.jar` which should download the JAR in the root directory.
 If you download it yourself, we recommend you to rename it to remove the version from the jar filename, so you can copy-paste the commands directly.
 You can now explore the CLI commands available by running:
 

@@ -434,16 +434,16 @@ We can create our own CDXJ index from the local WARCs by running:
 
 ```make cdxj```
 
-This uses the JWARC library and, partially, a home-cooked code that we wrote to support WET and WAT records, to generate CDXJ index files for our WARC files by running the code below:
+This uses the JWARC library to generate CDXJ index files for our WARC files by running the code below:
 
 <details>
 <summary>Click to view code</summary>
 
 ```
 creating *.cdxj index files from the local warcs
 java -jar jwarc.jar cdxj data/whirlwind.warc.gz > whirlwind.warc.cdxj
-mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.CdxjIndexer -Dexec.args="data/whirlwind.warc.wet.gz --records conversion" > whirlwind.warc.wet.cdxj
-mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.CdxjIndexer -Dexec.args="data/whirlwind.warc.wat.gz --records metadata" > whirlwind.warc.wat.cdxj
+java -jar jwarc.jar cdxj data/whirlwind.warc.wet.gz --record-type conversion > whirlwind.warc.wet.cdxj
+java -jar jwarc.jar cdxj data/whirlwind.warc.wat.gz --record-type metadata > whirlwind.warc.wat.cdxj
 ```
 
 </details>

src/main/java/org/commoncrawl/whirlwind/CdxWriterWithDynamicFiltering.java

Lines changed: 0 additions & 275 deletions
This file was deleted.
