WARC4Spark

This repository contains a Spark data source for WARC files, based on jwarc.

The data source supports column pruning and filter (or predicate) pushdown to improve read efficiency, and is able to natively read WARC, WET and WAT files.

Usage

Using the data source is as simple as choosing org.rubigdata.warc as the input format for a Spark DataFrame:

val df = spark
  .read
  .format("org.rubigdata.warc")
  .load(input)

The resulting DataFrame will have the following schema:

root
 |-- warcId: string (nullable = false)
 |-- warcType: string (nullable = false)
 |-- warcTargetUri: string (nullable = true)
 |-- warcDate: timestamp (nullable = false)
 |-- warcContentType: string (nullable = true)
 |-- warcHeaders: map (nullable = false)
 |    |-- key: string
 |    |-- value: array (valueContainsNull = false)
 |    |    |-- element: string (containsNull = false)
 |-- warcBody: string (nullable = false)
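This listing is what printSchema produces, so you can verify it against your own data:

df.printSchema()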

Similarly, if we want to parse the WARC record bodies as HTTP responses (for WARC records of type response), we can use:

val df = spark
  .read
  .format("org.rubigdata.warc")
  .option("parseHTTP", "true")
  .load(input)

This will result in a DataFrame with the following schema:

root
 |-- warcId: string (nullable = false)
 |-- warcType: string (nullable = false)
 |-- warcTargetUri: string (nullable = true)
 |-- warcDate: timestamp (nullable = false)
 |-- warcContentType: string (nullable = true)
 |-- warcHeaders: map (nullable = false)
 |    |-- key: string
 |    |-- value: array (valueContainsNull = false)
 |    |    |-- element: string (containsNull = false)
 |-- httpContentType: string (nullable = true)
 |-- httpHeaders: map (nullable = true)
 |    |-- key: string
 |    |-- value: array (valueContainsNull = false)
 |    |    |-- element: string (containsNull = false)
 |-- httpBody: string (nullable = true)

Predicate pushdown is supported for columns warcId, warcType, warcTargetUri, warcDate, warcContentType and httpContentType.
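For example, filters on warcType and httpContentType can be evaluated while scanning the files rather than afterwards. A minimal sketch, assuming the parseHTTP DataFrame df from above; use explain() to check which predicates actually reach the scan:

val htmlResponses = df
  .filter($"warcType" === "response")
  .filter($"httpContentType".contains("text/html"))

htmlResponses.explain()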

Parser

We currently support two WARC parsing libraries: jwarc and JWAT. jwarc is generally more efficient and follows the WARC standard more strictly; JWAT, on the other hand, parses more leniently and can be useful for processing WARC files that jwarc is unable to handle.

jwarc is the default parser, but you can switch to JWAT by supplying the parser option (see below).
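For example, to fall back to the more lenient JWAT parser (a minimal sketch, reusing the input path from the Usage section):

val df = spark
  .read
  .format("org.rubigdata.warc")
  .option("parser", "jwat")
  .load(input)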

Options

The WARC input format supports the following read options:

Option name          Type    Description
headersToLowerCase   bool    Convert all WARC and HTTP header keys to lowercase.
lenient              bool    Enable lenient WARC parsing (e.g. allowing invalid UTF-8).
parseHTTP            bool    Parse the WARC body as an HTTP response. Replaces the warcBody column with the httpContentType, httpHeaders and httpBody columns.
parser               string  Choose the WARC parser to use. Supported values: jwarc (default) and jwat.
splitGzip            bool    Treat gzip files as a concatenation of many smaller gzipped WARC records, so that Spark can split each file into multiple partitions.

It also supports the generic file source options, and can read Hive-partitioned paths like any other file-based data source.
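For instance, Spark's generic pathGlobFilter and recursiveFileLookup options work here just as for any other file source (a sketch; the glob pattern and path are placeholders):

val df = spark
  .read
  .format("org.rubigdata.warc")
  .option("recursiveFileLookup", "true")  // descend into subdirectories
  .option("pathGlobFilter", "*.warc.gz")  // only pick up gzipped WARC files
  .load("/data/crawl")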

Examples

Below we give some examples of how to work with the WARC data source:

Loading Dutch WARC response records

val nlResponseRecords = spark
  .read
  .format("org.rubigdata.warc")
  .load("/data/warc")
  .filter($"warcType" === "response")
  .filter($"warcTargetUri".contains(".nl"))
  .select($"warcTargetUri", $"warcBody")

nlResponseRecords.show()

This also nicely highlights the predicate pushdown and column pruning capabilities of the data source:

nlResponseRecords.explain()

This prints the following physical plan:

== Physical Plan ==
*(1) Project [warcTargetUri#106, warcBody#110]
+- BatchScan warc [/data/warc][warcTargetUri#106, warcBody#110] warc [/data/warc]
     PushedFilters: [IsNotNull(warcTargetUri), EqualTo(warcType,response), StringContains(warcTargetUri,.nl)],
     ReadSchema: struct<warcTargetUri:string,warcBody:string> RuntimeFilters: []

Extracting all titles from the HTML records using Jsoup

import org.apache.spark.sql.functions.udf
import org.jsoup.Jsoup

val extractTitle = udf( (html: String) => {
  val d = Jsoup.parse(html)
  d.title()
} )

val htmlTitles = spark
  .read
  .format("org.rubigdata.warc")
  .option("parseHTTP", true)
  .load("/data/warc")
  .filter($"warcType" === "response")
  .select($"warcTargetUri", extractTitle($"httpBody").as("title"))

htmlTitles.show()
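Note that httpContentType and httpBody are nullable, so not every response record carries an HTML body. A sketch of the same query restricted to HTML responses with a non-null body, reusing the extractTitle UDF from above:

val htmlTitlesSafe = spark
  .read
  .format("org.rubigdata.warc")
  .option("parseHTTP", true)
  .load("/data/warc")
  .filter($"warcType" === "response")
  .filter($"httpContentType".contains("text/html") && $"httpBody".isNotNull)
  .select($"warcTargetUri", extractTitle($"httpBody").as("title"))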

Get all the text from the WET files

spark
  .read
  .format("org.rubigdata.warc")
  .load("/data/wet")
  .filter($"warcType" === "conversion")
  .select($"warcTargetUri", $"warcBody".as("text"))
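A common follow-up is to tokenize this text, for instance a simple word count over the WET bodies (a sketch using standard Spark SQL functions):

import org.apache.spark.sql.functions.{explode, split, lower, desc}

val wordCounts = spark
  .read
  .format("org.rubigdata.warc")
  .load("/data/wet")
  .filter($"warcType" === "conversion")
  .select(explode(split(lower($"warcBody"), "\\s+")).as("word"))
  .groupBy("word")
  .count()
  .orderBy(desc("count"))

wordCounts.show()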

Extract all links on the page using the WAT files

Note that the WAT files contain quite complex, nested JSON data, so it can be tricky to work with that data in Spark.

import org.apache.spark.sql.functions.{get_json_object, from_json, schema_of_json}

// First we filter the WAT records so that we only keep the records
// associated with a WARC response record.
val wat = spark
  .read
  .format("org.rubigdata.warc")
  .load("/data/wat")
  .filter($"warcType" === "metadata")
  .select(
    $"warcTargetUri",
    get_json_object(
      $"warcBody",
      "$.Envelope.Payload-Metadata"
    ).as("payloadMetadata")
  )
  .filter(
    get_json_object(
      $"payloadMetadata",
      "$.Actual-Content-Type"
    ) === "application/http; msgtype=response"
  )
  .select(
    $"warcTargetUri",
    get_json_object(
      $"payloadMetadata",
      "$.HTTP-Response-Metadata.HTML-Metadata"
    ).as("metadata")
  )

// Spark needs an example JSON document to infer the schema
val jsonExample = wat.as[(String, String)].first._2

// We convert the JSON column into Spark nested types
val metadata = wat
  .select(
    $"warcTargetUri",
    from_json($"metadata", schema_of_json(jsonExample)).as("metadata")
  )

// Now we can directly query the `metadata` column
metadata
  .select($"warcTargetUri", $"metadata.Links.url".as("links"))
  .show()
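Since metadata.Links is an array of link objects, $"metadata.Links.url" yields an array of URLs per page; you can flatten it into one row per outgoing link (a sketch, assuming the inferred schema contains that field):

import org.apache.spark.sql.functions.explode

// One row per (page, outgoing link)
val links = metadata
  .select($"warcTargetUri", explode($"metadata.Links.url").as("link"))

links.show()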
