Toimik.WarcProtocol

.NET 8 C# WARC parser and writer.

Features

  • Parses Web ARChive (WARC) version 1.0 / 1.1 files, whether uncompressed or compressed (as a whole or per record), via the WarcParser class
  • Option to either terminate or resume processing upon encountering malformed content
  • Creates WARC records with auto-generated WARC-Block-Digest / WARC-Payload-Digest values using a configurable hash algorithm
  • Writes uncompressed or per-record compressed WARC version 1.0 / 1.1 files via the WarcWriter class
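
To make the digest feature more concrete, here is a minimal sketch of how a labelled digest value such as `sha1:...` can be computed with a caller-supplied hash algorithm. It is not the library's implementation: `DigestSketch`, `CreateLabelledDigest`, and the choice of unpadded Base32 encoding are assumptions made for illustration only.

```csharp
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper, not part of Toimik.WarcProtocol.
static class DigestSketch
{
    private const string Base32Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

    // Hashes the content with the supplied algorithm and returns "label:digest",
    // the general shape used by WARC-Block-Digest and WARC-Payload-Digest.
    public static string CreateLabelledDigest(string label, HashAlgorithm algorithm, byte[] content)
    {
        byte[] hash = algorithm.ComputeHash(content);
        return $"{label}:{ToBase32(hash)}";
    }

    // RFC 4648 Base32 without padding (assumed encoding for this sketch).
    private static string ToBase32(byte[] bytes)
    {
        var builder = new StringBuilder();
        int buffer = 0;
        int bits = 0;
        foreach (byte b in bytes)
        {
            buffer = (buffer << 8) | b;
            bits += 8;
            while (bits >= 5)
            {
                builder.Append(Base32Alphabet[(buffer >> (bits - 5)) & 31]);
                bits -= 5;
            }

            // Keep only the bits that have not been emitted yet
            buffer &= (1 << bits) - 1;
        }

        if (bits > 0)
        {
            builder.Append(Base32Alphabet[(buffer << (5 - bits)) & 31]);
        }

        return builder.ToString();
    }
}
```

Passing `SHA1.Create()` or `SHA256.Create()` as the `algorithm` argument switches the hash function without changing the rest of the code, which is the idea behind a configurable hash algorithm.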

Quick Start

Installation

Package Manager

PM> Install-Package Toimik.WarcProtocol

.NET CLI

> dotnet add package Toimik.WarcProtocol

WarcParser Usage

The WarcParser class can be used to parse WARC 1.0 and 1.1 files.

using System.Diagnostics;
using System.Threading.Tasks;
using Toimik.WarcProtocol;

class Program
{
    static async Task Main(string[] args)
    {
        var parser = new WarcParser();

        // Path to a WARC file that contains one or more records, which may be uncompressed or
        // compressed (as a whole or per record)
        var path = "uncompressed.warc";

        // Use a '.gz' file extension to indicate that the file is compressed using GZip
        // var path = "compressed.warc.gz";

        // In case any part of the WARC file is malformed and parsing is expected to resume -
        // rather than throwing an exception and terminating - a parse log is specified
        var parseLog = new DebugParseLog();

        // Parse the file and process the records accordingly
        var records = parser.Parse(path, parseLog);

        // Alternatively, use a stream:
        // var stream = ...
        // var isCompressed = ... // true if the stream is compressed; false otherwise
        // var records = parser.Parse(stream, isCompressed, parseLog);

        await foreach (Record record in records)
        {
            switch (record.Type)
            {
                case ContinuationRecord.TypeName:

                    // ...
                    break;

                case ConversionRecord.TypeName:

                    // ...
                    break;

                case MetadataRecord.TypeName:

                    // ...
                    break;

                case RequestRecord.TypeName:

                    // ...
                    break;

                case ResourceRecord.TypeName:

                    // ...
                    break;

                case ResponseRecord.TypeName:

                    // ...
                    break;

                case RevisitRecord.TypeName:

                    // ...
                    break;

                case WarcinfoRecord.TypeName:

                    // ...
                    break;

                case "custom-type":

                    // ...
                    break;
            }
        }
    }

    class DebugParseLog : IParseLog
    {
        public void ChunkSkipped(string chunk)
        {
            Debug.WriteLine(chunk);
        }

        public void ErrorEncountered(string error)
        {
            Debug.WriteLine(error);
        }
    }
}
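
When parsing from a stream, the caller has to supply the `isCompressed` flag. One possible way to determine it, sketched below, is to check for the two gzip magic bytes at the start of the stream. `GzipSniffer` and `LooksGzipped` are hypothetical names that are not part of the library, and the sketch assumes a seekable stream so that the position can be restored before parsing.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of Toimik.WarcProtocol.
static class GzipSniffer
{
    // Returns true when the stream starts with the gzip magic bytes 0x1F 0x8B.
    // The stream position is restored so the stream can still be parsed from its start.
    public static bool LooksGzipped(Stream stream)
    {
        if (!stream.CanSeek)
        {
            throw new NotSupportedException("A seekable stream is required to peek at the header.");
        }

        long start = stream.Position;
        int first = stream.ReadByte();
        int second = stream.ReadByte();
        stream.Position = start;

        return first == 0x1F && second == 0x8B;
    }
}
```

The result could then be passed as the `isCompressed` argument of the stream-based `Parse` overload shown in the comments above.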

WarcWriter Usage

The WarcWriter class can be used to write WARC records to a file.

using Toimik.WarcProtocol;

class Program
{
    static void Main(string[] args)
    {
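        // Create a new WARC file with per-record compression (implied by the '.gz' extension).
        // Wrap the writer in a "using" block to ensure that data is properly flushed
        // to the file, or call "Close()" directly.
        // The records written below are assumed to have been created beforehand.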
        using (WarcWriter warcWriter = new WarcWriter("example.warc.gz"))
        {
            warcWriter.Write(warcInfoRecord);
            warcWriter.Write(requestRecord);
            warcWriter.Write(responseRecord);
        }
        
        // You can also create uncompressed WARCs. This is controlled via the file extension.
        using (WarcWriter writer = new WarcWriter("uncompressed.warc"))
        {
            // ...
        }
        
        // You can also force per-record compression, regardless of file extension
        using (WarcWriter writer = new WarcWriter("actually-compressed.warc.whatever", true))
        {
            // ...
        }
    }
}
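
Per-record compression generally means that each record is stored as its own gzip member and the members are concatenated into one file. The sketch below illustrates that idea with the standard `GZipStream` class only. It does not use or reproduce WarcWriter, and `PerRecordGzipSketch`, `Compress`, and `ReadRecordAt` are hypothetical names.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical illustration, not part of Toimik.WarcProtocol.
static class PerRecordGzipSketch
{
    // Compresses each record as a separate gzip member and returns the archive bytes
    // together with the byte offset at which each member starts.
    public static (byte[] Archive, List<long> Offsets) Compress(IEnumerable<string> records)
    {
        var offsets = new List<long>();
        using var output = new MemoryStream();
        foreach (string record in records)
        {
            offsets.Add(output.Position);

            // leaveOpen keeps 'output' usable after this member has been finished
            using (var gzip = new GZipStream(output, CompressionMode.Compress, leaveOpen: true))
            {
                byte[] bytes = Encoding.UTF8.GetBytes(record);
                gzip.Write(bytes, 0, bytes.Length);
            }
        }

        return (output.ToArray(), offsets);
    }

    // Decompresses a single member by slicing the archive at the recorded offsets,
    // without touching any other record.
    public static string ReadRecordAt(byte[] archive, List<long> offsets, int index)
    {
        int start = (int)offsets[index];
        int end = index + 1 < offsets.Count ? (int)offsets[index + 1] : archive.Length;
        using var member = new MemoryStream(archive, start, end - start);
        using var gzip = new GZipStream(member, CompressionMode.Decompress);
        using var reader = new StreamReader(gzip, Encoding.UTF8);
        return reader.ReadToEnd();
    }
}
```

Because every member is independent, a reader that knows a record's offset can decompress that record alone, which is the usual reason for preferring per-record compression over compressing the file as a whole.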