warc

Version: v0.8.73
Warning

This package is not in the latest version of its module.

Published: Mar 3, 2025 License: CC0-1.0 Imports: 36 Imported by: 2

README

warc

WARNING: This project is no longer a work in progress and generates valid WARCs when used correctly, but it still needs to be carefully implemented and tested.

Introduction

warc provides methods for reading and writing WARC files in Go. This module is based on nlevitt's WARC module.

Install

go get github.com/CorentinB/warc

License

warc is released under the CC0 license. You can find a copy of the CC0 license in the LICENSE file.

Documentation

Index

Constants

This section is empty.

Variables

var (
	IPv6 *availableIPs
	IPv4 *availableIPs
)
var (

	// Create a counter to keep track of the number of bytes written to WARC files
	// and the number of bytes deduped
	DataTotal         *ratecounter.Counter
	RemoteDedupeTotal *ratecounter.Counter
	LocalDedupeTotal  *ratecounter.Counter
)
var CDXHTTPClient = http.Client{
	Timeout: 10 * time.Second,
	Transport: &http.Transport{
		Dial: (&net.Dialer{
			Timeout: 5 * time.Second,
		}).Dial,
		TLSHandshakeTimeout: 5 * time.Second,
	},
}

Functions

func GetNextIP added in v0.8.56

func GetNextIP(availableIPs *availableIPs) net.IP

func GetSHA1

func GetSHA1(r io.Reader) string

func GetSHA256 added in v0.8.37

func GetSHA256(r io.Reader) string

func GetSHA256Base16 added in v0.8.37

func GetSHA256Base16(r io.Reader) string

func NewDecompressionReader added in v0.8.41

func NewDecompressionReader(r io.Reader) (io.Reader, error)

NewDecompressionReader will return a new reader transparently doing decompression of GZip, BZip2, XZ, and ZStd.
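
The idea of transparent decompression can be illustrated with the standard library alone. The sketch below is not the package's implementation: sniffReader, gzipString, and decodeAll are hypothetical helpers, the detection by leading magic bytes is an assumption, and only GZip and BZip2 are covered because XZ and ZStd have no standard-library decoders.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/bzip2"
	"compress/gzip"
	"fmt"
	"io"
)

// sniffReader is a hypothetical helper, not part of the warc package.
// It peeks at the leading magic bytes and wraps r in a matching
// decompressor; unrecognized input is returned unchanged.
func sniffReader(r io.Reader) (io.Reader, error) {
	br := bufio.NewReader(r)
	magic, err := br.Peek(3)
	if err != nil && err != io.EOF {
		return nil, err
	}
	switch {
	case bytes.HasPrefix(magic, []byte{0x1f, 0x8b}): // gzip magic
		zr, err := gzip.NewReader(br)
		if err != nil {
			return nil, err
		}
		return zr, nil
	case bytes.HasPrefix(magic, []byte("BZh")): // bzip2 magic
		return bzip2.NewReader(br), nil
	default:
		return br, nil
	}
}

// gzipString compresses s in memory so the example has input to decode.
func gzipString(s string) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte(s))
	zw.Close()
	return buf.Bytes()
}

// decodeAll runs data through sniffReader and returns the decoded text.
func decodeAll(data []byte) string {
	r, err := sniffReader(bytes.NewReader(data))
	if err != nil {
		return "error: " + err.Error()
	}
	out, err := io.ReadAll(r)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(out)
}

func main() {
	fmt.Println(decodeAll(gzipString("WARC/1.1")))
	fmt.Println(decodeAll([]byte("plain text")))
}
```

Peeking through a bufio.Reader leaves the magic bytes in the stream, so the chosen decompressor still sees the complete input.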

Types

type CustomHTTPClient added in v0.7.0

type CustomHTTPClient struct {
	WaitGroup *WaitGroupWithCount

	ErrChan    chan *Error
	WARCWriter chan *RecordBatch

	http.Client
	TempDir                string
	WARCWriterDoneChannels []chan bool

	TLSHandshakeTimeout   time.Duration
	MaxReadBeforeTruncate int

	FullOnDisk bool

	// MaxRAMUsageFraction is the fraction of system RAM above which we'll force spooling to disk. For example, 0.5 = 50%.
	// If set to <= 0, the default value is DefaultMaxRAMUsageFraction.
	MaxRAMUsageFraction float64
	// contains filtered or unexported fields
}

func NewWARCWritingHTTPClient added in v0.7.0

func NewWARCWritingHTTPClient(HTTPClientSettings HTTPClientSettings) (httpClient *CustomHTTPClient, err error)

func (*CustomHTTPClient) Close added in v0.7.0

func (c *CustomHTTPClient) Close() error

func (*CustomHTTPClient) WriteRecord added in v0.8.50

func (c *CustomHTTPClient) WriteRecord(WARCTargetURI, WARCType, contentType, payloadString string, payloadReader io.Reader)

type DedupeOptions added in v0.8.0

type DedupeOptions struct {
	CDXURL        string
	CDXCookie     string
	SizeThreshold int
	LocalDedupe   bool
	CDXDedupe     bool
}

type Error added in v0.8.29

type Error struct {
	Err  error
	Func string
}

type HTTPClientSettings added in v0.8.14

type HTTPClientSettings struct {
	RotatorSettings       *RotatorSettings
	Proxy                 string
	TempDir               string
	DNSServer             string
	SkipHTTPStatusCodes   []int
	DNSServers            []string
	DedupeOptions         DedupeOptions
	DialTimeout           time.Duration
	ResponseHeaderTimeout time.Duration
	DNSResolutionTimeout  time.Duration
	DNSRecordsTTL         time.Duration
	DNSCacheSize          int
	TLSHandshakeTimeout   time.Duration
	TCPTimeout            time.Duration
	MaxReadBeforeTruncate int
	DecompressBody        bool
	FollowRedirects       bool
	FullOnDisk            bool
	MaxRAMUsageFraction   float64
	VerifyCerts           bool
	RandomLocalIP         bool
	DisableIPv4           bool
	DisableIPv6           bool
	IPv6AnyIP             bool
}
type Header

type Header map[string]string

Header provides information about the WARC record. It stores WARC record field names and their values. Since WARC field names are case-insensitive, the Header methods are case-insensitive as well.

func NewHeader

func NewHeader() Header

NewHeader creates a new WARC header.

func (Header) Del

func (h Header) Del(key string)

Del deletes the value associated with key.

func (Header) Get

func (h Header) Get(key string) string

Get returns the value associated with the given key. If there is no value associated with the key, Get returns "".

func (Header) Set

func (h Header) Set(key, value string)

Set sets the header field associated with key to value.
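
Case-insensitive access of this kind can be sketched by normalizing keys before every map operation. The ciHeader type below is a hypothetical stand-in, and lowercasing with strings.ToLower is an assumption; the package may normalize keys differently.

```go
package main

import (
	"fmt"
	"strings"
)

// ciHeader is a hypothetical stand-in for a case-insensitive header map.
// Every method lowercases the key, so lookups ignore the caller's casing.
type ciHeader map[string]string

func (h ciHeader) Set(key, value string) { h[strings.ToLower(key)] = value }

func (h ciHeader) Get(key string) string { return h[strings.ToLower(key)] }

func (h ciHeader) Del(key string) { delete(h, strings.ToLower(key)) }

func main() {
	h := ciHeader{}
	h.Set("WARC-Type", "response")
	fmt.Println(h.Get("warc-type"))

	h.Del("WARC-TYPE")
	fmt.Printf("%q\n", h.Get("WARC-Type"))
}
```

A missing key yields the zero value of string, which matches the documented behavior of Get returning "".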

type Reader

type Reader struct {
	// contains filtered or unexported fields
}

Reader stores the bufio.Reader and gzip.Reader for a WARC file.

func NewReader

func NewReader(reader io.ReadCloser) (*Reader, error)

NewReader returns a new WARC reader

func (*Reader) ReadRecord

func (r *Reader) ReadRecord() (*Record, bool, error)

ReadRecord reads the next record from the opened WARC file. It returns:

  • Record: if an error occurred, record **may be** nil. If eol is true, record **must be** nil.
  • bool (eol): if true, all records have been read successfully.
  • error: the error encountered, if any.
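
The three return values suggest a read loop that checks the error first and stops once eol is reported. The sketch below shows that loop against stand-in types: record, recordSource, and sliceSource are hypothetical and only mirror the documented (record, eol, error) shape, so the example runs without importing warc.

```go
package main

import (
	"errors"
	"fmt"
)

// record and recordSource are hypothetical stand-ins that mirror the
// documented ReadRecord return shape.
type record struct{ TargetURI string }

type recordSource interface {
	ReadRecord() (*record, bool, error)
}

// sliceSource serves records from memory and then reports eol.
// A nil entry simulates a malformed record.
type sliceSource struct {
	records []*record
	pos     int
}

func (s *sliceSource) ReadRecord() (*record, bool, error) {
	if s.pos >= len(s.records) {
		return nil, true, nil
	}
	rec := s.records[s.pos]
	s.pos++
	if rec == nil {
		return nil, false, errors.New("malformed record")
	}
	return rec, false, nil
}

// countRecords drains src. It returns early on the first error and
// returns normally once eol is reported.
func countRecords(src recordSource) (int, error) {
	n := 0
	for {
		rec, eol, err := src.ReadRecord()
		if err != nil {
			return n, err
		}
		if eol {
			return n, nil
		}
		_ = rec // process the record here
		n++
	}
}

func main() {
	src := &sliceSource{records: []*record{{TargetURI: "a"}, {TargetURI: "b"}}}
	n, err := countRecords(src)
	fmt.Println(n, err)
}
```

Checking the error before eol means a failure is never mistaken for a clean end of file.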

type Record

type Record struct {
	Header  Header
	Content spooledtempfile.ReadWriteSeekCloser
	Version string // WARC/1.0, WARC/1.1 ...
}

Record represents a WARC record.

func NewRecord

func NewRecord(tempDir string, fullOnDisk bool) *Record

NewRecord creates a new WARC record.

type RecordBatch

type RecordBatch struct {
	FeedbackChan chan struct{}
	CaptureTime  string
	Records      []*Record
}

RecordBatch is a structure that contains a bunch of records to be written at the same time, and a common capture timestamp. FeedbackChan is used to signal when the records have been written.
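
The feedback mechanism can be pictured as a producer that sends a batch over a channel and then blocks until the writer signals completion. The sketch below uses hypothetical stand-ins (batch, writeLoop, submitAndWait) and plain strings instead of records; how the package's own writer goroutine behaves is an assumption here.

```go
package main

import "fmt"

// batch is a hypothetical stand-in for a record batch with a feedback channel.
type batch struct {
	feedback chan struct{}
	records  []string
}

// writeLoop drains batches, "writes" them by appending to written, and
// signals each batch's feedback channel when one is set.
func writeLoop(in <-chan *batch, written *[]string, done chan<- bool) {
	for b := range in {
		*written = append(*written, b.records...)
		if b.feedback != nil {
			b.feedback <- struct{}{}
		}
	}
	done <- true
}

// submitAndWait sends one batch, waits for the feedback signal, then
// shuts the writer down and returns what was written.
func submitAndWait(records []string) []string {
	in := make(chan *batch)
	done := make(chan bool)
	var written []string
	go writeLoop(in, &written, done)

	b := &batch{feedback: make(chan struct{}, 1), records: records}
	in <- b
	<-b.feedback // blocks until the writer has handled this batch

	close(in)
	<-done // the writer has exited, so written is safe to read
	return written
}

func main() {
	fmt.Println(len(submitAndWait([]string{"request", "response"})))
}
```

Waiting on the done channel before reading the result avoids a data race between the writer goroutine and the caller.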

func NewRecordBatch

func NewRecordBatch(feedbackChan chan struct{}) *RecordBatch

NewRecordBatch creates a record batch and initializes its capture time.

type RotatorSettings

type RotatorSettings struct {
	// Content of the warcinfo record that will be written
	// to all WARC files
	WarcinfoContent Header
	// Prefix used for WARC filenames, WARC 1.1 specifications
	// recommend to name files this way:
	// Prefix-Timestamp-Serial-Crawlhost.warc.gz
	Prefix string
	// Compression algorithm to use
	Compression string
	// Path to a ZSTD compression dictionary to embed (and use) in .warc.zst files
	CompressionDictionary string
	// Directory where the created WARC files will be stored,
	// default will be the current directory
	OutputDirectory string
	// WarcSize is in Megabytes
	WarcSize float64
	// WARCWriterPoolSize defines the number of parallel WARC writers
	WARCWriterPoolSize int
}

RotatorSettings is used to store the settings needed by recordWriter to write WARC files

func NewRotatorSettings

func NewRotatorSettings() *RotatorSettings

NewRotatorSettings creates a RotatorSettings structure and initializes it with default values.

func (*RotatorSettings) NewWARCRotator

func (s *RotatorSettings) NewWARCRotator() (recordWriterChan chan *RecordBatch, doneChannels []chan bool, err error)

NewWARCRotator creates and returns a channel that can be used to send records to the recordWriter function, which runs in a goroutine and writes them to WARC files.

type WaitGroupWithCount added in v0.8.18

type WaitGroupWithCount struct {
	sync.WaitGroup
	// contains filtered or unexported fields
}

func (*WaitGroupWithCount) Add added in v0.8.18

func (wg *WaitGroupWithCount) Add(delta int)

func (*WaitGroupWithCount) Done added in v0.8.18

func (wg *WaitGroupWithCount) Done()

func (*WaitGroupWithCount) Size added in v0.8.18

func (wg *WaitGroupWithCount) Size() int
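
A wait group that also reports its size can be sketched by embedding sync.WaitGroup and tracking the counter separately. The countingWaitGroup type below is hypothetical, and the atomic counter is an assumption, since the unexported fields of WaitGroupWithCount are not shown.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countingWaitGroup is a hypothetical sketch of a WaitGroup that exposes
// how many units of work are still outstanding.
type countingWaitGroup struct {
	sync.WaitGroup
	count int64
}

// Add adjusts the visible counter, then the embedded WaitGroup.
func (wg *countingWaitGroup) Add(delta int) {
	atomic.AddInt64(&wg.count, int64(delta))
	wg.WaitGroup.Add(delta)
}

// Done decrements the visible counter before releasing the WaitGroup,
// so Size reads 0 once Wait has returned.
func (wg *countingWaitGroup) Done() {
	atomic.AddInt64(&wg.count, -1)
	wg.WaitGroup.Done()
}

// Size returns the number of outstanding units of work.
func (wg *countingWaitGroup) Size() int {
	return int(atomic.LoadInt64(&wg.count))
}

func main() {
	var wg countingWaitGroup
	wg.Add(3)
	fmt.Println(wg.Size())

	for i := 0; i < 3; i++ {
		go wg.Done()
	}
	wg.Wait()
	fmt.Println(wg.Size())
}
```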

type Writer

type Writer struct {
	GZIPWriter   *gzip.Writer
	ZSTDWriter   *zstd.Encoder
	FileWriter   *bufio.Writer
	FileName     string
	Compression  string
	ParallelGZIP bool
}

Writer writes WARC records to WARC files.

func NewWriter

func NewWriter(writer io.Writer, fileName string, compression string, contentLengthHeader string, newFileCreation bool, dictionary []byte) (*Writer, error)

NewWriter creates a new WARC writer.

func (*Writer) CloseCompressedWriter added in v0.8.20

func (w *Writer) CloseCompressedWriter() (err error)

func (*Writer) WriteInfoRecord

func (w *Writer) WriteInfoRecord(payload map[string]string) (recordID string, err error)

WriteInfoRecord can be used to write an information (warcinfo) record to the WARC file.

func (*Writer) WriteRecord

func (w *Writer) WriteRecord(r *Record) (recordID string, err error)

WriteRecord writes a record to the underlying WARC file. A record consists of a version string, the record header, a blank line, the record content block, and two trailing CRLF sequences:

Version CRLF
Header-Key: Header-Value CRLF
CRLF
Content
CRLF
CRLF
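
The layout above can be reproduced with a few string writes. The formatRecord function below is a hypothetical illustration rather than the package's writer: it takes the content as a string, and it sorts the header keys only to make the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatRecord is a hypothetical helper that serializes one record in the
// layout shown above: version line, header lines, blank line, content,
// and two trailing CRLF sequences.
func formatRecord(version string, header map[string]string, content string) string {
	keys := make([]string, 0, len(header))
	for k := range header {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order, for this example only

	var b strings.Builder
	b.WriteString(version + "\r\n")
	for _, k := range keys {
		b.WriteString(k + ": " + header[k] + "\r\n")
	}
	b.WriteString("\r\n") // blank line separating header from content
	b.WriteString(content)
	b.WriteString("\r\n\r\n") // record terminator
	return b.String()
}

func main() {
	rec := formatRecord("WARC/1.1", map[string]string{
		"WARC-Type":      "resource",
		"Content-Length": "5",
	}, "hello")
	fmt.Printf("%q\n", rec)
}
```

Printing with %q makes the CRLF sequences visible as \r\n in the output.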

Directories

Path Synopsis
pkg
