Documentation
Index ¶
- Variables
- func GetNextIP(availableIPs *availableIPs) net.IP
- func GetSHA1(r io.Reader) string
- func GetSHA256(r io.Reader) string
- func GetSHA256Base16(r io.Reader) string
- func NewDecompressionReader(r io.Reader) (io.Reader, error)
- type CustomHTTPClient
- type DedupeOptions
- type Error
- type HTTPClientSettings
- type Header
- type Reader
- type Record
- type RecordBatch
- type RotatorSettings
- type WaitGroupWithCount
- type Writer
Constants ¶
This section is empty.
Variables ¶
var (
	IPv6 *availableIPs
	IPv4 *availableIPs
)
var (
	// Create a counter to keep track of the number of bytes written to WARC files
	// and the number of bytes deduped
	DataTotal         *ratecounter.Counter
	RemoteDedupeTotal *ratecounter.Counter
	LocalDedupeTotal  *ratecounter.Counter
)
Functions ¶
func GetSHA256Base16 ¶ added in v0.8.37
Types ¶
type CustomHTTPClient ¶ added in v0.7.0
type CustomHTTPClient struct {
	WaitGroup  *WaitGroupWithCount
	ErrChan    chan *Error
	WARCWriter chan *RecordBatch
	http.Client
	TempDir                string
	WARCWriterDoneChannels []chan bool
	TLSHandshakeTimeout    time.Duration
	MaxReadBeforeTruncate  int
	FullOnDisk             bool
	// MaxRAMUsageFraction is the fraction of system RAM above which we'll force spooling to disk. For example, 0.5 = 50%.
	// If set to <= 0, the default value is DefaultMaxRAMUsageFraction.
	MaxRAMUsageFraction float64
	// contains filtered or unexported fields
}
func NewWARCWritingHTTPClient ¶ added in v0.7.0
func NewWARCWritingHTTPClient(HTTPClientSettings HTTPClientSettings) (httpClient *CustomHTTPClient, err error)
func (*CustomHTTPClient) Close ¶ added in v0.7.0
func (c *CustomHTTPClient) Close() error
func (*CustomHTTPClient) WriteRecord ¶ added in v0.8.50
func (c *CustomHTTPClient) WriteRecord(WARCTargetURI, WARCType, contentType, payloadString string, payloadReader io.Reader)
type DedupeOptions ¶ added in v0.8.0
type HTTPClientSettings ¶ added in v0.8.14
type HTTPClientSettings struct {
	RotatorSettings       *RotatorSettings
	Proxy                 string
	TempDir               string
	DNSServer             string
	SkipHTTPStatusCodes   []int
	DNSServers            []string
	DedupeOptions         DedupeOptions
	DialTimeout           time.Duration
	ResponseHeaderTimeout time.Duration
	DNSResolutionTimeout  time.Duration
	DNSRecordsTTL         time.Duration
	DNSCacheSize          int
	TLSHandshakeTimeout   time.Duration
	TCPTimeout            time.Duration
	MaxReadBeforeTruncate int
	DecompressBody        bool
	FollowRedirects       bool
	FullOnDisk            bool
	MaxRAMUsageFraction   float64
	VerifyCerts           bool
	RandomLocalIP         bool
	DisableIPv4           bool
	DisableIPv6           bool
	IPv6AnyIP             bool
}
type Header ¶
Header provides information about the WARC record. It stores WARC record field names and their values. Since WARC field names are case-insensitive, the Header methods are case-insensitive as well.
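The case-insensitive behaviour described above can be illustrated with a minimal sketch. The `header` type and its `Set` and `Get` methods here are hypothetical stand-ins written for this example; they are not the package's actual Header API, whose method set is not shown on this page.

```go
package main

import (
	"fmt"
	"strings"
)

// header is a hypothetical stand-in for a WARC header map.
// Keys are normalized to lower case so that lookups ignore case.
type header map[string]string

// Set stores value under a case-normalized key.
func (h header) Set(key, value string) {
	h[strings.ToLower(key)] = value
}

// Get returns the value for key regardless of the caller's casing.
// It returns the empty string when the field is absent.
func (h header) Get(key string) string {
	return h[strings.ToLower(key)]
}

func main() {
	h := header{}
	h.Set("WARC-Type", "response")
	// Both spellings resolve to the same field.
	fmt.Println(h.Get("warc-type"), h.Get("WARC-TYPE"))
}
```

Normalizing on write and on read keeps a single canonical key per field, so two spellings of the same field name can never coexist in the map.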
type Reader ¶
type Reader struct {
// contains filtered or unexported fields
}
Reader stores the bufio.Reader and gzip.Reader for a WARC file.
func NewReader ¶
func NewReader(reader io.ReadCloser) (*Reader, error)
NewReader returns a new WARC reader
func (*Reader) ReadRecord ¶
ReadRecord reads the next record from the opened WARC file. It returns:
- Record: the record that was read. If an error occurred, the record **may be** nil; if eol is true, the record **must be** nil.
- bool (eol): true once all records have been read successfully.
- error: any error encountered while reading.
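A read loop that respects these three return values might look like the sketch below. Because the exact ReadRecord signature is not shown on this page, the example uses hypothetical stand-ins (`record`, `sliceReader`, `readAll`) that only mimic the documented return shape; treat it as an illustration of the control flow, not of the library itself.

```go
package main

import (
	"errors"
	"fmt"
)

// record is a hypothetical stand-in for a WARC record.
type record struct{ Version string }

// sliceReader is a hypothetical reader that serves records from memory.
// A nil entry simulates a record that fails to parse.
type sliceReader struct {
	records []*record
	pos     int
}

// ReadRecord mimics the documented return shape: (record, eol, error).
func (r *sliceReader) ReadRecord() (*record, bool, error) {
	if r.pos >= len(r.records) {
		return nil, true, nil // eol: the record must be nil
	}
	rec := r.records[r.pos]
	r.pos++
	if rec == nil {
		return nil, false, errors.New("malformed record")
	}
	return rec, false, nil
}

// readAll drains the reader. It checks the error first, then eol,
// and only then touches the record, which is non-nil at that point.
func readAll(r *sliceReader) ([]string, error) {
	var versions []string
	for {
		rec, eol, err := r.ReadRecord()
		if err != nil {
			return versions, err
		}
		if eol {
			return versions, nil
		}
		versions = append(versions, rec.Version)
	}
}

func main() {
	r := &sliceReader{records: []*record{{Version: "WARC/1.0"}, {Version: "WARC/1.1"}}}
	versions, err := readAll(r)
	fmt.Println(versions, err)
}
```

Checking the error before eol matters because the record may be nil in either case, so dereferencing it first would risk a nil pointer panic.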
type Record ¶
type Record struct {
	Header  Header
	Content spooledtempfile.ReadWriteSeekCloser
	Version string // WARC/1.0, WARC/1.1 ...
}
Record represents a WARC record.
type RecordBatch ¶
RecordBatch is a structure that contains a bunch of records to be written at the same time, and a common capture timestamp. FeedbackChan is used to signal when the records have been written.
func NewRecordBatch ¶
func NewRecordBatch(feedbackChan chan struct{}) *RecordBatch
NewRecordBatch creates a record batch and initializes its capture time.
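The feedback-channel handshake can be sketched with plain channels. Everything in this example is a hypothetical stand-in: the `batch` type, its `Records` and `CaptureTime` fields, the RFC 3339 timestamp format, and the `writeBatches` and `submit` helpers are assumptions made for illustration, since the real RecordBatch fields are not listed on this page.

```go
package main

import (
	"fmt"
	"time"
)

// batch is a hypothetical stand-in for a record batch: a group of
// records, a shared capture time, and a channel signalled after writing.
type batch struct {
	Records      []string
	CaptureTime  string
	FeedbackChan chan struct{}
}

// newBatch stamps the capture time at creation (format is an assumption).
func newBatch(feedback chan struct{}) *batch {
	return &batch{
		CaptureTime:  time.Now().UTC().Format(time.RFC3339),
		FeedbackChan: feedback,
	}
}

// writeBatches stands in for a writer goroutine: it counts the records of
// each batch, signals FeedbackChan when set, and reports the total once
// the input channel is closed.
func writeBatches(in <-chan *batch, written chan<- int) {
	total := 0
	for b := range in {
		total += len(b.Records)
		if b.FeedbackChan != nil {
			b.FeedbackChan <- struct{}{}
		}
	}
	written <- total
}

// submit sends one batch and blocks until the writer acknowledges it.
func submit(in chan<- *batch, records ...string) {
	feedback := make(chan struct{}, 1)
	b := newBatch(feedback)
	b.Records = records
	in <- b
	<-feedback
}

func main() {
	in := make(chan *batch)
	written := make(chan int)
	go writeBatches(in, written)

	submit(in, "request", "response")
	close(in)
	fmt.Println("records written:", <-written)
}
```

Waiting on the feedback channel lets the caller release or reuse resources tied to the records only after the writer has finished with them.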
type RotatorSettings ¶
type RotatorSettings struct {
	// Content of the warcinfo record that will be written
	// to all WARC files
	WarcinfoContent Header
	// Prefix used for WARC filenames, WARC 1.1 specifications
	// recommend to name files this way:
	// Prefix-Timestamp-Serial-Crawlhost.warc.gz
	Prefix string
	// Compression algorithm to use
	Compression string
	// Path to a ZSTD compression dictionary to embed (and use) in .warc.zst files
	CompressionDictionary string
	// Directory where the created WARC files will be stored,
	// default will be the current directory
	OutputDirectory string
	// WarcSize is in Megabytes
	WarcSize float64
	// WARCWriterPoolSize defines the number of parallel WARC writers
	WARCWriterPoolSize int
}
RotatorSettings is used to store the settings needed by recordWriter to write WARC files
func NewRotatorSettings ¶
func NewRotatorSettings() *RotatorSettings
NewRotatorSettings creates a RotatorSettings structure and initializes it with default values.
func (*RotatorSettings) NewWARCRotator ¶
func (s *RotatorSettings) NewWARCRotator() (recordWriterChan chan *RecordBatch, doneChannels []chan bool, err error)
NewWARCRotator creates and returns a channel used to send records to the recordWriter function, which runs in a goroutine and writes them to WARC files.
type WaitGroupWithCount ¶ added in v0.8.18
func (*WaitGroupWithCount) Add ¶ added in v0.8.18
func (wg *WaitGroupWithCount) Add(delta int)
func (*WaitGroupWithCount) Done ¶ added in v0.8.18
func (wg *WaitGroupWithCount) Done()
func (*WaitGroupWithCount) Size ¶ added in v0.8.18
func (wg *WaitGroupWithCount) Size() int
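One way a wait group that also reports its size could be built is sketched below. This is an assumption about the general technique (a `sync.WaitGroup` paired with an atomic counter), not the package's actual implementation; the `countingWaitGroup` name and its `Wait` method are hypothetical and do not appear in the listing above.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countingWaitGroup is a hypothetical wait group that tracks how many
// units of work are still outstanding.
type countingWaitGroup struct {
	wg    sync.WaitGroup
	count int64
}

// Add registers delta units of work.
func (c *countingWaitGroup) Add(delta int) {
	atomic.AddInt64(&c.count, int64(delta))
	c.wg.Add(delta)
}

// Done marks one unit of work as finished.
func (c *countingWaitGroup) Done() {
	atomic.AddInt64(&c.count, -1)
	c.wg.Done()
}

// Size reports the number of outstanding units.
func (c *countingWaitGroup) Size() int {
	return int(atomic.LoadInt64(&c.count))
}

// Wait blocks until every unit is done (added for this sketch only).
func (c *countingWaitGroup) Wait() {
	c.wg.Wait()
}

func main() {
	var wg countingWaitGroup
	wg.Add(2)
	fmt.Println("outstanding:", wg.Size())

	for i := 0; i < 2; i++ {
		go wg.Done()
	}
	wg.Wait()
	fmt.Println("outstanding:", wg.Size())
}
```

The separate counter exists because `sync.WaitGroup` does not expose its internal count, which is useful when a caller wants to report or throttle on the amount of in-flight work.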
type Writer ¶
type Writer struct {
	GZIPWriter   *gzip.Writer
	ZSTDWriter   *zstd.Encoder
	FileWriter   *bufio.Writer
	FileName     string
	Compression  string
	ParallelGZIP bool
}
Writer writes WARC records to WARC files.
func NewWriter ¶
func NewWriter(writer io.Writer, fileName string, compression string, contentLengthHeader string, newFileCreation bool, dictionary []byte) (*Writer, error)
NewWriter creates a new WARC writer.
func (*Writer) CloseCompressedWriter ¶ added in v0.8.20
func (*Writer) WriteInfoRecord ¶
WriteInfoRecord writes an information (warcinfo) record to the WARC file.
func (*Writer) WriteRecord ¶
WriteRecord writes a record to the underlying WARC file. A record consists of a version string, the record header, a blank line, the record content block, and two CRLF line endings:
Version CRLF
Header-Key: Header-Value CRLF
CRLF
Content CRLF
CRLF
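The record layout can be made concrete with a small serializer. The `formatRecord` function below is a hypothetical helper written for this illustration and is not the package's WriteRecord method; sorting the header keys is likewise an assumption made only to keep the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatRecord is an illustrative serializer for the layout described
// above: version line, header lines, blank line, content, two CRLFs.
func formatRecord(version string, headers map[string]string, content string) string {
	const crlf = "\r\n"
	var sb strings.Builder

	sb.WriteString(version + crlf)

	// Sort keys so the output is stable (an assumption of this sketch;
	// Go map iteration order is otherwise random).
	keys := make([]string, 0, len(headers))
	for k := range headers {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		sb.WriteString(k + ": " + headers[k] + crlf)
	}

	sb.WriteString(crlf) // blank line terminates the header
	sb.WriteString(content)
	sb.WriteString(crlf + crlf) // two CRLFs terminate the record
	return sb.String()
}

func main() {
	rec := formatRecord("WARC/1.1", map[string]string{
		"WARC-Type":      "resource",
		"Content-Length": "5",
	}, "hello")
	// %q makes the CRLF sequences visible.
	fmt.Printf("%q\n", rec)
}
```

Printing with `%q` is a convenient way to check that every line ends in `\r\n` rather than a bare `\n`, which is an easy mistake to make when building records by hand.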