content

package
v1.0.0 Latest
Warning

This package is not in the latest version of its module.

Published: Feb 24, 2025 License: Apache-2.0 Imports: 23 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func CleanString

func CleanString(content, langCode string) string

Types

type Basic

type Basic struct {
	// contains filtered or unexported fields
}

Basic is the simplest Extractor implementation.

func NewBasicExtractor

func NewBasicExtractor(logger log.Logger) (*Basic, error)

NewBasicExtractor creates a new Basic instance.

func (Basic) Extract

func (Basic) Extract(ctx context.Context, ri *provider.ResourceInfo) (Document, error)

Extract rearranges the inputs and processes them into a Document.

type Document

type Document struct {
	Title    string
	Name     string
	Content  string
	Size     uint64
	Mtime    string
	MimeType string
	Tags     []string
	Audio    *libregraph.Audio          `json:"audio,omitempty"`
	Image    *libregraph.Image          `json:"image,omitempty"`
	Location *libregraph.GeoCoordinates `json:"location,omitempty"`
	Photo    *libregraph.Photo          `json:"photo,omitempty"`
}

Document wraps all resource meta fields. It is used as the result of a content extraction.
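Only the four pointer fields carry `json` tags with `omitempty`, so with the standard `encoding/json` package they are dropped when nil, while the untagged fields are always emitted under their Go field names. A minimal sketch of that behaviour follows. The `Audio` type here is a hypothetical stand-in for `libregraph.Audio`, and the struct is trimmed to a few fields so the example compiles without external modules.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Audio is a hypothetical stand-in for libregraph.Audio.
type Audio struct {
	Artist string `json:"artist,omitempty"`
}

// Document is a trimmed copy of the Document struct shown above.
type Document struct {
	Title   string
	Name    string
	Content string
	Size    uint64
	Tags    []string
	Audio   *Audio `json:"audio,omitempty"`
}

// encode marshals a Document into its JSON string form.
func encode(d Document) (string, error) {
	b, err := json.Marshal(d)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// A nil Audio pointer is omitted from the output entirely.
	plain, err := encode(Document{Name: "notes.txt", Size: 12})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(plain)

	// A non-nil Audio pointer appears under the lowercase "audio" key.
	song, err := encode(Document{Name: "song.mp3", Audio: &Audio{Artist: "someone"}})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(song)
}
```

Note that a nil `Tags` slice is still emitted (as `null`), because that field has no `omitempty` tag.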

type Extractor

type Extractor interface {
	Extract(ctx context.Context, ri *provider.ResourceInfo) (Document, error)
}

Extractor is responsible for extracting content and meta information from documents.

type Retriever

type Retriever interface {
	Retrieve(ctx context.Context, rID *provider.ResourceId) (io.ReadCloser, error)
}

Retriever is the interface that wraps the basic Retrieve method. 🐕 It requests and then returns a resource from the underlying storage.

type Tika

type Tika struct {
	*Basic
	Retriever

	ContentExtractionSizeLimit uint64
	CleanStopWords             bool
	// contains filtered or unexported fields
}

Tika is used to extract content from a resource. It uses Apache Tika to retrieve all the data.

func NewTikaExtractor

func NewTikaExtractor(gatewaySelector pool.Selectable[gateway.GatewayAPIClient], logger log.Logger, cfg *config.Config) (*Tika, error)

NewTikaExtractor creates a new Tika instance.

func (Tika) Extract

func (t Tika) Extract(ctx context.Context, ri *provider.ResourceInfo) (Document, error)

Extract loads a resource from its underlying storage, passes it to Tika, and processes the result into a Document.
