baleen

v0.2.3
Warning

This package is not in the latest version of its module.

Published: Jun 13, 2023 License: AGPL-3.0 Imports: 20 Imported by: 0

README

Baleen

An automated ingestion service of RSS feeds to construct a corpus for NLP research.

Current overview:

  • Golang ingestion system that fetches RSS feeds and stores raw data into S3 for archive and analytics.
  • Web-based RSS feed management system that will allow us to easily manage sources.
  • Focus on fetching full text by following links in the RSS feed.
  • Feed data quality measurements with language statistics, e.g. word counts, vocabulary size, rate of corpus growth, and number of entities (we should look at prose for this).
  • JSON-based logging with limited retention so we don’t fill up our server with logs, plus tracking of aggregate metrics over time so we know what’s going on and whether it's working.
  • Produce model-based translations for sentences and paragraphs from the source language to target languages; crowdsource feedback by creating an app that lets bilingual users say whether a translation is good, in order to establish annotations.
  • Annotation quality assessment tools and gamification.
  • Estimated cost with a 3-year reserved instance: $64.04 per month (mostly EBS).

Notes

Documentation

Overview

Package baleen is the top-level library of the baleen language ingestion service. It provides the primary components for running the service as a long-running background daemon, including the main service itself, configuration, and other utilities.

Constants

const (
	TopicSubscriptions = "subscriptions"
	TopicFeeds         = "feeds"
	TopicDocuments     = "documents"
)

Names of available topics

const (
	VersionMajor         = 0
	VersionMinor         = 2
	VersionPatch         = 3
	VersionReleaseLevel  = "beta"
	VersionReleaseNumber = 5
)

Version component constants for the current build.

Variables

var (
	ErrUnhandledType = errors.New("ensign type not handled")
	ErrUnhandledMIME = errors.New("ensign mimetype not handled")
)
var GitVersion string

Set the GitVersion via -ldflags="-X 'github.com/rotationalio/baleen/pkg.GitVersion=$(git rev-parse --short HEAD)'"

Functions

func CreateEnsignPublisher

func CreateEnsignPublisher(conf config.EnsignConfig, logger watermill.LoggerAdapter) (message.Publisher, error)

func CreateEnsignSubscriber

func CreateEnsignSubscriber(conf config.EnsignConfig, logger watermill.LoggerAdapter) (message.Subscriber, error)

func CreateKafkaPublisher

func CreateKafkaPublisher(conf config.KafkaConfig, logger watermill.LoggerAdapter) (message.Publisher, error)

func CreateKafkaSubscriber

func CreateKafkaSubscriber(conf config.KafkaConfig, logger watermill.LoggerAdapter) (message.Subscriber, error)

func PostFetch

func PostFetch(msg *message.Message) (_ []*message.Message, err error)

func TypeFilter

func TypeFilter(mime string, etypes ...string) message.HandlerMiddleware

func Version

func Version() string

Version returns the semantic version for the current build.
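As a rough illustration of how the version constants and GitVersion could combine into a single string, here is a minimal sketch. The helper formatVersion and its output layout are assumptions made for this example; the documentation does not show the format that Version actually returns.

```go
package main

import "fmt"

// Values copied from the version constants documented above.
const (
	versionMajor         = 0
	versionMinor         = 2
	versionPatch         = 3
	versionReleaseLevel  = "beta"
	versionReleaseNumber = 5
)

// gitVersion stands in for the package-level GitVersion variable. It is
// empty unless set at build time with the -ldflags "-X ..." option.
var gitVersion string

// formatVersion is a hypothetical helper, not the package's own code.
// It joins the components as MAJOR.MINOR.PATCH-LEVEL.NUMBER and appends
// the git revision in parentheses when one is available.
func formatVersion(major, minor, patch int, level string, number int, rev string) string {
	v := fmt.Sprintf("%d.%d.%d", major, minor, patch)
	if level != "" {
		v = fmt.Sprintf("%s-%s.%d", v, level, number)
	}
	if rev != "" {
		v = fmt.Sprintf("%s (%s)", v, rev)
	}
	return v
}

func main() {
	// Prints "0.2.3-beta.5" because gitVersion was not set at build time.
	fmt.Println(formatVersion(versionMajor, versionMinor, versionPatch,
		versionReleaseLevel, versionReleaseNumber, gitVersion))
}
```

Keeping the revision in a plain string variable is what lets the linker's -X flag overwrite it without any code change.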

Types

type Baleen

type Baleen struct {
	// contains filtered or unexported fields
}

Baleen is essentially a wrapper for a watermill router that configures different event handlers depending on the context of the process. Calling Run() will start the Baleen service, which will handle incoming events and dispatch new events.

func New

func New(conf config.Config) (svc *Baleen, err error)

func (*Baleen) AddFeedSync

func (s *Baleen) AddFeedSync(conf config.FeedSyncConfig, publisher message.Publisher) (err error)

func (*Baleen) AddPostFetch

func (s *Baleen) AddPostFetch(conf config.PostFetchConfig) error

func (*Baleen) Close

func (s *Baleen) Close() error

func (*Baleen) Run

func (s *Baleen) Run(ctx context.Context) error

type Feed

type Feed struct {
	// contains filtered or unexported fields
}

func (*Feed) Sync

func (f *Feed) Sync() (msgs []*message.Message, err error)

Sync the feed and return the FeedItem events to publish

type FeedSync

type FeedSync struct {
	// contains filtered or unexported fields
}

func NewFeedSync

func NewFeedSync(conf config.FeedSyncConfig, publisher message.Publisher) (*FeedSync, error)

func (*FeedSync) Handle

func (f *FeedSync) Handle(msg *message.Message) (_ []*message.Message, err error)

func (*FeedSync) Start

func (f *FeedSync) Start(r *message.Router) error

func (*FeedSync) Stop

func (f *FeedSync) Stop()

type Manifest

type Manifest map[string]*Feed

func (Manifest) Add

func (m Manifest) Add(info *events.Subscription) *Feed

Add a feed to the manifest, or update it if it is already present.
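The add-or-update behavior on a map of feed pointers can be sketched as follows. The subscription and feed types, their fields, and the choice of FeedID as the map key are all assumptions made for illustration; the fields of events.Subscription and Feed are not shown in this documentation.

```go
package main

import "fmt"

// subscription is a simplified, hypothetical stand-in for events.Subscription.
type subscription struct {
	FeedID string
	URL    string
}

// feed is a simplified, hypothetical stand-in for Feed.
type feed struct {
	url string
}

// manifest mirrors the documented shape: a map from a string key to a feed pointer.
type manifest map[string]*feed

// add updates the existing feed for the subscription in place, or inserts a
// new one if the key is not present. Keying by FeedID is an assumption.
func (m manifest) add(info *subscription) *feed {
	if f, ok := m[info.FeedID]; ok {
		f.url = info.URL
		return f
	}
	f := &feed{url: info.URL}
	m[info.FeedID] = f
	return f
}

func main() {
	m := make(manifest)
	a := m.add(&subscription{FeedID: "f1", URL: "https://example.com/rss"})
	b := m.add(&subscription{FeedID: "f1", URL: "https://example.com/atom"})

	// The second call reuses the first entry and updates its URL.
	fmt.Println(len(m), a == b, b.url) // prints "1 true https://example.com/atom"
}
```

A value receiver is enough here because a Go map value already refers to shared underlying storage, so insertions made inside the method are visible to the caller.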

Directories

Path Synopsis
cmd/baleen  Package main serves as the primary entry point for launching the Baleen command line application.
Package events provides data serialization for Baleen-specific events using message pack - a binary JSON compatible serialization format.
Package fetch provides a stateful interface for routinely fetching resources from the web.
Package opml implements parsing support for Outline Processor Markup Language - an XML format for creating outlines.
Package store contains functions for writing parsed feeds to cloud storage.
