datastore

package
v0.0.0-...-9107137
Warning

This package is not in the latest version of its module.

Published: Jun 27, 2022 | License: Apache-2.0 | Imports: 9 | Imported by: 0

Documentation

Overview

Package datastore provides the datastore that keeps track of documents, the links between them, and the entities mentioned in them.

Constants

This section is empty.

Variables

This section is empty.

Functions

func DocLinkKey

func DocLinkKey(link DocLink) string

DocLinkKey generates the primary key for the given DocLink. WARNING: Changing this code will break existing databases since existing data will not have compatible keys.

func DriveKey

func DriveKey(id string) string

DriveKey generates the primary key for the given Google Drive file. WARNING: Changing this code will break existing databases since existing data will not have compatible keys.

func EntityMentionKey

func EntityMentionKey(m EntityMention) string

EntityMentionKey generates the primary key for the given EntityMention. WARNING: Changing this code will break existing databases since existing data will not have compatible keys.

func GormIgnored

func GormIgnored(typ interface{}) cmp.Option

GormIgnored returns IgnoreFields for the given type, e.g. GormIgnored(DocReference{}).

Types

type Datastore

type Datastore struct {
	// contains filtered or unexported fields
}

func New

func New(dbFile string, logger logr.Logger) (*Datastore, error)

New creates a new datastore.

func (*Datastore) Close

func (d *Datastore) Close() error

func (*Datastore) FindEntity

func (d *Datastore) FindEntity(q EntityQuery) ([]*Entity, error)

FindEntity is a primitive form of entity linking.

TODO(jeremy): How should we handle the case where we could potentially have multiple entries in the database that would match?
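
The package does not document how FindEntity matches a query, so the following is only a minimal in-memory sketch of what primitive entity linking could look like. The lowercase types and the matching rule (an entity matches if any non-empty query field agrees with it, with names compared case-insensitively) are assumptions for illustration, not the package's behavior. Returning every match leaves the ambiguity raised in the TODO to the caller.

```go
package main

import (
	"fmt"
	"strings"
)

// entity and entityQuery are hypothetical, trimmed-down stand-ins for
// Entity and EntityQuery.
type entity struct {
	ID           string
	Name         string
	WikipediaUrl string
	MID          string
}

type entityQuery struct {
	Name         string
	WikipediaURL string
	MID          string
}

// findEntity returns every entity that agrees with at least one non-empty
// query field. Assumption: the real FindEntity may apply different rules.
func findEntity(all []*entity, q entityQuery) []*entity {
	var out []*entity
	for _, e := range all {
		byMID := q.MID != "" && e.MID == q.MID
		byURL := q.WikipediaURL != "" && e.WikipediaUrl == q.WikipediaURL
		byName := q.Name != "" && strings.EqualFold(e.Name, q.Name)
		if byMID || byURL || byName {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	// Two distinct entities that share a name; the MID value is made up.
	all := []*entity{
		{ID: "e1", Name: "Paris", MID: "/m/aaa"},
		{ID: "e2", Name: "Paris"},
	}
	fmt.Println(len(findEntity(all, entityQuery{Name: "paris"})))
	fmt.Println(len(findEntity(all, entityQuery{MID: "/m/aaa"})))
}
```

A name-only query here returns both entities, which is exactly the multiple-match case the TODO asks about.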

func (*Datastore) ListDocLinks

func (d *Datastore) ListDocLinks(destId string) ([]*DocLink, error)

ListDocLinks lists all the doc links. destId is optional; if supplied, only the links pointing at this destination id are listed.

func (*Datastore) ListDocReferences

func (d *Datastore) ListDocReferences() ([]*DocReference, error)

ListDocReferences lists all the DocReferences.

func (*Datastore) ListEntities

func (d *Datastore) ListEntities() ([]*Entity, error)

ListEntities lists all the entities.

func (*Datastore) ListEntityMentions

func (d *Datastore) ListEntityMentions(docId string) ([]*EntityMention, error)

ListEntityMentions lists all the entity mentions. docId is optional; if supplied, only the mentions for the provided doc are listed.

func (*Datastore) ToBeIndexed

func (d *Datastore) ToBeIndexed() ([]*DocReference, error)

ToBeIndexed returns a list of DocReferences that need to be indexed.
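
The selection rule is not stated here. Given the Md5Checksum and LastIndexedMd5Checksum fields on DocReference, one plausible reading is that a document needs indexing when the two differ. The sketch below assumes that rule and uses a hypothetical in-memory slice instead of the database.

```go
package main

import "fmt"

// docReference is a hypothetical, trimmed-down stand-in for DocReference.
type docReference struct {
	ID                     string
	Md5Checksum            string
	LastIndexedMd5Checksum string
}

// toBeIndexed keeps the references whose current checksum differs from the
// checksum recorded at the last indexing run.
// Assumption: the real ToBeIndexed may use a different criterion.
func toBeIndexed(refs []*docReference) []*docReference {
	var out []*docReference
	for _, r := range refs {
		if r.Md5Checksum != r.LastIndexedMd5Checksum {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	refs := []*docReference{
		{ID: "never-indexed", Md5Checksum: "aaa"},
		{ID: "up-to-date", Md5Checksum: "bbb", LastIndexedMd5Checksum: "bbb"},
		{ID: "changed", Md5Checksum: "ddd", LastIndexedMd5Checksum: "ccc"},
	}
	for _, r := range toBeIndexed(refs) {
		fmt.Println(r.ID)
	}
}
```

Under this rule a document that has never been indexed is selected too, because its LastIndexedMd5Checksum is empty.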

func (*Datastore) UpdateDocLink

func (d *Datastore) UpdateDocLink(l *DocLink) error

UpdateDocLink updates or creates the DocLink.

TODO(jeremy): This function needs to be updated to allow for a given link to appear multiple times in a doc. In that case we want to have multiple entries in the doc.

func (*Datastore) UpdateDocReference

func (d *Datastore) UpdateDocReference(r *DocReference) error

UpdateDocReference updates or creates the DocReference.

func (*Datastore) UpdateEntity

func (d *Datastore) UpdateEntity(m *Entity) error

UpdateEntity updates or creates the Entity.

TODO(jeremy): The semantics for dealing with multiple entities with the same name are ill-defined. Right now it is the caller's job to do entity linking before calling UpdateEntity. If an entity with a given name already exists in the database but m represents a different entity with the same name, then the caller should assign a unique id to it.

func (*Datastore) UpdateEntityMention

func (d *Datastore) UpdateEntityMention(m *EntityMention) error

UpdateEntityMention updates or creates the EntityMention.

type DocLink

type DocLink struct {
	// The unique id follows the convention sourceId-destId-startIndex-endIndex.
	// This is arguably not space efficient but we can optimize later.
	ID        string `gorm:"primarykey"`
	CreatedAt time.Time
	UpdatedAt time.Time
	DeletedAt gorm.DeletedAt `gorm:"index"`
	// SourceID is the id of the source doc
	SourceID string `gorm:"index"`
	// DestID is the destination doc
	DestID string `gorm:"index"`
	// URI is the URI the link is pointing to
	URI string
	// Text is the text associated with the link.
	Text string
	// StartIndex of the text for the link.
	StartIndex int64
	// EndIndex of the text for the link.
	EndIndex int64
}

DocLink is a directional link between two docs. We do not rely on GORM associations for a few reasons:

  1. We want to stick with a CRUD API to allow for more flexible backends.
  2. Using associations adds complexity in how the API gets used. I believe GORM populates associations by doing joins. There is also confusion about which fields a user should set to update a BelongsTo association, as there are separate fields for the foreign key and the reference.
  3. GraphQL might be a better API for joins.

A link can appear more than once between two documents; i.e. a given doc can have multiple hyperlinks to another document. Not all links will have DestID set.
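
To make the key convention from the struct comment concrete, here is a minimal sketch that joins the four fields in the documented order. The lowercase type and function are hypothetical stand-ins; the real DocLinkKey may format or escape the fields differently. The example shows why the text range is part of the key: two hyperlinks between the same pair of documents still receive distinct keys.

```go
package main

import "fmt"

// docLink is a hypothetical, trimmed-down stand-in for DocLink.
type docLink struct {
	SourceID   string
	DestID     string
	StartIndex int64
	EndIndex   int64
}

// docLinkKey follows the documented convention
// sourceId-destId-startIndex-endIndex.
// Assumption: this is an illustration, not the package's implementation.
func docLinkKey(l docLink) string {
	return fmt.Sprintf("%s-%s-%d-%d", l.SourceID, l.DestID, l.StartIndex, l.EndIndex)
}

func main() {
	first := docLink{SourceID: "docA", DestID: "docB", StartIndex: 10, EndIndex: 24}
	second := docLink{SourceID: "docA", DestID: "docB", StartIndex: 40, EndIndex: 52}
	fmt.Println(docLinkKey(first))  // docA-docB-10-24
	fmt.Println(docLinkKey(second)) // docA-docB-40-52
}
```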

type DocReference

type DocReference struct {
	// The unique id follows the convention $namespace.id where namespace identifies a namespace with respect
	// to which id is unique. Typically namespace is the source of the file e.g. Google Drive.
	ID        string `gorm:"primarykey"`
	CreatedAt time.Time
	UpdatedAt time.Time
	DeletedAt gorm.DeletedAt `gorm:"index"`

	// The ID of the file in Google Drive. We create a unique index named uid to ensure there is
	// one row for each doc. This could eventually become a composite index because we want to allow for
	// documents in different systems (e.g. Drive and GitHub), in which case the uid index would be a composite
	// key on DriveId and a GitHub id, and only one of them will be set.
	DriveId  string `gorm:"index:uid,unique"`
	Name     string
	MimeType string

	// TODO(jeremy): We should rename the checksum fields to be opaque version numbers. They won't always be
	// checksums.
	// Md5Checksum is current checksum
	Md5Checksum string

	// LastIndexedMd5Checksum is the checksum at which it was last indexed
	LastIndexedMd5Checksum string
}

DocReference is a reference to a document stored in some system such as Google Drive.
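
The ID comment describes a $namespace.id convention. The sketch below shows one way such a key could be built for a Drive file. The namespace label "gdrive" and both helper functions are hypothetical; the actual string produced by DriveKey is not documented on this page.

```go
package main

import "fmt"

// driveNamespace is a hypothetical namespace label chosen for this example.
const driveNamespace = "gdrive"

// namespacedKey builds an id of the form namespace.id, so that ids from
// different source systems cannot collide.
func namespacedKey(namespace, id string) string {
	return namespace + "." + id
}

// driveKey is an illustrative stand-in for DriveKey.
// Assumption: the real function may use a different namespace or separator.
func driveKey(id string) string {
	return namespacedKey(driveNamespace, id)
}

func main() {
	fmt.Println(driveKey("abc123")) // gdrive.abc123
}
```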

type DocReferenceIter

type DocReferenceIter func(r *DocReference) error

DocReferenceIter is an iterator over DocReferences

type Entity

type Entity struct {
	ID        string `gorm:"primarykey"`
	CreatedAt time.Time
	UpdatedAt time.Time
	DeletedAt gorm.DeletedAt `gorm:"index"`

	// Name is the canonical name of the entity
	Name string

	// Type of entity
	Type string

	// WikipediaURL associated with this entity if there is one.
	WikipediaUrl string

	// MID is the Google Knowledge Graph MID if there is one
	MID string `gorm:"column:mid"`
}

Entity is a unique entity.

type EntityMention

type EntityMention struct {
	// The unique id follows the convention docId-startIndex-endIndex.
	// Assumption is a given range can only be a single entity.
	ID        string `gorm:"primarykey"`
	CreatedAt time.Time
	UpdatedAt time.Time
	DeletedAt gorm.DeletedAt `gorm:"index"`
	// DocID is the id of the doc
	DocID    string `gorm:"index"`
	EntityID string
	// Text associated with the entity
	Text string
	// StartIndex of the text for the mention.
	StartIndex int64
	// EndIndex of the text for the mention.
	EndIndex int64
}

EntityMention is the mention of some entity in a doc.

We do not rely on GORM associations for a few reasons:

  1. We want to stick with a CRUD API to allow for more flexible backends.
  2. Using associations adds complexity in how the API gets used. I believe GORM populates associations by doing joins. There is also confusion about which fields a user should set to update a BelongsTo association, as there are separate fields for the foreign key and the reference.
  3. GraphQL might be a better API for joins.

A specific entity can appear more than once in a given doc.

TODO(jeremy): We also need an Entity table and should attempt to do some entity linking.
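
The key convention docId-startIndex-endIndex, together with the assumption that a range holds a single entity, implies update-or-create behavior keyed on the text range. The sketch below illustrates that with a hypothetical in-memory map; it is not the GORM-backed implementation of UpdateEntityMention, and the real EntityMentionKey may format the key differently.

```go
package main

import "fmt"

// entityMention is a hypothetical, trimmed-down stand-in for EntityMention.
type entityMention struct {
	DocID      string
	EntityID   string
	Text       string
	StartIndex int64
	EndIndex   int64
}

// mentionKey follows the documented convention docId-startIndex-endIndex.
func mentionKey(m entityMention) string {
	return fmt.Sprintf("%s-%d-%d", m.DocID, m.StartIndex, m.EndIndex)
}

// store is a hypothetical in-memory table keyed by mentionKey.
type store map[string]entityMention

// upsert creates the row if the key is new and overwrites it otherwise.
// It reports whether a new row was created.
func (s store) upsert(m entityMention) (created bool) {
	k := mentionKey(m)
	_, exists := s[k]
	s[k] = m
	return !exists
}

func main() {
	s := store{}
	fmt.Println(s.upsert(entityMention{DocID: "doc1", EntityID: "e1", StartIndex: 5, EndIndex: 10}))
	// Same range, different entity: the existing row is replaced.
	fmt.Println(s.upsert(entityMention{DocID: "doc1", EntityID: "e2", StartIndex: 5, EndIndex: 10}))
	// Same entity at a different range: a second row is created.
	fmt.Println(s.upsert(entityMention{DocID: "doc1", EntityID: "e2", StartIndex: 30, EndIndex: 35}))
	fmt.Println(len(s))
}
```

Because the entity id is not part of the key, the same entity can be mentioned many times in a doc, while each range stores at most one entity.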

type EntityQuery

type EntityQuery struct {
	Name         string
	WikipediaURL string
	MID          string
}
