insidescraper

package module
v0.0.0-...-a01de33 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jul 9, 2020 License: MIT Imports: 15 Imported by: 0

README

inside-chassidus-scraper

Go through inside chassidus, turn it's structure into json etc.

Note that, for research, make sure javascript is disabled for insidechassidus.com. That provides an accurate view of what the scraper sees.

Status

Note that this no longer works. The website I was scraping changed a lot, and everything is in a diffirent place. However, now it's properly using wordpress, and I can get data from the wordpress APIs directly, which is awesome. This was very useful in it's time, and helped get an app I was working on launched (I'm only now, almost a year after using this, able to use the wordpress APIs), and am grateful to the excellent work of gcolly for making it possible. Code on!

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type ContentReference

type ContentReference struct {
	Type DataType

	// Reference can either be the ID of a section or lesson, or a media source URL.
	Reference string
}

ContentReference can refer to any of section, lesson, or media

type Correction

type Correction struct {
	Guesses []string
	// The sections which reference the bad URL.
	Parents      []string
	Is404        bool
	IsConfirmed  bool
	WasCorrected bool
	Source       DataType
}

Correction is a (possible) correction for a missing link.

type DataType

type DataType int

DataType is the type of the site data.

const (
	// SectionType refers to a section.
	SectionType DataType = iota
	// LessonType refers to a lesson.
	LessonType
	// MediaType refers to a media.
	MediaType
)

type InsideScraper

type InsideScraper struct {
	Site Site
	// contains filtered or unexported fields
}

InsideScraper scrapes insidechassidus for lesson structure.

func (*InsideScraper) Scrape

func (scraper *InsideScraper) Scrape(scrapeURL ...string) (err error)

Scrape scrapes the site. It returns an error if there's an error.

type Lesson

type Lesson struct {
	*SiteData
	// ID is the URL of the lessons, if they are from their own page.
	ID    string
	Audio []Media
}

Lesson describes one lesson. It may contain multiple classes.

func (*Lesson) UnmarshalJSON

func (i *Lesson) UnmarshalJSON(data []byte) error

type LessonCounter

type LessonCounter struct {
	Data *Site
	// contains filtered or unexported fields
}

LessonCounter sets the lesson count property of each section. This is kept separate from the scraper so that postfixes etc can be applied to the data before counting everything up.

func MakeCounter

func MakeCounter(data *Site) LessonCounter

MakeCounter creates a new counter with the given site data.

func (*LessonCounter) CountLessons

func (counter *LessonCounter) CountLessons()

CountLessons counts all the lessons, recursively.

type LessonScraper

type LessonScraper struct {
	Row    *goquery.Selection
	Lesson *Lesson
}

LessonScraper gets data about lessons from a column.

func (*LessonScraper) LoadLesson

func (scraper *LessonScraper) LoadLesson()

LoadLesson scrapes the row and returns a structured lesson.

type Media

type Media struct {
	// Note that a media item will *only* have it's own PDF if it was converted from a lesson.
	*SiteData
	Source string
}

Media contains information about a particular piece of media.

func (*Media) UnmarshalJSON

func (i *Media) UnmarshalJSON(data []byte) error

type PostScraper

type PostScraper struct {
	Site    Site
	Missing map[string]Correction
	Empty   map[string]Correction
}

PostScraper goes over the scraped data and fixes it up as much as possible.

func (*PostScraper) FixSite

func (cleaner *PostScraper) FixSite()

FixSite applies fixes to the site data.

func (*PostScraper) GetEmptyCorrections

func (cleaner *PostScraper) GetEmptyCorrections() map[string]Correction

GetEmptyCorrections finds all empty sections (no lessons or subsections) and tries to correct.

func (*PostScraper) GetMissingCorrections

func (cleaner *PostScraper) GetMissingCorrections() map[string]Correction

GetMissingCorrections attempts to find as many missing sections as possible.

type ResolvedSection

type ResolvedSection struct {
	*SiteData

	ID string

	Content []ContentReference

	Audio map[string]Media

	// AudioCount contains the total number of audio classes contained in this section,
	// including all descendant sections.
	AudioCount int
}

ResolvedSection stores optimized data.

type ResolvedSite

type ResolvedSite struct {
	Sections map[string]ResolvedSection
	Lessons  map[string]Lesson
	// IDs of all top level sections.
	TopLevel []TopItem
}

ResolvedSite contains all site data.

type ResolvingItem

type ResolvingItem struct {
	Type DataType

	SectionID string

	// Audio is for when a section consists only of a single audio.
	Audio *Media

	// Lesson is for when a section is really just a lesson.
	Lesson *Lesson
}

ResolvingItem is an interim object during resolution.

type SectionResolver

type SectionResolver struct {
	// Site is the original Site
	Site Site

	// NeweSite is the resolved Site
	ResolvedSite ResolvedSite
}

SectionResolver optimizes data structure.

func (*SectionResolver) ResolveMedia

func (resolver *SectionResolver) ResolveMedia(audio Media, lesson *SiteData) Media

ResolveMedia gives the given media all of its data.

func (*SectionResolver) ResolveSection

func (resolver *SectionResolver) ResolveSection(sectionID string) *ResolvingItem

ResolveSection converts the given section into its most efficiant representation.

func (*SectionResolver) ResolveSite

func (resolver *SectionResolver) ResolveSite()

ResolveSite resolves the Site into an optimized site.

type Site

type Site struct {
	Sections map[string]SiteSection
	Lessons  map[string]Lesson
	// IDs of all top level sections.
	TopLevel []TopItem
}

Site contains all site data.

func (*Site) ConvertToLesson

func (site *Site) ConvertToLesson(sectionID string) error

ConvertToLesson converts the section to a lesson if it only contains single-audio lessons. Returns an error if it can't be done.

type SiteData

type SiteData struct {
	Title       string
	Description string
	// PDFs can pop up at any level.
	// For example, sometimes a section has a pdf for it. This usually (probably always) happens
	// when the section contains only lessons, when it will anyway be converted to a lesson.
	// See https://insidechassidus.org/thought-and-history/123-kabbala-and-philosophy-series/1699-chassidus-understanding-what-can-be-understood-of-g-dliness/section-one-before-logic
	Pdf []string
}

SiteData is a base type used by other site structures.

type SiteSection

type SiteSection struct {
	*SiteData

	ID string
	// Sections contains the ids of all sub sections
	Sections []string
	// IDs of all lessons in this section.
	Lessons []string
	// AudioCount contains the total number of audio classes contained in this section,
	// including  all descendant sections.
	AudioCount int
}

SiteSection describes a section of a site.

func (*SiteSection) UnmarshalJSON

func (i *SiteSection) UnmarshalJSON(data []byte) error

type TopItem

type TopItem struct {
	ID    string
	Image string
}

TopItem is a top level item on the site.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL