harvester

package
v0.0.0-...-08bcabf
Warning

This package is not in the latest version of its module.

Published: Jan 13, 2015 License: GPL-3.0 Imports: 27 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func Btoi

func Btoi(b bool) int

Converts a boolean to an integer.

func DetectContributorType

func DetectContributorType(name string, gender int) string

Attempts to determine the contributor type (person, company, etc.) when a service API does not provide it. Detection needs a few values to test, here the name and the detected gender. TODO: More work on this...

func DetectGender

func DetectGender(name string) int

Detects gender based on the US Census database. Returns 0 for unknown, -1 for female, and 1 for male.

func ExpandUrl

func ExpandUrl(url string) string

Gets the final URL given a short URL (or one that has redirects)
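One way the redirect-following described here could work is sketched below, assuming Go's standard net/http client, which follows redirects by default. The names expandURL and newRedirectServer are hypothetical and not part of this package; the helper server exists only so the sketch can be tried without touching the network. Returning the input unchanged on error is also an assumption.

```go
package main

import (
	"net/http"
	"net/http/httptest"
)

// expandURL is a hypothetical sketch of what ExpandUrl describes: it issues a
// GET, lets the client follow redirects, and returns the URL of the final
// request. On any error it returns the input unchanged (an assumption).
func expandURL(client *http.Client, shortURL string) string {
	resp, err := client.Get(shortURL)
	if err != nil {
		return shortURL
	}
	defer resp.Body.Close()
	// resp.Request is the last request made, after redirects were followed.
	return resp.Request.URL.String()
}

// newRedirectServer starts a local test server whose /short path redirects
// to /final. It is illustration-only scaffolding.
func newRedirectServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/short", func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, "/final", http.StatusFound)
	})
	mux.HandleFunc("/final", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	return httptest.NewServer(mux)
}
```

Calling expandURL(srv.Client(), srv.URL+"/short") against the helper server yields the server's /final address. A production version would likely also set a client timeout.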

func FacebookAccountDetails

func FacebookAccountDetails(territoryName string, account string)

Harvests Facebook account details to track changes in likes, etc. (only for public pages)

func FacebookPostsOut

func FacebookPostsOut(posts []FacebookPost, territoryName string, params FacebookParams) (int, string, time.Time)

Takes a slice of FacebookPost structs, converts it to JSON, and logs it to file (to be picked up by Fluentd, Logstash, Ik, etc.)

func GetHarvestMd5

func GetHarvestMd5(text string) string

Turns the harvest id into an MD5 string. A simple concatenation would work, but some databases such as MySQL limit the length of unique key values, so a fixed-length MD5 digest fits without worry.
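A minimal sketch of the hashing step, assuming the result is the usual 32-character hexadecimal digest (the documentation only says "md5 string"). The name harvestMD5 is hypothetical.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
)

// harvestMD5 is a hypothetical equivalent of GetHarvestMd5: it hashes the
// harvest id text and returns the digest as a 32-character hex string, so
// the key length stays fixed however long the input is.
func harvestMD5(text string) string {
	sum := md5.Sum([]byte(text))
	return hex.EncodeToString(sum[:])
}
```

Because the digest length never changes, it fits in a fixed-width unique key column regardless of how long the concatenated id parts are.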

func GetKeywords

func GetKeywords(text string, minSize int, limit int) []string

func GooglePlusAccountDetails

func GooglePlusAccountDetails(territoryName string, account string)

Harvests Google+ account details to track changes in followers, etc. (NOTE: Pages can't currently be tracked by the existing API; it's invite only.)

func GooglePlusActivityByAccount

func GooglePlusActivityByAccount(territoryName string, harvestState config.HarvestState, account string, options url.Values) (url.Values, config.HarvestState)

Gets public Google+ activities (posts) by account.

func GooglePlusActivitySearch

func GooglePlusActivitySearch(territoryName string, harvestState config.HarvestState, query string, options url.Values) (url.Values, config.HarvestState)

Gets Google+ activities (posts) by searching for a keyword.

func InstagramAccountDetails

func InstagramAccountDetails(territoryName string, account string)

Harvests Instagram account details to track changes in followers, etc.

func InstagramFindTags

func InstagramFindTags(keyword string) string

Try to find tags based on a keyword (just return one for now, that's all we need for our purposes)

func InstagramSearch

func InstagramSearch(territoryName string, harvestState config.HarvestState, tag string, options url.Values) (url.Values, config.HarvestState)

Gets recent Instagram media related to a specific tag.

func IsQuestion

func IsQuestion(text string, regexString ...string) bool

Detects questions in messages
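The detection rule is not documented, so the following is only a sketch under stated assumptions: the default pattern (a question mark, or a leading English interrogative word) is invented for illustration, and the optional variadic argument is assumed to override it. The name isQuestion is hypothetical.

```go
package main

import "regexp"

// defaultQuestionPattern is an assumed default: a question mark anywhere, or
// a message that starts with a common English interrogative word.
const defaultQuestionPattern = `(?i)\?|^\s*(who|what|when|where|why|how)\b`

// isQuestion is a hypothetical sketch of IsQuestion. A non-empty first
// variadic argument replaces the default pattern.
func isQuestion(text string, regexString ...string) bool {
	pattern := defaultQuestionPattern
	if len(regexString) > 0 && regexString[0] != "" {
		pattern = regexString[0]
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false // an invalid pattern is treated as "no match"
	}
	return re.MatchString(text)
}
```

With these assumptions, isQuestion("how do I sign up") matches through the leading interrogative, while isQuestion("pls help", `(?i)help`) matches through the caller-supplied pattern.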

func IsStopKeyword

func IsStopKeyword(word string) bool

Checks a word against a list of stop words for keyword extraction. Most of the list is also available under data/keyword-stop-list.txt; more words were added after testing.
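GetKeywords carries no description, so how it combines with the stop list is a guess. The sketch below assumes that minSize is a minimum word length in characters, that limit caps the number of keywords returned, and that words are lowercased and de-duplicated. The stop list here is a tiny invented sample, not the package's real list, and all names are hypothetical.

```go
package main

import (
	"strings"
	"unicode"
)

// stopWords is a tiny illustrative sample, not the package's actual list.
var stopWords = map[string]bool{
	"the": true, "and": true, "for": true, "with": true, "this": true,
}

// isStopKeyword reports whether word is in the sample stop list.
func isStopKeyword(word string) bool {
	return stopWords[strings.ToLower(word)]
}

// keywords is a hypothetical sketch of GetKeywords. It splits text on
// anything that is not a letter or digit, drops short words, stop words and
// repeats, and stops once limit keywords are collected (limit <= 0 means no
// cap, which is an assumption).
func keywords(text string, minSize int, limit int) []string {
	words := strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	seen := make(map[string]bool)
	var out []string
	for _, w := range words {
		w = strings.ToLower(w)
		if len([]rune(w)) < minSize || isStopKeyword(w) || seen[w] {
			continue
		}
		seen[w] = true
		out = append(out, w)
		if limit > 0 && len(out) >= limit {
			break
		}
	}
	return out
}
```

Keeping the stop list in a map makes each lookup constant time, which matters when every word of every harvested message is checked.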

func LocaleToLanguageISO

func LocaleToLanguageISO(code ...string) string

Simple method for converting locale values like "en_US" (or even en-US) to ISO 639-1 (which would just be "en")
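The conversion described here can be sketched as follows. The behavior for missing arguments and for inputs whose language part is not two letters is not documented; returning an empty string in those cases is an assumption, as is the name localeToLanguage.

```go
package main

import "strings"

// localeToLanguage is a hypothetical sketch of LocaleToLanguageISO. It keeps
// the part before the first "_" or "-" and lowercases it. The variadic
// parameter mirrors the documented signature; only the first value is used.
func localeToLanguage(code ...string) string {
	if len(code) == 0 {
		return "" // assumed result when no locale is given
	}
	parts := strings.FieldsFunc(code[0], func(r rune) bool {
		return r == '_' || r == '-'
	})
	if len(parts) == 0 {
		return ""
	}
	lang := strings.ToLower(parts[0])
	if len(lang) != 2 {
		return "" // not a two-letter ISO 639-1 code (assumed handling)
	}
	return lang
}
```

Both localeToLanguage("en_US") and localeToLanguage("en-US") give "en" under these assumptions.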

func Log

func Log(event []byte, channelName string)

Sends to the buffered channel to eventually flush to disk.

func LogJson

func LogJson(message interface{}, channelName string)

Converts the message to JSON before sending the resulting bytes to Log()
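The relationship between Log and LogJson could look roughly like the sketch below. The channel registry, the "messages" series name, and the buffer size of 100 are assumptions made for illustration; the package's real channel setup is not shown in this documentation. The names logEvent and logJSON are hypothetical.

```go
package main

import "encoding/json"

// channels holds one buffered channel per series name. The name and the
// buffer size are assumptions made for this sketch.
var channels = map[string]chan []byte{
	"messages": make(chan []byte, 100),
}

// logEvent queues raw bytes on the named channel for a worker to flush to
// disk later. Unknown channel names are silently dropped in this sketch.
func logEvent(event []byte, channelName string) {
	if ch, ok := channels[channelName]; ok {
		ch <- event // blocks if the buffer is full and no worker is draining
	}
}

// logJSON marshals the message and hands the bytes to logEvent.
func logJSON(message interface{}, channelName string) {
	b, err := json.Marshal(message)
	if err != nil {
		return // values that cannot be marshaled are skipped in this sketch
	}
	logEvent(b, channelName)
}
```

The buffered channel decouples harvesting from disk writes: callers return as soon as the event is queued, and a separate worker drains the channel.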

func New

func New(configuration config.SocialHarvestConf, database *config.SocialHarvestDB)

Sets up a new harvester with the given configuration (which is comprised of several "services")

func NewFacebook

func NewFacebook(servicesConfig config.ServicesConfig)

Set the appToken for future use (global)

func NewFacebookTerritoryCredentials

func NewFacebookTerritoryCredentials(territory string)

If the territory has a different appToken to use

func NewGenderData

func NewGenderData(femaleFilename string, maleFilename string)

Load data from CSV files in order to detect gender. If new files are being used, call this again.

func NewGooglePlus

func NewGooglePlus(servicesConfig config.ServicesConfig)

func NewGooglePlusTerritoryCredentials

func NewGooglePlusTerritoryCredentials(territory string)

If the territory has different keys to use

func NewInstagram

func NewInstagram(servicesConfig config.ServicesConfig)

Set the client for future use

func NewInstagramTerritoryCredentials

func NewInstagramTerritoryCredentials(territory string)

If the territory has different keys to use

func NewLoggers

func NewLoggers(dir string)

Creates and configures new workers on each of the logging channels and sets the directory path to store the log files.

func NewTwitter

func NewTwitter(servicesConfig config.ServicesConfig)

func NewTwitterTerritoryCredentials

func NewTwitterTerritoryCredentials(territory string)

If the territory has different keys to use

func NewYouTube

func NewYouTube(servicesConfig config.ServicesConfig)

func NewYouTubeTerritoryCredentials

func NewYouTubeTerritoryCredentials(territory string)

If the territory has different keys to use

func StoreHarvestedData

func StoreHarvestedData(message interface{})

Rather than using an observer, just call this function instead (the observer was causing memory leaks). TODO: Look back into channels in the future because I like the idea of pub/sub; it could expand into something useful. The thing I don't like about calling this directly (and why I used the observer) is passing all the configuration stuff around.

func TwitterAccountDetails

func TwitterAccountDetails(territoryName string, account string)

Harvests Twitter account details to track changes in followers, etc.

func TwitterAccountStream

func TwitterAccountStream(territoryName string, harvestState config.HarvestState, options url.Values) (url.Values, config.HarvestState)

Harvests from a specific Twitter account stream

func TwitterSearch

func TwitterSearch(territoryName string, harvestState config.HarvestState, query string, options url.Values) (url.Values, config.HarvestState)

Searches for status updates and passes each Tweet along (no special mapping is required like FacebookPost{} because, unlike Facebook, the Tweet struct is used across multiple API calls).

All "search" functions (and anything that gets data from an API) now normalize the data, mapping it to a Social Harvest struct. This means there is no way to get the original data from the service (back in the main app or from any other Go package that imports the harvester). This is fine because anyone who wants the original data can use packages like anaconda directly.

All data pulled from each service's API is sent to a channel (the harvester observer). However, this function should NOT be called in a goroutine. We don't want to make multiple API calls in parallel (rate limits).

NOTE: The number of items sent to the observer is returned along with the last message's time and id. The main package can record this in the harvest logs/table. The harvester does not keep track of this information itself. Its only job is to gather data, send it to the channel, and report back on how much was sent (and the last id/time). Period. It doesn't care whether the data is stored in a database, logged, or streamed out from an API. It just harvests and sends without looking or caring. Previously it did the db calls, logging, etc. itself; the observer now takes care of all of that. All of these other processes simply subscribe and listen.

Always passed in first: the territory name and the position in the harvest (HarvestState). The remaining arguments vary by API but are typically the query and options. @return options (for pagination), count of items, last id, last time.

func YouTubeAccountDetails

func YouTubeAccountDetails(territoryName string, account string)

Harvests YouTube channel details to track changes in subscribers. (In theory, account could be a comma-separated list of account names.)

Types

type FacebookAccount

type FacebookAccount struct {
	// "id" must exist in response. note the leading comma.
	Id              string `json:"id,required"`
	About           string `json:"about"`
	Category        string `json:"category"`
	Checkins        int    `json:"checkins"`
	CompanyOverview string `json:"company_overview"`
	Description     string `json:"description"`
	Founded         string `json:"founded"`
	GeneralInfo     string `json:"general_info"`
	Likes           int    `json:"likes"`
	Link            string `json:"link"`
	Location        struct {
		Street    string  `json:"street"`
		City      string  `json:"city"`
		State     string  `json:"state"`
		Zip       string  `json:"zip"`
		Country   string  `json:"country"`
		Longitude float64 `json:"longitude"`
		Latitude  float64 `json:"latitude"`
	} `json:"location"`
	Name              string `json:"name"`
	Phone             string `json:"phone"`
	TalkingAboutCount int    `json:"talking_about_count"`
	WereHereCount     int    `json:"were_here_count"`
	Username          string `json:"username"`
	Website           string `json:"website"`
	Products          string `json:"products"`
	// User specific (the above is a mix of page and user)
	Gender    string `json:"gender"`
	Locale    string `json:"locale"`
	FirstName string `json:"first_name"`
	LastName  string `json:"last_name"`
}

Facebook accounts can be for a user or a page

func FacebookGetUserInfo

func FacebookGetUserInfo(id string, params FacebookParams) FacebookAccount

Gets basic info about an account on Facebook

type FacebookParams

type FacebookParams struct {
	IncludeEntities string `url:"include_entities,omitempty"`
	Limit           string `url:"limit,omitempty"`
	Count           string `url:"count,omitempty"`
	Type            string `url:"type,omitempty"`
	Lang            string `url:"lang,omitempty"`
	Q               string `url:"q,omitempty"`
	AccessToken     string `url:"access_token,omitempty"`
	Until           string `url:"until,omitempty"`
	Since           string `url:"since,omitempty"`
}

func FacebookFeed

func FacebookFeed(territoryName string, harvestState config.HarvestState, account string, params FacebookParams) (FacebookParams, config.HarvestState)

Gets the public posts for a given user or page id (or name actually)

func FacebookSearch

func FacebookSearch(territoryName string, harvestState config.HarvestState, params FacebookParams) (FacebookParams, config.HarvestState)

Searches public posts on Facebook

type FacebookPost

type FacebookPost struct {
	// "id" must exist in response. note the leading comma.
	Id   string `json:"id,required"`
	From struct {
		Id       string `json:"id"`
		Name     string `json:"name"`
		Category string `json:"category"`
	} `json:"from"`
	To struct {
		Data []struct {
			Id       string `json:"id"`
			Name     string `json:"name"`
			Category string `json:"category"`
		} `json:"data"`
	} `json:"to"`
	CreatedTime string `json:"created_time"`
	UpdatedTime string `json:"updated_time"`
	Message     string `json:"message"`
	Description string `json:"description"`
	Caption     string `json:"caption"`
	Picture     string `json:"picture"`
	Source      string `json:"source"`
	Link        string `json:"link"`
	Shares      struct {
		Count int `json:"count"`
	} `json:"shares"`
	Name string `json:"name"`
	// Should always be "post" right? No, facebook also includes "status" and "link" and "photo" in there, even with the type param set to post. Seems like something changed/broke.
	Type string `json:"type"`
	// This can tell us if the user is posting from a mobile device...with some logic. Or just which client apps/SaaS' are most popular to post from (also true for Twitter and could be good data to have).
	Application struct {
		Name      string `json:"name"`
		Namespace string `json:"namespace"`
		Id        string `json:"id"`
	} `json:"application"`
	MessageTags map[string][]*MessageTag `json:"message_tags"`
	StoryTags   map[string][]*MessageTag `json:"story_tags"`
	Story       string                   `json:"story"`
	// Typically accompanies items of type photo.
	ObjectId string `json:"object_id"`

	// This only exists on user/page /feed items...and it'll usually be "shared_story" but sometimes I've seen "mobile_status_update" ... which tells us the user is on a mobile device.
	// Is it important to keep? I don't know. Probably not right now.
	StatusType string `json:"status_type"`
}

type MessageTag

type MessageTag struct {
	Id   string `json:"id"`
	Name string `json:"name"`
	Type string `json:"type"`
}

type PagingResult

type PagingResult struct {
	Next     string `json:"next" url:"next"`
	Previous string `json:"previous" url:"previous"`
}

type TimeoutTransport

type TimeoutTransport struct {
	http.Transport
	RoundTripTimeout time.Duration
}

func (*TimeoutTransport) RoundTrip

func (t *TimeoutTransport) RoundTrip(req *http.Request) (*http.Response, error)

If you don't set RoundTripTimeout on TimeoutTransport, this will always time out at 0

type UsCensusName

type UsCensusName struct {
	Name    string
	Freq    float64
	CumFreq float64
	Rank    int
}

For determining gender, we use the US Census database: https://www.census.gov/genealogy/www/data/1990surnames/names_files.html

Note: we could also statistically guess ethnicity: https://www.census.gov/genealogy/www/data/2000surnames/index.html

Frequency is the one we want. Cumulative frequency is in relation to all names in the database. So if there were a tie, for example "Pat" being both a male and a female name, we could look at the cumulative frequency to see whether the Census saw more Pats who were male or female. This should be extremely rare, and it may not be a great way to break ties, but it works.
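The frequency comparison and the cumulative-frequency tie-break could be sketched as below. Several points are assumptions: names are keyed in upper case, a name found in only one list takes that list's gender, and a lower cumulative frequency is read as "more common" (on the understanding that the Census files accumulate that column from the most common name downward). The lowercase identifiers are hypothetical; the return values follow the DetectGender convention of -1 for female, 1 for male, and 0 for unknown.

```go
package main

import "strings"

// usCensusName mirrors the documented UsCensusName fields.
type usCensusName struct {
	Name    string
	Freq    float64
	CumFreq float64
	Rank    int
}

// detectGender is a hypothetical sketch: it compares Freq across the female
// and male tables and falls back to CumFreq only when Freq ties.
func detectGender(name string, female, male map[string]usCensusName) int {
	key := strings.ToUpper(strings.TrimSpace(name))
	f, inF := female[key]
	m, inM := male[key]
	switch {
	case !inF && !inM:
		return 0 // unknown name
	case inF && !inM:
		return -1
	case inM && !inF:
		return 1
	case f.Freq > m.Freq:
		return -1
	case m.Freq > f.Freq:
		return 1
	case f.CumFreq < m.CumFreq: // assumed: lower cumulative means more common
		return -1
	case m.CumFreq < f.CumFreq:
		return 1
	}
	return 0 // still tied
}
```

The tables would be filled from the CSV files passed to NewGenderData; this sketch takes them as parameters so it stays free of global state.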

type Worker

type Worker struct {
	// contains filtered or unexported fields
}

func NewWorker

func NewWorker(id int, series string) (w *Worker)

Each worker gets an id and a series name which get combined for a file name and directory within the root directory defined in the Social Harvest configuration.

func (*Worker) Save

func (w *Worker) Save()

Writes the buffer to a temporary file that gets moved to its final location. Logging is split up among multiple files by worker id and time.
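The write-then-move step can be sketched with the standard library as follows. The file naming scheme (series, worker id, and a nanosecond timestamp) and the function name saveBuffer are assumptions for illustration; the package's actual layout is not documented here.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// saveBuffer is a hypothetical sketch of the Save step: it writes buf to a
// temporary file and then renames it into place, so a log shipper never
// picks up a half-written file. It returns the final path.
func saveBuffer(dir, series string, id int, buf []byte) (string, error) {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return "", err
	}
	// Create the temp file in the target directory so the rename stays on
	// one filesystem.
	tmp, err := os.CreateTemp(dir, series+"-*.tmp")
	if err != nil {
		return "", err
	}
	if _, err := tmp.Write(buf); err != nil {
		tmp.Close()
		os.Remove(tmp.Name())
		return "", err
	}
	if err := tmp.Close(); err != nil {
		os.Remove(tmp.Name())
		return "", err
	}
	// Assumed naming scheme: series, worker id, and current time.
	name := fmt.Sprintf("%s-%d-%d.log", series, id, time.Now().UnixNano())
	final := filepath.Join(dir, name)
	if err := os.Rename(tmp.Name(), final); err != nil {
		os.Remove(tmp.Name())
		return "", err
	}
	return final, nil
}
```

Writing to a temporary name first matters when tools such as Fluentd or Logstash watch the directory, since they only ever see complete files.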

func (*Worker) Work

func (w *Worker) Work(channelName chan []byte)

Assigns a worker to work on the given channel.
