hqgourl

Version: v0.0.0-...-0d9326f
Published: Feb 12, 2024 License: MIT Imports: 11 Imported by: 5

README

hqgourl

A Go (Golang) package for extracting, parsing, and manipulating URLs.

Features

  • Flexible URL extraction from text using regular expressions.
  • Domain parsing into subdomains, root domains, and TLDs.
  • Extends the standard net/url URL parsing with additional fields.

Installation

go get -v -u github.com/hueristiq/hqgourl

Usage

URL Extraction
package main

import (
    "fmt"
    "github.com/hueristiq/hqgourl"
)

func main() {
    extractor := hqgourl.NewURLExtractor()
    text := "Check out this website: https://example.com and send an email to info@example.com."
    
    regex := extractor.CompileRegex()
    matches := regex.FindAllString(text, -1)
    
    fmt.Println("Found URLs:", matches)
}

The URLExtractor allows customization of the URL extraction process through various options. For instance, you can specify whether to include URL schemes and hosts in the extraction and provide custom regex patterns for these components.

  • Extracting URLs with Specific Schemes

    extractor := hqgourl.NewURLExtractor(
        hqgourl.URLExtractorWithSchemePattern(`(?:https?|ftp)://`),
    )
    

    This configuration will extract only URLs starting with http, https, or ftp schemes.

  • Extracting URLs with Custom Host Patterns

    extractor := hqgourl.NewURLExtractor(
        hqgourl.URLExtractorWithHostPattern(`(?:www\.)?example\.com`),
    )
    
    

    This setup will extract URLs that have hosts matching www.example.com or example.com.
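
To make the effect of scheme and host patterns concrete, the following standard-library sketch joins the two example patterns into one expression. It does not reproduce the pattern that hqgourl builds internally: `composePattern` and the trailing `[^\s]*` are assumptions made only for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// composePattern is a hypothetical helper, not part of hqgourl. It glues a
// scheme pattern and a host pattern together and lets the rest of the URL
// run until the next whitespace character.
func composePattern(schemePattern, hostPattern string) *regexp.Regexp {
	return regexp.MustCompile(schemePattern + hostPattern + `[^\s]*`)
}

func main() {
	re := composePattern(`(?:https?|ftp)://`, `(?:www\.)?example\.com`)
	text := "Try https://www.example.com/x but not http://other.org or ftp://example.com"

	// Only URLs whose scheme and host both match are reported.
	fmt.Println(re.FindAllString(text, -1))
}
```

A URL is kept only when both parts match, so `http://other.org` is skipped even though its scheme is allowed.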

[!NOTE] Because CompileRegex returns a *regexp.Regexp, every method of that type (not only FindAllString) is available on the result.
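
As a rough illustration of what else a *regexp.Regexp offers, the sketch below uses two other standard-library methods, `ReplaceAllString` and `FindAllStringIndex`. The variable `simpleURLPattern` is a deliberately simplified stand-in and an assumption; in real use the expression would come from `extractor.CompileRegex()`.

```go
package main

import (
	"fmt"
	"regexp"
)

// simpleURLPattern is a simplified stand-in (an assumption) for the
// expression that CompileRegex would return.
var simpleURLPattern = regexp.MustCompile(`https?://[^\s]+`)

// redactURLs replaces every match with a fixed placeholder.
func redactURLs(text string) string {
	return simpleURLPattern.ReplaceAllString(text, "[link]")
}

// urlOffsets returns the [start, end) byte offsets of every match.
func urlOffsets(text string) [][]int {
	return simpleURLPattern.FindAllStringIndex(text, -1)
}

func main() {
	text := "Docs: https://example.com/docs and http://example.org"

	fmt.Println(redactURLs(text))
	fmt.Println(urlOffsets(text))
}
```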

Domain Parsing
package main

import (
    "fmt"
    "github.com/hueristiq/hqgourl"
)

func main() {
    dp := hqgourl.NewDomainParser()

    parsedDomain := dp.Parse("subdomain.example.com")

    fmt.Printf("Subdomain: %s, Root Domain: %s, TLD: %s\n", parsedDomain.Sub, parsedDomain.Root, parsedDomain.TopLevel)
}
URL Parsing
package main

import (
    "fmt"
    "github.com/hueristiq/hqgourl"
)

func main() {
    up := hqgourl.NewURLParser()

    parsedURL, err := up.Parse("https://subdomain.example.com:8080/path/file.txt")
    if err != nil {
        fmt.Println("Error parsing URL:", err)

        return
    }

    fmt.Printf("Subdomain: %s\n", parsedURL.Domain.Sub)
    fmt.Printf("Root Domain: %s\n", parsedURL.Domain.Root)
    fmt.Printf("TLD: %s\n", parsedURL.Domain.TopLevel)
    fmt.Printf("Port: %d\n", parsedURL.Port)
    fmt.Printf("File Extension: %s\n", parsedURL.Extension)
}

Set a default scheme:

up := hqgourl.NewURLParser(hqgourl.URLParserWithDefaultScheme("https"))

Contributing

Issues and Pull Requests are welcome! Check out the contribution guidelines.

Licensing

This utility is distributed under the MIT license.

Credits

Contributors

Thanks to the amazing contributors for keeping this project alive.

Similar Projects

Thanks to these similar open source projects. Check them out; they may fit your needs.

  • DomainParser
  • urlx
  • xurls
  • goware's tldomains
  • jakewarren's tldomains

Documentation

Index

Constants

const (
	URLExtractorIPv4Pattern         = `` /* 206-byte string literal not displayed */
	URLExtractorNonEmptyIPv6Pattern = `(?:` +

		`(?:[0-9a-fA-F]{1,4}:){7}(?:[0-9a-fA-F]{1,4}|:)|` +

		`(?:[0-9a-fA-F]{1,4}:){6}(?:` + URLExtractorIPv4Pattern + `|:[0-9a-fA-F]{1,4}|:)|` +

		`(?:[0-9a-fA-F]{1,4}:){5}(?::` + URLExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,2}|:)|` +

		`(?:[0-9a-fA-F]{1,4}:){4}(?:(?::[0-9a-fA-F]{1,4}){0,1}:` + URLExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,3}|:)|` +

		`(?:[0-9a-fA-F]{1,4}:){3}(?:(?::[0-9a-fA-F]{1,4}){0,2}:` + URLExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,4}|:)|` +

		`(?:[0-9a-fA-F]{1,4}:){2}(?:(?::[0-9a-fA-F]{1,4}){0,3}:` + URLExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,5}|:)|` +

		`(?:[0-9a-fA-F]{1,4}:){1}(?:(?::[0-9a-fA-F]{1,4}){0,4}:` + URLExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,6}|:)|` +

		`:(?:(?::[0-9a-fA-F]{1,4}){0,5}:` + URLExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,7})` +
		`)`
	URLExtractorIPv6Pattern = `(?:` + URLExtractorNonEmptyIPv6Pattern + `|::)`

	URLExtractorPortPattern         = `(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-5][0-9]{3}\b)`
	URLExtractorPortOptionalPattern = URLExtractorPortPattern + `?`
)

Variables

var (
	// URLExtractorSchemePattern defines a general pattern for matching URL schemes.
	// It matches any scheme that starts with alphabetical characters followed by any combination
	// of alphabets, dots, hyphens, or pluses, and ends with "://". It also matches any scheme
	// from a predefined list that does not require authority (host), ending with ":".
	URLExtractorSchemePattern = `(?:[a-zA-Z][a-zA-Z.\-+]*://|` + anyOf(schemes.SchemesNoAuthority...) + `:)`
	// URLExtractorKnownOfficialSchemePattern defines a pattern for matching officially recognized
	// URL schemes. This includes schemes like "http", "https", "ftp", etc., and is strictly based
	// on the schemes defined in the schemes.Schemes slice, ensuring a match ends with "://".
	URLExtractorKnownOfficialSchemePattern = `(?:` + anyOf(schemes.Schemes...) + `://)`
	// URLExtractorKnownUnofficialSchemePattern defines a pattern for matching unofficial or
	// less commonly used URL schemes. Similar to the official pattern but based on the
	// schemes.SchemesUnofficial slice, it supports schemes that might not be universally recognized
	// but are valid in specific contexts, ending with "://".
	URLExtractorKnownUnofficialSchemePattern = `(?:` + anyOf(schemes.SchemesUnofficial...) + `://)`
	// URLExtractorKnownNoAuthoritySchemePattern defines a pattern for matching schemes that
	// do not require an authority (host) component. This is useful for schemes like "mailto:",
	// "tel:", and others where a host is not applicable, ending with ":".
	URLExtractorKnownNoAuthoritySchemePattern = `(?:` + anyOf(schemes.SchemesNoAuthority...) + `:)`
	// URLExtractorKnownSchemePattern combines the patterns for officially recognized,
	// unofficial, and no-authority-required schemes into one comprehensive pattern. It is
	// case-insensitive (noted by "(?i)") and designed to match a wide range of schemes, accommodating
	// the broadest possible set of URLs.
	URLExtractorKnownSchemePattern = `(?:(?i)(?:` + anyOf(schemes.Schemes...) + `|` + anyOf(schemes.SchemesUnofficial...) + `)://|` + anyOf(schemes.SchemesNoAuthority...) + `:)`
)
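
The scheme patterns above are built from string slices through an unexported helper named anyOf, whose implementation is not shown on this page. The sketch below is one plausible way such a helper could work. `anyOfSketch` and `schemeRegex` are hypothetical names, and escaping with `regexp.QuoteMeta` is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// anyOfSketch is a hypothetical stand-in for the package's unexported anyOf.
// It escapes each item and joins them into one non-capturing alternation.
func anyOfSketch(items ...string) string {
	quoted := make([]string, len(items))
	for i, item := range items {
		quoted[i] = regexp.QuoteMeta(item)
	}

	return `(?:` + strings.Join(quoted, `|`) + `)`
}

// schemeRegex builds a case-insensitive "<scheme>://" matcher from a list of
// scheme names, mirroring the shape of the known-scheme patterns.
func schemeRegex(schemes ...string) *regexp.Regexp {
	return regexp.MustCompile(`(?i)` + anyOfSketch(schemes...) + `://`)
}

func main() {
	re := schemeRegex("http", "https", "ftp")

	fmt.Println(anyOfSketch("http", "https", "ftp"))
	fmt.Println(re.MatchString("HTTPS://example.com"))
	fmt.Println(re.MatchString("gopher://example.com"))
}
```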

Functions

This section is empty.

Types

type Domain

type Domain struct {
	Sub      string
	Root     string
	TopLevel string
}

Domain struct represents the structure of a parsed domain name, including its subdomain, root domain, and top-level domain (TLD).

func (*Domain) String

func (d *Domain) String() (domain string)

String assembles the domain components back into a full domain string.
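
A minimal sketch of how such a reassembly could look follows. The type `domainParts` is a hypothetical mirror of Domain, and skipping empty components is an assumption about edge cases that this page does not specify.

```go
package main

import (
	"fmt"
	"strings"
)

// domainParts is a hypothetical mirror of the Domain struct.
type domainParts struct {
	Sub      string
	Root     string
	TopLevel string
}

// String joins the non-empty components with dots. Dropping empty parts is
// an assumption made for this sketch.
func (d domainParts) String() string {
	var labels []string

	for _, part := range []string{d.Sub, d.Root, d.TopLevel} {
		if part != "" {
			labels = append(labels, part)
		}
	}

	return strings.Join(labels, ".")
}

func main() {
	fmt.Println(domainParts{Sub: "www", Root: "example", TopLevel: "com"})
	fmt.Println(domainParts{Root: "example", TopLevel: "co.uk"})
}
```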

type DomainInterface

type DomainInterface interface {
	String() (domain string)
}

DomainInterface defines a standard interface for any domain representation.

type DomainParser

type DomainParser struct {
	// contains filtered or unexported fields
}

DomainParser encapsulates the logic for parsing full domain strings into their constituent parts: subdomains, root domains, and top-level domains (TLDs). It leverages a suffix array for efficient search and extraction of these components from a full domain string.

func NewDomainParser

func NewDomainParser(opts ...DomainParserOptionsFunc) (dp *DomainParser)

NewDomainParser creates and initializes a DomainParser with a comprehensive list of TLDs, including both standard and pseudo-TLDs. This setup ensures accurate parsing across a wide range of domain names. Additional options can be applied to customize the parser further.

func (*DomainParser) Parse

func (dp *DomainParser) Parse(domain string) (parsedDomain *Domain)

Parse takes a full domain string and splits it into its constituent parts: subdomain, root domain, and TLD. This method efficiently identifies the TLD using a suffix array and separates the remaining parts of the domain accordingly.
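
The splitting step can be sketched with the standard library alone. This version swaps the suffix array for a small map lookup, so it illustrates the idea rather than the package's implementation. `splitDomain`, `domainParts`, and the tiny `knownTLDs` set are hypothetical, and returning the whole input as the root when no TLD is recognized is a choice made only for this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// domainParts is a hypothetical mirror of the Domain struct.
type domainParts struct {
	Sub, Root, TopLevel string
}

// knownTLDs is a tiny illustrative set; the real parser ships a full list.
var knownTLDs = map[string]bool{"com": true, "uk": true, "co.uk": true}

// splitDomain tries the longest candidate suffix first, so "co.uk" wins
// over "uk". At least one label is always left over for the root.
func splitDomain(domain string) domainParts {
	labels := strings.Split(domain, ".")

	for i := 1; i < len(labels); i++ {
		tld := strings.Join(labels[i:], ".")
		if knownTLDs[tld] {
			return domainParts{
				Sub:      strings.Join(labels[:i-1], "."),
				Root:     labels[i-1],
				TopLevel: tld,
			}
		}
	}

	// No known TLD found: keep the input as the root (sketch-only choice).
	return domainParts{Root: domain}
}

func main() {
	fmt.Printf("%+v\n", splitDomain("subdomain.example.com"))
	fmt.Printf("%+v\n", splitDomain("www.example.co.uk"))
}
```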

type DomainParserInterface

type DomainParserInterface interface {
	Parse(domain string) (parsedDomain *Domain)
	// contains filtered or unexported methods
}

DomainParserInterface defines a standard interface for any DomainParser representation.

type DomainParserOptionsFunc

type DomainParserOptionsFunc func(*DomainParser)

DomainParserOptionsFunc is a function type designed for configuring a DomainParser instance. It allows for the application of customization options, such as specifying custom TLDs.
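
The option types in this package follow the functional options pattern. The sketch below shows the pattern in isolation. `parser`, `option`, `withTLDs`, and `newParser` are hypothetical names, and having the option replace the default list (rather than extend it) is an assumption.

```go
package main

import "fmt"

// parser is a hypothetical stand-in for DomainParser.
type parser struct {
	tlds []string
}

// option mirrors the shape of DomainParserOptionsFunc.
type option func(*parser)

// withTLDs returns an option that replaces the default TLD list. Whether
// the real option replaces or extends the list is an assumption here.
func withTLDs(tlds ...string) option {
	return func(p *parser) {
		p.tlds = append([]string(nil), tlds...)
	}
}

// newParser sets defaults first, then applies each option in order.
func newParser(opts ...option) *parser {
	p := &parser{tlds: []string{"com", "org", "net"}}

	for _, opt := range opts {
		opt(p)
	}

	return p
}

func main() {
	fmt.Println(newParser().tlds)
	fmt.Println(newParser(withTLDs("internal", "corp")).tlds)
}
```

Because options are applied after the defaults, a caller that passes none still gets a usable value.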

func DomainParserWithTLDs

func DomainParserWithTLDs(TLDs ...string) DomainParserOptionsFunc

DomainParserWithTLDs allows for the initialization of the DomainParser with a custom set of TLDs. This is particularly useful for applications requiring parsing of non-standard or niche TLDs.

type URL

type URL struct {
	*url.URL // Embedding the standard URL struct for base functionalities.

	Domain    *Domain
	Port      int    // Port number used in the URL.
	Extension string // File extension derived from the URL path.
}

URL extends the standard net/url URL struct with additional domain-related fields. It includes details like subdomain, root domain, and Top-Level Domain (TLD), along with standard URL components. This struct provides a comprehensive representation of a URL.

type URLExtractor

type URLExtractor struct {
	// contains filtered or unexported fields
}

URLExtractor is a struct that configures the URL extraction process. It allows specifying whether to include URL schemes and hosts in the extraction and supports custom regex patterns for these components.

func NewURLExtractor

func NewURLExtractor(opts ...URLExtractorOptionsFunc) (extractor *URLExtractor)

NewURLExtractor creates a new URLExtractor instance with optional configuration. It applies the provided options to the extractor, allowing for customized behavior.

func (*URLExtractor) CompileRegex

func (e *URLExtractor) CompileRegex() (regex *regexp.Regexp)

CompileRegex compiles a regex pattern based on the URLExtractor configuration. It dynamically constructs a regex pattern to accurately capture URLs from text, supporting various URL formats and components. The method ensures the regex captures the longest possible match for a URL, enhancing the accuracy of the extraction process.
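
This page does not say how the longest match is obtained. One standard-library mechanism that gives this behavior is `(*regexp.Regexp).Longest`, which switches matching from leftmost-first to leftmost-longest. The sketch below shows the difference; `firstMatch` is a hypothetical helper, and using `Longest` here is an assumption about the technique, not a statement about the package's code.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstMatch compiles pattern and returns its first match in text. When
// longest is true, the regexp prefers the leftmost-longest match.
func firstMatch(pattern, text string, longest bool) string {
	re := regexp.MustCompile(pattern)
	if longest {
		re.Longest()
	}

	return re.FindString(text)
}

func main() {
	text := "https://example.com"

	// Leftmost-first: the first alternative that matches wins.
	fmt.Println(firstMatch(`http|https`, text, false))

	// Leftmost-longest: the longer alternative wins.
	fmt.Println(firstMatch(`http|https`, text, true))
}
```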

type URLExtractorInterface

type URLExtractorInterface interface {
	CompileRegex() (regex *regexp.Regexp)
}

URLExtractorInterface defines the interface for URLExtractor, ensuring it implements certain methods.

type URLExtractorOptionsFunc

type URLExtractorOptionsFunc func(*URLExtractor)

URLExtractorOptionsFunc defines a function type for configuring URLExtractor instances. This approach allows for flexible and fluent configuration of the extractor.

func URLExtractorWithHost

func URLExtractorWithHost() URLExtractorOptionsFunc

URLExtractorWithHost returns an option function to include hosts in the URLs to be extracted. This can be used to ensure that only URLs with specified host components are captured.

func URLExtractorWithHostPattern

func URLExtractorWithHostPattern(pattern string) URLExtractorOptionsFunc

URLExtractorWithHostPattern returns an option function to specify a custom regex pattern for matching URL hosts. This is useful for targeting specific domain names or IP address formats.

func URLExtractorWithScheme

func URLExtractorWithScheme() URLExtractorOptionsFunc

URLExtractorWithScheme returns an option function to include URL schemes in the extraction process.

func URLExtractorWithSchemePattern

func URLExtractorWithSchemePattern(pattern string) URLExtractorOptionsFunc

URLExtractorWithSchemePattern returns an option function to specify a custom regex pattern for matching URL schemes. This allows for fine-tuned control over which schemes are considered valid.

type URLParser

type URLParser struct {
	// contains filtered or unexported fields
}

URLParser encapsulates the logic for parsing URLs with additional domain-specific information. It enhances the standard URL parsing with the extraction of subdomain, root domain, and TLD. It also handles the addition of a default scheme if one is not present in the input URL.

func NewURLParser

func NewURLParser(opts ...URLParserOptionsFunc) (up *URLParser)

NewURLParser creates a new URLParser with the given options. It initializes a DomainParser for parsing domain details and applies any additional configuration options.

func (*URLParser) DefaultScheme

func (up *URLParser) DefaultScheme() (scheme string)

DefaultScheme returns the currently set default scheme of the URLParser.

func (*URLParser) Parse

func (up *URLParser) Parse(rawURL string) (parsedURL *URL, err error)

Parse takes a raw URL string and parses it into a URL struct. It adds domain-specific details like subdomain, root domain, and TLD to the parsed URL. The method also ensures a default scheme is set if the URL does not specify one.
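
As a rough sketch of how the Port and Extension fields could be derived on top of net/url, consider the following. `enrichedURL` and `parseEnriched` are hypothetical names, the Domain field is left out for brevity, and keeping the leading dot in the extension (the behavior of `path.Ext`) is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strconv"
)

// enrichedURL is a hypothetical, reduced mirror of the URL struct.
type enrichedURL struct {
	*url.URL

	Port      int    // 0 when the URL has no explicit port.
	Extension string // Includes the leading dot, as path.Ext returns it.
}

// parseEnriched parses raw with net/url and fills in the extra fields.
func parseEnriched(raw string) (*enrichedURL, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}

	out := &enrichedURL{URL: u, Extension: path.Ext(u.Path)}

	if p := u.Port(); p != "" {
		out.Port, err = strconv.Atoi(p)
		if err != nil {
			return nil, err
		}
	}

	return out, nil
}

func main() {
	parsed, err := parseEnriched("https://subdomain.example.com:8080/path/file.txt")
	if err != nil {
		fmt.Println("Error parsing URL:", err)

		return
	}

	fmt.Println(parsed.Hostname(), parsed.Port, parsed.Extension)
}
```

The Port field shadows the promoted Port method of the embedded *url.URL, so the sketch reads the string form from `u` before the wrapper is built.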

func (*URLParser) WithDefaultScheme

func (up *URLParser) WithDefaultScheme(scheme string)

WithDefaultScheme allows setting a default scheme for the URLParser. This default scheme is used if the input URL doesn't specify a scheme.
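
The idea of a default scheme can be sketched with the standard library alone. `withDefaultScheme` and `hostOf` are hypothetical helpers, and the rule used to decide whether a scheme is missing (no "://" in the input) is a simplifying assumption that would mishandle no-authority forms such as mailto: addresses.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// withDefaultScheme prepends scheme when raw does not already carry one.
// The "://" check is a simplification made for this sketch.
func withDefaultScheme(raw, scheme string) string {
	switch {
	case strings.Contains(raw, "://"):
		return raw
	case strings.HasPrefix(raw, "//"):
		// Protocol-relative input: only the scheme name is missing.
		return scheme + ":" + raw
	default:
		return scheme + "://" + raw
	}
}

// hostOf applies the default scheme and returns the parsed host.
func hostOf(raw, scheme string) (string, error) {
	u, err := url.Parse(withDefaultScheme(raw, scheme))
	if err != nil {
		return "", err
	}

	return u.Host, nil
}

func main() {
	for _, raw := range []string{"example.com/a", "//cdn.example.com/x", "http://example.org"} {
		host, err := hostOf(raw, "https")
		fmt.Println(withDefaultScheme(raw, "https"), host, err)
	}
}
```

Without the added scheme, net/url would treat `example.com/a` as a path and leave the host empty, which is why the scheme is added before parsing.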

type URLParserInterface

type URLParserInterface interface {
	WithDefaultScheme(scheme string)

	DefaultScheme() (scheme string)

	Parse(rawURL string) (parsedURL *URL, err error)
}

URLParserInterface defines the interface for URL parsing functionality.

type URLParserOptionsFunc

type URLParserOptionsFunc func(*URLParser)

URLParserOptionsFunc defines a function type for configuring a URLParser.

func URLParserWithDefaultScheme

func URLParserWithDefaultScheme(scheme string) URLParserOptionsFunc

URLParserWithDefaultScheme returns a URLParserOptionsFunc to set a default scheme. This is useful when parsing URLs that may not have a scheme included.

Directories

Path Synopsis
generate
