PHP Class Graby\SiteConfig\SiteConfig

Each instance of this class holds extraction patterns and other directives for a website. See the ContentExtractor class for how it is used.
Author: Keyvan Minoukadeh
Open project: j0k3r/graby

Public Properties

Property Type Description
$author Use first matching element as author (0 or more xpath expressions)
$autodetect_on_failure Autodetect title/body if xpath expressions fail to produce results (bool or null if undeclared)
$body Use first matching element as body (0 or more xpath expressions)
$cache_key Cannot be set in the config files which this class represents
$date Use first matching element as date (0 or more xpath expressions)
$find_string Strings to search for in HTML before processing begins (used with $replace_string)
$http_header Additional HTTP headers to send
$login_extra_fields Extra fields to POST to the site's login form.
$login_password_field string Name of the site's login form password field. Example: password.
$login_uri string Site's login form URI, if applicable.
$login_username_field string Name of the site's login form username field. Example: username.
$next_page_link
$not_logged_in_xpath string XPath query to detect if login is requested in a page from the site.
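$parser Which parser to use for turning raw HTML into a DOMDocument, either 'libxml' or 'html5lib' (string or null if undeclared)
$prune Clean up content block by attempting to remove elements that appear to be superfluous (bool or null if undeclared)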
$replace_string Strings to replace those found in $find_string before HTML processing begins
$requires_login boolean Whether fetching the site's content requires authentication.
$single_page_link Link to a single-page version of the content. If present, that page is retrieved and the rest of the options in this config are applied to the new page.
$strip Strip elements matching these xpath expressions (0 or more)
$strip_id_or_class Strip elements which contain these strings (0 or more) in the id or class attribute
$strip_image_src Strip images which contain these strings (0 or more) in the src attribute
$test_url Test URL - if present, can be used to test the config above
$tidy Process HTML with tidy before creating DOM (bool or null if undeclared)
$title Use first matching element as title (0 or more xpath expressions)

Protected Properties

Property Type Description
$default_autodetect_on_failure
$default_parser
$default_prune
$default_tidy

Public Methods

Method Description
autodetect_on_failure ( boolean $use_default = true ) : boolean | null Autodetect title/body if xpath expressions fail to produce results.
parser ( boolean $use_default = true ) : string | null Which parser to use for turning raw HTML into a DOMDocument (either 'libxml' or 'html5lib').
prune ( boolean $use_default = true ) : boolean | null Clean up content block - attempt to remove elements that appear to be superfluous.
tidy ( boolean $use_default = true ) : boolean | null Process HTML with tidy before creating DOM (bool or null if undeclared).

Method Details

autodetect_on_failure() public method

Autodetect title/body if xpath expressions fail to produce results.
public autodetect_on_failure ( boolean $use_default = true ) : boolean | null
$use_default boolean
return boolean | null
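
The interaction between the public property, its protected default, and the $use_default flag can be illustrated with a small stand-in class. This is only a sketch under assumptions: SiteConfigSketch is a hypothetical name, the default value of true is chosen for illustration, and the fallback logic is inferred from the signature and the "null if undeclared" wording rather than copied from Graby.

```php
<?php
// Hypothetical stand-in for illustration only; not the real Graby class.
class SiteConfigSketch
{
    /** @var bool|null null means the site config did not declare a value */
    public $autodetect_on_failure = null;

    /** @var bool assumed fallback value for this sketch */
    protected $default_autodetect_on_failure = true;

    public function autodetect_on_failure($use_default = true)
    {
        // Fall back to the default only when asked to and nothing was declared.
        if ($use_default && $this->autodetect_on_failure === null) {
            return $this->default_autodetect_on_failure;
        }

        return $this->autodetect_on_failure;
    }
}

$config = new SiteConfigSketch();
var_dump($config->autodetect_on_failure());      // default is used
var_dump($config->autodetect_on_failure(false)); // undeclared value is reported as null

$config->autodetect_on_failure = false;
var_dump($config->autodetect_on_failure());      // declared value wins over the default
```

Passing false for $use_default lets a caller distinguish "explicitly disabled" from "never declared", which may matter when merging several configs.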

parser() public method

Which parser to use for turning raw HTML into a DOMDocument (either 'libxml' or 'html5lib').
public parser ( boolean $use_default = true ) : string | null
$use_default boolean
return string | null
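
As a rough illustration of what the parser value selects, the sketch below handles only the 'libxml' case using PHP's DOM extension. The function name parseHtmlSketch is hypothetical, and the 'html5lib' branch is deliberately left out because it depends on an external library not described here.

```php
<?php
// Hypothetical helper: turns raw HTML into a DOMDocument for the 'libxml' parser only.
function parseHtmlSketch(string $html, string $parser = 'libxml'): DOMDocument
{
    if ($parser === 'libxml') {
        $doc = new DOMDocument();
        // Collect parse warnings internally instead of emitting them.
        $previous = libxml_use_internal_errors(true);
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        return $doc;
    }

    throw new InvalidArgumentException("Parser '$parser' is not covered by this sketch");
}

$doc = parseHtmlSketch('<html><body><h1>Hello</h1></body></html>');
echo $doc->getElementsByTagName('h1')->item(0)->textContent, PHP_EOL;
```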

prune() public method

Clean up content block - attempt to remove elements that appear to be superfluous.
public prune ( boolean $use_default = true ) : boolean | null
$use_default boolean
return boolean | null

tidy() public method

Process HTML with tidy before creating DOM (bool or null if undeclared).
public tidy ( boolean $use_default = true ) : boolean | null
$use_default boolean
return boolean | null

Property Details

$author public property

Use first matching element as author (0 or more xpath expressions)
public $author

$autodetect_on_failure public property

Autodetect title/body if xpath expressions fail to produce results (bool or null if undeclared)
public $autodetect_on_failure

$body public property

Use first matching element as body (0 or more xpath expressions)
public $body
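
The "use first matching element" rule shared by $body, $title, $author and $date can be sketched with DOMXPath. The function name firstMatchSketch and the sample expressions are hypothetical; how the real extractor evaluates the expressions may differ.

```php
<?php
// Hypothetical helper: try each XPath expression in order, return the first node found.
function firstMatchSketch(DOMDocument $doc, array $expressions): ?DOMNode
{
    $xpath = new DOMXPath($doc);
    foreach ($expressions as $expression) {
        // query() returns false for an invalid expression; such entries are skipped.
        $nodes = @$xpath->query($expression);
        if ($nodes instanceof DOMNodeList && $nodes->length > 0) {
            return $nodes->item(0);
        }
    }

    return null; // zero expressions, or none of them matched
}

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML('<html><body><div id="content"><p>Article text</p></div></body></html>');

// The first expression matches nothing, so the second one is used.
$body = firstMatchSketch($doc, ['//article', "//div[@id='content']"]);
echo $body !== null ? trim($body->textContent) : 'no match', PHP_EOL;
```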

$cache_key public property

Cannot be set in the config files which this class represents
public $cache_key

$date public property

Use first matching element as date (0 or more xpath expressions)
public $date

$default_autodetect_on_failure protected property

protected $default_autodetect_on_failure

$default_parser protected property

protected $default_parser

$default_prune protected property

protected $default_prune

$default_tidy protected property

protected $default_tidy

$find_string public property

Strings to search for in HTML before processing begins (used with $replace_string)
public $find_string

$http_header public property

Additional HTTP headers to send
public $http_header

$login_extra_fields public property

Extra fields to POST to the site's login form.
public $login_extra_fields

$login_password_field public property

Name of the site's login form password field. Example: password.
public string $login_password_field
return string

$login_uri public property

Site's login form URI, if applicable.
public string $login_uri
return string

$login_username_field public property

Name of the site's login form username field. Example: username.
public string $login_username_field
return string
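
To show how the login-related properties might fit together, here is a sketch that assembles the form fields to POST to $login_uri. The function name buildLoginPayloadSketch is hypothetical, and the "name=value" string format for the extra fields is an assumption made for this example, not something this page specifies.

```php
<?php
// Hypothetical helper: build the POST fields for a site's login form.
function buildLoginPayloadSketch(
    string $usernameField,
    string $passwordField,
    array $extraFields,
    string $username,
    string $password
): array {
    $payload = [
        $usernameField => $username,
        $passwordField => $password,
    ];

    // Assumption: each extra field is a "name=value" string; malformed entries are ignored.
    foreach ($extraFields as $field) {
        $parts = explode('=', $field, 2);
        if (count($parts) === 2) {
            $payload[trim($parts[0])] = trim($parts[1]);
        }
    }

    return $payload;
}

$payload = buildLoginPayloadSketch('username', 'password', ['remember=1'], 'alice', 'secret');
echo http_build_query($payload), PHP_EOL;
```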

$not_logged_in_xpath public property

XPath query to detect if login is requested in a page from the site.
public string $not_logged_in_xpath
return string

$parser public property

Which parser to use for turning raw HTML into a DOMDocument, either 'libxml' or 'html5lib' (string or null if undeclared)
public $parser

$prune public property

Clean up content block by attempting to remove elements that appear to be superfluous (bool or null if undeclared)
public $prune

$replace_string public property

Strings to replace those found in $find_string before HTML processing begins
public $replace_string
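
The pairing of $find_string and $replace_string amounts to a plain string substitution on the raw HTML before it is parsed. A minimal sketch, assuming both properties are parallel lists of strings; the function name applyFindReplaceSketch and the decision to skip mismatched lists are assumptions of this example.

```php
<?php
// Hypothetical helper: apply find/replace pairs to raw HTML before DOM processing.
function applyFindReplaceSketch(string $html, array $find, array $replace): string
{
    // Assumption: lists of different length are treated as a config error and ignored.
    if (count($find) !== count($replace)) {
        return $html;
    }

    // str_replace pairs $find[i] with $replace[i] when both arguments are arrays.
    return str_replace($find, $replace, $html);
}

$html = '<div class="lead"><noscript><img src="a.png"></noscript></div>';
echo applyFindReplaceSketch($html, ['<noscript>', '</noscript>'], ['', '']), PHP_EOL;
```

Doing this on the raw string rather than on the DOM is useful for markup that a parser would otherwise drop or rearrange.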

$requires_login public property

Whether fetching the site's content requires authentication.
public bool $requires_login
return boolean

$strip public property

Strip elements matching these xpath expressions (0 or more)
public $strip

$strip_id_or_class public property

Strip elements which contain these strings (0 or more) in the id or class attribute
public $strip_id_or_class
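
One way to read this rule is as an XPath contains() test on the id and class attributes. The sketch below follows that reading; the function name stripByIdOrClassSketch is hypothetical, and removing quote characters from the search strings is a simplification chosen for this example.

```php
<?php
// Hypothetical helper: remove elements whose id or class contains any of the given strings.
function stripByIdOrClassSketch(DOMDocument $doc, array $needles): void
{
    $xpath = new DOMXPath($doc);
    foreach ($needles as $needle) {
        // Drop quotes so the string can be embedded safely in the XPath literal.
        $needle = str_replace(['"', "'"], '', $needle);
        $query = "//*[contains(@class, '" . $needle . "') or contains(@id, '" . $needle . "')]";

        // Copy the matches into an array before removing nodes from the tree.
        foreach (iterator_to_array($xpath->query($query)) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }
}

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>Keep me</p><div class="share-buttons">Share</div></body></html>');
stripByIdOrClassSketch($doc, ['share']);
echo trim($doc->getElementsByTagName('body')->item(0)->textContent), PHP_EOL;
```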

$strip_image_src public property

Strip images which contain these strings (0 or more) in the src attribute
public $strip_image_src

$test_url public property

Test URL - if present, can be used to test the config above
public $test_url

$tidy public property

Process HTML with tidy before creating DOM (bool or null if undeclared)
public $tidy

$title public property

Use first matching element as title (0 or more xpath expressions)
public $title