PHP Class Graby\Extractor\ContentExtractor

Uses patterns specified in site config files and auto detection (hNews/PHP Readability) to extract content from HTML files.
Project: j0k3r/graby

Public Properties

Property Type Description
$readability

Public Methods

Method Description
__construct ( array $config = [], Psr\Log\LoggerInterface $logger = null, ConfigBuilder $configBuilder = null )
buildSiteConfig ( string $url, string $html = '', boolean $addToCache = true ) : SiteConfig Returns SiteConfig instance (joined in order: exact match, wildcard, fingerprint, global, default).
findHostUsingFingerprints ( string $html ) : string | false Tries to find a host based on a meta tag that may be present in the HTML.
getContent ( )
getLanguage ( )
getNextPageUrl ( )
getSiteConfig ( )
getTitle ( )
process ( string $html, string $url, SiteConfig $siteConfig = null, boolean $smartTidy = true ) : boolean $smartTidy indicates that if tidy is used and no results are produced, we will try again without it.
reset ( )
setLogger ( Psr\Log\LoggerInterface $logger )

Private Methods

Method Description
extractBody ( boolean $detectBody, string $xpathExpression, DOMNode $node, string $type ) : boolean Extracts the body from a node using the given XPath expression.
extractTitle ( boolean $detectTitle, string $cssClass, DOMNode $node, string $logMessage ) : boolean Extracts the title from a node for a given CSS class.
hasElements ( DOMNodeList $elems ) : boolean Checks whether the given node list exists and contains at least one element.
removeElements ( DOMNodeList $elems, string $logMessage = null ) Remove elements.

Method Details

__construct() public method

public __construct ( array $config = [], Psr\Log\LoggerInterface $logger = null, ConfigBuilder $configBuilder = null )
$config array
$logger Psr\Log\LoggerInterface
$configBuilder Graby\SiteConfig\ConfigBuilder

buildSiteConfig() public method

Returns SiteConfig instance (joined in order: exact match, wildcard, fingerprint, global, default).
public buildSiteConfig ( string $url, string $html = '', boolean $addToCache = true ) : SiteConfig
$url string
$html string
$addToCache boolean
return Graby\SiteConfig\SiteConfig
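
The join order above can be illustrated with a minimal sketch. It assumes the simplest precedence rule, where the first layer to define a key wins; the actual SiteConfig merge may combine values differently. The function name joinConfigLayers and the sample keys and XPath values are hypothetical and are not part of Graby.

```php
<?php
// Hypothetical illustration of "joined in order"; not Graby's implementation.

/**
 * Join config layers so that earlier (more specific) layers take precedence.
 *
 * @param array<int, array<string, mixed>> $layers Layers ordered from most to least specific
 *
 * @return array<string, mixed>
 */
function joinConfigLayers(array $layers): array
{
    $merged = [];
    foreach ($layers as $layer) {
        // The array union operator keeps keys already present on the left,
        // so a later (less specific) layer only fills in missing keys.
        $merged += $layer;
    }

    return $merged;
}

// Sample layers in the documented order (all values are made up).
$layers = [
    ['title' => '//h1[@class="headline"]'],        // exact match
    [],                                            // wildcard (nothing found)
    ['body' => '//div[@class="entry-content"]'],   // fingerprint
    ['title' => '//title', 'tidy' => false],       // global
    ['tidy' => true, 'prune' => true],             // default
];

$config = joinConfigLayers($layers);
```

In this sketch the exact-match title overrides the global one, and the default layer only contributes the keys that no earlier layer defined.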

findHostUsingFingerprints() public method

This makes it possible to determine whether a website is generated by WordPress, Blogger, etc.
public findHostUsingFingerprints ( string $html ) : string | false
$html string
return string | false
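
A fingerprint lookup of this kind can be sketched as a case-insensitive substring search over the HTML. The function name findHostByFingerprint, the fingerprint map and the host names below are assumptions made for illustration; the real method reads its fingerprints from the extractor configuration, which is not shown on this page.

```php
<?php
// Hypothetical sketch of fingerprint detection; not the library's code.

/**
 * Return the host associated with the first fingerprint found in the HTML.
 *
 * @param string                $html         Raw HTML of the page
 * @param array<string, string> $fingerprints Map of HTML fragment => host name (assumed shape)
 *
 * @return string|false Host name, or false when no fingerprint matches
 */
function findHostByFingerprint(string $html, array $fingerprints)
{
    foreach ($fingerprints as $fragment => $host) {
        // stripos() returns 0 for a match at the start, so compare strictly.
        if (stripos($html, $fragment) !== false) {
            return $host;
        }
    }

    return false;
}

// Made-up fingerprints keyed on generator meta tags.
$fingerprints = [
    '<meta name="generator" content="WordPress' => 'fingerprint.wordpress.com',
    '<meta content="blogger" name="generator"' => 'fingerprint.blogspot.com',
];

$html = '<html><head><meta name="generator" content="WordPress 5.8"></head></html>';
$host = findHostByFingerprint($html, $fingerprints);
```

The strict comparison against false mirrors the documented string | false return type.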

getContent() public method

public getContent ( )

getLanguage() public method

public getLanguage ( )

getNextPageUrl() public method

public getNextPageUrl ( )

getSiteConfig() public method

public getSiteConfig ( )

getTitle() public method

public getTitle ( )

process() public method

Tidy helps us deal with PHP's patchy HTML parsing most of the time, but it has problems of its own, which we try to avoid with the $smartTidy option.
public process ( string $html, string $url, SiteConfig $siteConfig = null, boolean $smartTidy = true ) : boolean
$html string
$url string
$siteConfig Graby\SiteConfig\SiteConfig If provided, the site config is not recalculated
$smartTidy boolean Whether to tidy the HTML
return boolean true on success, false on failure
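
The retry behaviour described for $smartTidy can be sketched as follows. This is an illustration under stated assumptions rather than the method's actual body: extractWithSmartTidy and both callables are hypothetical stand-ins for the tidy step and the extraction step.

```php
<?php
// Hypothetical sketch of the "retry without tidy" idea; not the library's code.

/**
 * Run extraction on tidied HTML and fall back to the raw HTML if nothing is found.
 *
 * @param string   $html      Raw HTML
 * @param callable $extract   function (string $html): ?string, null when nothing is found
 * @param callable $tidy      function (string $html): string, returns cleaned HTML
 * @param bool     $smartTidy Whether to retry without tidy when extraction fails
 */
function extractWithSmartTidy(string $html, callable $extract, callable $tidy, bool $smartTidy = true): ?string
{
    $content = $extract($tidy($html));

    if ($content === null && $smartTidy) {
        // Tidy may have damaged the markup, so retry on the untouched HTML.
        $content = $extract($html);
    }

    return $content;
}

// Stand-in extractor: returns the contents of the first <article> element.
$extract = function (string $html): ?string {
    return preg_match('~<article>(.*?)</article>~s', $html, $m) === 1 ? $m[1] : null;
};

// Stand-in for a tidy pass that breaks the page by dropping <article> tags.
$brokenTidy = function (string $html): string {
    return str_replace(['<article>', '</article>'], '', $html);
};

$withRetry = extractWithSmartTidy('<article>Hello</article>', $extract, $brokenTidy);
$withoutRetry = extractWithSmartTidy('<article>Hello</article>', $extract, $brokenTidy, false);
```

With the retry enabled, the content is recovered from the raw HTML; with it disabled, the failed first attempt is final.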

reset() public method

public reset ( )

setLogger() public method

public setLogger ( Psr\Log\LoggerInterface $logger )
$logger Psr\Log\LoggerInterface

Property Details

$readability public property

public $readability