PHP Class Graby\Extractor\ContentExtractor

Uses patterns specified in site config files and auto detection (hNews/PHP Readability) to extract content from HTML files.
Project: j0k3r/graby

Public Properties

Property Type Description
$readability

Public Methods

Method Description
__construct ( array $config = [], Psr\Log\LoggerInterface $logger = null, ConfigBuilder $configBuilder = null )
buildSiteConfig ( string $url, string $html = '', boolean $addToCache = true ) : SiteConfig Returns SiteConfig instance (joined in order: exact match, wildcard, fingerprint, global, default).
findHostUsingFingerprints ( string $html ) : string | false Tries to find a host based on a meta tag that may be present in the HTML.
getContent ( )
getLanguage ( )
getNextPageUrl ( )
getSiteConfig ( )
getTitle ( )
process ( string $html, string $url, SiteConfig $siteConfig = null, boolean $smartTidy = true ) : boolean $smartTidy indicates that if tidy is used and no results are produced, we will try again without it.
reset ( )
setLogger ( Psr\Log\LoggerInterface $logger )

Private Methods

Method Description
extractBody ( boolean $detectBody, string $xpathExpression, DOMNode $node, string $type ) : boolean Extracts the body from a node using the given XPath expression.
extractTitle ( boolean $detectTitle, string $cssClass, DOMNode $node, string $logMessage ) : boolean Extracts the title from a node for a given CSS class.
hasElements ( DOMNodeList $elems ) : boolean Checks whether the given node list exists and has a length greater than 0.
removeElements ( DOMNodeList $elems, string $logMessage = null ) Remove elements.

Method Details

__construct() public method

public __construct ( array $config = [], Psr\Log\LoggerInterface $logger = null, ConfigBuilder $configBuilder = null )
$config array
$logger Psr\Log\LoggerInterface
$configBuilder Graby\SiteConfig\ConfigBuilder

buildSiteConfig() public method

Returns SiteConfig instance (joined in order: exact match, wildcard, fingerprint, global, default).
public buildSiteConfig ( string $url, string $html = '', boolean $addToCache = true ) : SiteConfig
$url string
$html string
$addToCache boolean
Returns Graby\SiteConfig\SiteConfig
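
The join order described above (exact match, wildcard, fingerprint, global, default) can be illustrated with a small sketch in plain PHP. It does not use Graby's own classes. The helper `joinConfigs()` and the array keys are hypothetical, and the merge rule (lists are concatenated in order, the first scalar value wins) is an assumption about what "joined" means, not a statement of the library's actual behaviour.

```php
<?php
/**
 * Hypothetical helper: joins config layers given from highest to lowest priority.
 * Assumed rule: list values are concatenated in order, scalar values keep the
 * first (highest-priority) definition.
 */
function joinConfigs(array ...$layers): array
{
    $merged = [];
    foreach ($layers as $layer) {
        foreach ($layer as $key => $value) {
            if (is_array($value)) {
                // Append lower-priority rules after the ones already collected.
                $existing = (array) ($merged[$key] ?? []);
                $merged[$key] = array_values(array_unique(array_merge($existing, $value)));
            } elseif (!array_key_exists($key, $merged)) {
                // Scalars: an earlier layer has precedence over later ones.
                $merged[$key] = $value;
            }
        }
    }

    return $merged;
}

// Illustrative layers only; real site configs contain other directives.
$exact  = ['title' => ['//h1'], 'tidy' => false];
$global = ['title' => ['//title'], 'tidy' => true, 'prune' => true];
$joined = joinConfigs($exact, $global);
```

With these inputs, `$joined['title']` lists the exact-match rule before the global one, and `$joined['tidy']` keeps the exact-match value `false`.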

findHostUsingFingerprints() public method

Tries to find a host based on a meta tag that may be present in the HTML. This makes it possible to determine whether a website is generated with WordPress, Blogger, etc.
public findHostUsingFingerprints ( string $html ) : string | false
$html string
Returns string | false
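
As a rough illustration of fingerprint detection, the sketch below looks for a `<meta name="generator">` tag with PHP's DOM extension and maps its content to a host name. It is a minimal sketch and not Graby's implementation: the function `guessHostFromGenerator()`, the `FINGERPRINTS` table and the host names in it are assumptions made for the example.

```php
<?php
// Hypothetical fingerprint table: lower-cased generator prefix => host name.
// The values are illustrative assumptions, not Graby's real configuration.
const FINGERPRINTS = [
    'wordpress' => 'fingerprint.wordpress.com',
    'blogger'   => 'fingerprint.blogspot.com',
];

/**
 * Hypothetical helper mirroring the documented return type.
 *
 * @return string|false the matched host, or false when nothing matches
 */
function guessHostFromGenerator(string $html)
{
    if ('' === trim($html)) {
        return false;
    }

    $doc = new DOMDocument();
    // Real-world markup is often malformed, so keep libxml warnings internal.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//meta[@name="generator"]/@content') as $attr) {
        $generator = strtolower($attr->nodeValue);
        foreach (FINGERPRINTS as $prefix => $host) {
            if (0 === strpos($generator, $prefix)) {
                return $host;
            }
        }
    }

    return false;
}
```

Returning `false` rather than `null` when nothing matches follows the `string | false` signature shown above, so callers should compare with `===`.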

getContent() public method

public getContent ( )

getLanguage() public method

public getLanguage ( )

getNextPageUrl() public method

public getNextPageUrl ( )

getSiteConfig() public method

public getSiteConfig ( )

getTitle() public method

public getTitle ( )

process() public method

$smartTidy indicates that if tidy is used and no results are produced, we will try again without it. Tidy helps us deal with PHP's patchy HTML parsing most of the time, but it has problems of its own, which this option tries to avoid.
public process ( string $html, string $url, SiteConfig $siteConfig = null, boolean $smartTidy = true ) : boolean
$html string
$url string
$siteConfig Graby\SiteConfig\SiteConfig Passing it avoids recalculating the site config
$smartTidy boolean Whether to tidy the HTML
Returns boolean true on success, false on failure
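
The retry behaviour behind `$smartTidy` can be sketched independently of Graby. In the example below, `extractWithSmartTidy()` and both callables are hypothetical stand-ins: `$tidy` represents whatever cleans the markup and `$extract` represents the extraction step, returning `null` when it finds nothing. This only shows the control flow described above and is not the library's code.

```php
<?php
/**
 * Hypothetical sketch of a "smart tidy" retry.
 *
 * @param callable $tidy    assumed cleaner: string -> string
 * @param callable $extract assumed extractor: string -> ?string (null when nothing is found)
 */
function extractWithSmartTidy(string $html, callable $tidy, callable $extract, bool $smartTidy = true): ?string
{
    // First attempt: run the extraction on tidied markup.
    $result = $extract($tidy($html));

    if (null === $result && $smartTidy) {
        // Tidy may have damaged the markup, so retry on the untouched input.
        $result = $extract($html);
    }

    return $result;
}

// Deliberately broken cleaner that discards everything, to trigger the retry.
$brokenTidy = function (string $html): string {
    return '';
};
// Toy extractor: returns the trimmed text, or null for empty input.
$extract = function (string $html): ?string {
    $text = trim(strip_tags($html));

    return '' === $text ? null : $text;
};

$withRetry    = extractWithSmartTidy('<p>Hello</p>', $brokenTidy, $extract);
$withoutRetry = extractWithSmartTidy('<p>Hello</p>', $brokenTidy, $extract, false);
```

Here `$withRetry` is `'Hello'` because the second pass runs on the original markup, while `$withoutRetry` stays `null`.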

reset() public method

public reset ( )

setLogger() public method

public setLogger ( Psr\Log\LoggerInterface $logger )
$logger Psr\Log\LoggerInterface

Property Details

$readability public property

public $readability