PHP Class Graby\Extractor\ContentExtractor

Uses patterns specified in site config files and auto detection (hNews/PHP Readability) to extract content from HTML files.

Public Properties

Property Type Description
$readability

Public Methods

Method Description
__construct ( array $config = [], Psr\Log\LoggerInterface $logger = null, ConfigBuilder $configBuilder = null )
buildSiteConfig ( string $url, string $html = '', boolean $addToCache = true ) : SiteConfig Returns SiteConfig instance (joined in order: exact match, wildcard, fingerprint, global, default).
findHostUsingFingerprints ( string $html ) : string | false Tries to find a host based on a meta tag that may be present in the HTML.
getContent ( )
getLanguage ( )
getNextPageUrl ( )
getSiteConfig ( )
getTitle ( )
process ( string $html, string $url, SiteConfig $siteConfig = null, boolean $smartTidy = true ) : boolean $smartTidy indicates that if tidy is used and no results are produced, we will try again without it.
reset ( )
setLogger ( Psr\Log\LoggerInterface $logger )

Private Methods

Method Description
extractBody ( boolean $detectBody, string $xpathExpression, DOMNode $node, string $type ) : boolean Extracts the body from a node using the given XPath expression.
extractTitle ( boolean $detectTitle, string $cssClass, DOMNode $node, string $logMessage ) : boolean Extracts the title from a node for the given CSS class.
hasElements ( DOMNodeList $elems ) : boolean Check if given node list exists and has length more than 0.
removeElements ( DOMNodeList $elems, string $logMessage = null ) Remove elements.
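
The two node-list helpers above can be pictured with PHP's built-in DOM extension. This is a minimal sketch, not Graby's actual implementation: the function names mirror the private methods, but the bodies are assumptions and the optional log message parameter is left out.

```php
<?php

declare(strict_types=1);

/**
 * Sketch of hasElements(): true when the list exists and is non-empty.
 * DOMXPath::query() returns false on a malformed expression, hence the
 * loose parameter type.
 *
 * @param DOMNodeList|false|null $elems
 */
function hasElements($elems): bool
{
    return $elems instanceof DOMNodeList && $elems->length > 0;
}

/**
 * Sketch of removeElements(): detach every node in the list from its parent.
 *
 * @param DOMNodeList|false|null $elems
 */
function removeElements($elems): void
{
    if (!hasElements($elems)) {
        return;
    }

    // Copy the list into an array first so removal cannot disturb iteration.
    foreach (iterator_to_array($elems) as $node) {
        if (null !== $node->parentNode) {
            $node->parentNode->removeChild($node);
        }
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>Keep</p><div class="ad">Drop</div></body></html>');
$xpath = new DOMXPath($doc);

removeElements($xpath->query('//div[@class="ad"]'));

var_dump(hasElements($xpath->query('//div[@class="ad"]'))); // bool(false)
var_dump(hasElements($xpath->query('//p')));                // bool(true)
```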

Method Details

__construct() public method

public __construct ( array $config = [], Psr\Log\LoggerInterface $logger = null, ConfigBuilder $configBuilder = null )
$config array
$logger Psr\Log\LoggerInterface
$configBuilder Graby\SiteConfig\ConfigBuilder

buildSiteConfig() public method

Returns SiteConfig instance (joined in order: exact match, wildcard, fingerprint, global, default).
public buildSiteConfig ( string $url, string $html = '', boolean $addToCache = true ) : SiteConfig
$url string
$html string
$addToCache boolean
Returns Graby\SiteConfig\SiteConfig

findHostUsingFingerprints() public method

This makes it possible to determine whether a website was generated with WordPress, Blogger, etc.
public findHostUsingFingerprints ( string $html ) : string | false
$html string
Returns string | false
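
To illustrate the idea of fingerprint detection, here is a minimal, self-contained sketch. It is not the library's code: the function name `findHostByFingerprint`, the shape of the `$fingerprints` map, and the host names in it are all assumptions made for the example. It only shows the general technique of looking for a known HTML fragment (such as a generator meta tag) and mapping it to a host.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for findHostUsingFingerprints().
 *
 * @param array<string, string> $fingerprints HTML fragment => host to use
 *
 * @return string|false the matching host, or false when nothing matches
 */
function findHostByFingerprint(string $html, array $fingerprints)
{
    foreach ($fingerprints as $fragment => $host) {
        // Case-insensitive search for the fragment anywhere in the markup.
        if (false !== stripos($html, $fragment)) {
            return $host;
        }
    }

    return false;
}

// Illustrative map; the fragments and host names are assumptions.
$fingerprints = [
    '<meta name="generator" content="WordPress' => 'fingerprint.wordpress.com',
    '<meta name="generator" content="Blogger"' => 'fingerprint.blogspot.com',
];

$html = '<html><head><meta name="generator" content="WordPress 6.4"></head><body></body></html>';

echo findHostByFingerprint($html, $fingerprints), "\n"; // prints fingerprint.wordpress.com
```

Returning `false` rather than an empty string keeps the "no match" case distinct, which matches the documented `string | false` return type.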

getContent() public method

public getContent ( )

getLanguage() public method

public getLanguage ( )

getNextPageUrl() public method

public getNextPageUrl ( )

getSiteConfig() public method

public getSiteConfig ( )

getTitle() public method

public getTitle ( )

process() public method

Tidy helps us deal with PHP's patchy HTML parsing most of the time but it has problems of its own which we try to avoid with this option.
public process ( string $html, string $url, SiteConfig $siteConfig = null, boolean $smartTidy = true ) : boolean
$html string
$url string
$siteConfig Graby\SiteConfig\SiteConfig If provided, the site config is not recalculated
$smartTidy boolean Whether to retry without tidy when tidying the HTML produces no results
Returns boolean true on success, false on failure
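
The `$smartTidy` fallback described above can be sketched as a small retry pattern. This is an illustration under stated assumptions, not the code of `process()`: the helper `extractWithSmartTidy` and both callbacks are hypothetical, and the "tidy" step here is a deliberately broken stand-in so the fallback is visible.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper illustrating the $smartTidy fallback.
 *
 * @param callable(string): string  $tidy    cleans up the markup
 * @param callable(string): ?string $extract returns the content, or null when nothing is found
 */
function extractWithSmartTidy(string $html, callable $tidy, callable $extract, bool $smartTidy = true): ?string
{
    // First attempt: work on the tidied markup.
    $content = $extract($tidy($html));

    if (null !== $content || !$smartTidy) {
        return $content;
    }

    // Tidying produced nothing usable, so retry on the untouched markup.
    return $extract($html);
}

// Stand-ins for demonstration: a "tidy" step that mangles the article tag,
// and an extractor that looks for that tag.
$brokenTidy = static fn (string $html): string => str_replace('article', 'div', $html);
$extract = static function (string $html): ?string {
    return preg_match('~<article>(.*?)</article>~s', $html, $m) ? $m[1] : null;
};

$html = '<article>Hello</article>';

var_dump(extractWithSmartTidy($html, $brokenTidy, $extract));        // string(5) "Hello"
var_dump(extractWithSmartTidy($html, $brokenTidy, $extract, false)); // NULL
```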

reset() public method

public reset ( )

setLogger() public method

public setLogger ( Psr\Log\LoggerInterface $logger )
$logger Psr\Log\LoggerInterface

Property Details

$readability public property

public $readability