PHP Class Readability\Readability

Differences between the PHP port and the original

Arc90's Readability is designed to run in the browser. It works on the DOM tree (the parsed HTML) after the page's CSS styles have been applied and JavaScript code executed. This PHP port does not run inside a browser. We use PHP's ability to parse HTML to build our DOM tree, but we cannot rely on CSS or JavaScript support. As such, the results will not always match Arc90's Readability. (For example, if a web page contains CSS style rules or JavaScript code which hide certain HTML elements from display, Arc90's Readability will dismiss those from consideration but our PHP port, unable to understand CSS or JavaScript, will not know any better.)

Another significant difference is that the aim of Arc90's Readability is to re-present the main content block of a given web page so users can read it more easily in their browsers. Correct identification, cleanup, and separation of the content block is only a part of this process. This PHP port is concerned only with this part; it does not include code that relates to presentation in the browser. Arc90 already do that extremely well, and for PDF output there's FiveFilters.org's PDF Newspaper: http://fivefilters.org/pdf-newspaper/.

Finally, this class contains methods that might be useful for developers working on HTML document fragments. So without deviating too much from the original code (which I don't want to do because it makes debugging and updating more difficult), I've tried to make it a little more developer friendly. You should be able to use the methods here on existing DOMElement objects without passing an entire HTML document to be parsed.

Public Properties

Property	Type	Description
$articleContent
$articleTitle
$convertLinksToFootnotes
$debug		No longer used; kept to avoid a BC break
$dom
$lightClean		preserves more content (experimental)
$original_html
$regexps		Defined up here so we don't instantiate them repeatedly in loops.
$revertForcedParagraphElements
$tidied
$tidy_config
$url		optional - URL where HTML was retrieved

Protected Properties

Property	Type	Description
$body
$bodyCache		Cache the body HTML in case we need to re-use it later
$domainRegExp		article domain regexp for calibration
$flags		1 | 2 | 4; // Start with all processing flags set.
$html
$logger
$parser
$post_filters		output HTML filters
$pre_filters		raw HTML filters
$success		indicates whether we were able to extract or not
$useTidy

Public Methods

Method	Description
__construct ( $html, $url = null, $parser = 'libxml', $use_tidy = true )	Create instance of Readability.
addFlag ( integer $flag )	Add a flag.
addFootnotes ( DOMElement $articleContent )	For easier reading, convert this document to have footnotes at the bottom rather than inline links.
addPostFilter ( $filter, $replacer = '' )	Add post filter for raw output HTML processing.
addPreFilter ( $filter, $replacer = '' )	Add pre filter for raw input HTML processing.
clean ( DOMElement $e, string $tag )	Clean a node of all elements of type "tag".
cleanConditionally ( DOMElement $e, string $tag )	Clean an element of all tags of type "tag" if they look fishy.
cleanHeaders ( DOMElement $e )	Clean out spurious headers from an Element. Checks things like classnames and link density.
cleanStyles ( DOMElement $e )	Remove the style attribute on every $e and under.
flagIsActive ( integer $flag ) : boolean	Check if the given flag is active.
getCommaCount ( string $text ) : integer	Get the number of commas in the given text.
getContent ( ) : DOMElement	Get article content element.
getInnerText ( DOMElement $e, boolean $normalizeSpaces = true, boolean $flattenLines = false ) : string	Get the inner text of a node.
getLinkDensity ( DOMElement $e, string $excludeExternal = false ) : integer	Get the density of links as a percentage of the content. This is the amount of text that is inside a link divided by the total text in the node.
getTitle ( ) : DOMElement	Get article title element.
getWeight ( DOMElement $e ) : integer	Get an element's relative weight.
getWordCount ( string $text ) : integer	Get the number of words in the given text, assuming words are separated by spaces.
init ( ) : boolean	Runs readability.
killBreaks ( DOMElement $node )	Remove extraneous break tags from a node.
postProcessContent ( DOMElement $articleContent )	Run any post-process modifications to article content as necessary.
prepArticle ( DOMElement $articleContent )	Prepare the article node for display. Clean out any inline styles, iframes, forms, strip extraneous tags, etc.
removeFlag ( integer $flag )	Remove a flag.
setLogger ( Psr\Log\LoggerInterface $logger )

Protected Methods

Method	Description
dbg ( $msg )	Debug.
dump_dbg ( )	Dump debug info.
getArticleTitle ( ) : DOMElement	Get the article title as an H1.
grabArticle ( DOMElement $page = null ) : DOMElement | boolean	Using a variety of metrics (content score, classname, element types), find the content that is most likely to be the stuff a user wants to read. Then return it wrapped up in a div.
initializeNode ( DOMElement $node )	Initialize a node with the readability object. Also checks the className/id for special names to add to its score.
prepDocument ( )	Prepare the HTML document for readability to scrape it.
reinitBody ( )	Will recreate previously deleted body property.
weightAttribute ( DOMElement $element, string $attribute ) : integer	Get an element weight by attribute.

Private Methods

Method	Description
loadHtml ( )	Load HTML in a DOMDocument.

Method Details

__construct() public method

Create instance of Readability.

public __construct ( $html, $url = null, $parser = 'libxml', $use_tidy = true )

addFlag() public method

Add a flag.

public addFlag ( integer $flag )
$flag	integer

addFootnotes() public method

For easier reading, convert this document to have footnotes at the bottom rather than inline links.

See also: http://www.roughtype.com/archives/2010/05/experiments_in.php

public addFootnotes ( DOMElement $articleContent )
$articleContent	DOMElement

addPostFilter() public method

Add post filter for raw output HTML processing.

public addPostFilter ( $filter, $replacer = '' )

addPreFilter() public method

Add pre filter for raw input HTML processing.

public addPreFilter ( $filter, $replacer = '' )
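
The `$filter` and `$replacer` parameter names suggest a search pattern and its replacement, but this page does not say how they are applied. The following is a minimal sketch of that idea, assuming each filter is a PCRE pattern run through `preg_replace()`; `applyFilters()` and the two patterns are hypothetical and not part of the class.

```php
<?php
// Hypothetical helper: applies pattern => replacement pairs to raw HTML.
// Assumes each filter is a PCRE pattern, which this reference does not confirm.
function applyFilters(string $html, array $filters): string
{
    foreach ($filters as $pattern => $replacement) {
        $html = preg_replace($pattern, $replacement, $html);
    }

    return $html;
}

$filters = [
    '!<script\b[^>]*>.*?</script>!is' => '', // drop inline scripts
    '!</?font\b[^>]*>!i' => '',               // drop legacy <font> tags
];

$cleaned = applyFilters('<p><font size="2">Hi</font></p><script>x()</script>', $filters);
echo $cleaned, "\n"; // prints "<p>Hi</p>"
```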

clean() public method

Clean a node of all elements of type "tag" (unless it's a YouTube/Vimeo video; people love movies). Updated 2012-09-18 to preserve YouTube/Vimeo iframes.

public clean ( DOMElement $e, string $tag )
$e	DOMElement
$tag	string

cleanConditionally() public method

Clean an element of all tags of type "tag" if they look fishy. "Fishy" is an algorithm based on content length, classnames, link density, number of images and embeds, etc.

public cleanConditionally ( DOMElement $e, string $tag )
$e	DOMElement
$tag	string

cleanHeaders() public method

Clean out spurious headers from an Element. Checks things like classnames and link density.

public cleanHeaders ( DOMElement $e )
$e	DOMElement

cleanStyles() public method

Remove the style attribute on every $e and under.

public cleanStyles ( DOMElement $e )
$e	DOMElement
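
The effect described above can be approximated with PHP's built-in DOM extension. This is a sketch under that assumption, not the class's own code; `stripStyles()` is a hypothetical name.

```php
<?php
// Hypothetical helper: removes the style attribute from $e and every descendant.
function stripStyles(DOMElement $e): void
{
    $e->removeAttribute('style');
    foreach ($e->getElementsByTagName('*') as $child) {
        $child->removeAttribute('style');
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<div style="color:red"><p style="margin:0">Text</p></div>');
$div = $doc->getElementsByTagName('div')->item(0);

stripStyles($div);
echo $doc->saveHTML($div), "\n";
```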

dbg() protected method

Debug.

Deprecated: use $this->logger->debug() instead

protected dbg ( $msg )

dump_dbg() protected method

Dump debug info.

Deprecated: Monolog now gathers the logs, so this is no longer needed

protected dump_dbg ( )

flagIsActive() public method

Check if the given flag is active.

public flagIsActive ( integer $flag ) : boolean
$flag	integer
Returns	boolean
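
The `$flags` property is described as `1 | 2 | 4`, which points to a bitmask. The sketch below shows how adding, removing, and checking a flag usually work on such a mask. The variable names are hypothetical; only the values 1, 2 and 4 come from this page.

```php
<?php
// Hypothetical flag names; only the values 1, 2 and 4 appear in the $flags description.
$flagA = 1;
$flagB = 2;
$flagC = 4;

$flags = $flagA | $flagB | $flagC;   // start with all flags set (7)

$flags = $flags & ~$flagB;           // remove a flag
$isActive = ($flags & $flagB) > 0;   // check a flag: now false
$flags = $flags | $flagB;            // add the flag back

var_dump($isActive, $flags);
```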

getArticleTitle() protected method

Get the article title as an H1.

protected getArticleTitle ( ) : DOMElement
Returns	DOMElement

getCommaCount() public method

Get the number of commas in the given text.

public getCommaCount ( string $text ) : integer
$text	string
Returns	integer

getContent() public method

Get article content element.

public getContent ( ) : DOMElement
Returns	DOMElement

getInnerText() public method

Get the inner text of a node. This also strips out any excess whitespace.

public getInnerText ( DOMElement $e, boolean $normalizeSpaces = true, boolean $flattenLines = false ) : string
$e	DOMElement
$normalizeSpaces	boolean	(default: true)
$flattenLines	boolean	(default: false)
Returns	string
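
As an illustration of the whitespace handling described above, here is a minimal sketch using the built-in DOM extension. `innerText()` is a hypothetical stand-in, the regular expression is an assumption, and the `$flattenLines` option is not covered.

```php
<?php
// Hypothetical helper: trimmed text content, optionally with whitespace runs collapsed.
function innerText(DOMNode $e, bool $normalizeSpaces = true): string
{
    $text = trim($e->textContent);
    if ($normalizeSpaces) {
        // Collapse any run of two or more whitespace characters into one space.
        $text = preg_replace('/\s{2,}/', ' ', $text);
    }

    return $text;
}

$doc = new DOMDocument();
$doc->loadHTML('<p>  Hello   <b>big</b>' . "\n\n" . 'world  </p>');
$p = $doc->getElementsByTagName('p')->item(0);

echo innerText($p), "\n";
```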

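getLinkDensity() public method

Get the density of links as a percentage of the content. This is the amount of text that is inside a link divided by the total text in the node. Can exclude external references to differentiate between simple text and menus/infoblocks.

public getLinkDensity ( DOMElement $e, string $excludeExternal = false ) : integer
$e	DOMElement
$excludeExternal	string
Returns	integer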
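
The ratio described above (text inside links divided by total text) can be sketched as follows. `linkDensity()` is a hypothetical helper that returns a fraction between 0 and 1, measures length in bytes, and ignores the `$excludeExternal` option.

```php
<?php
// Hypothetical helper: share of an element's text that sits inside <a> elements.
function linkDensity(DOMElement $e): float
{
    $textLength = strlen(trim($e->textContent));
    if ($textLength === 0) {
        return 0.0; // avoid division by zero on empty nodes
    }

    $linkLength = 0;
    foreach ($e->getElementsByTagName('a') as $link) {
        $linkLength += strlen(trim($link->textContent));
    }

    return $linkLength / $textLength;
}

$doc = new DOMDocument();
$doc->loadHTML('<div><a href="/x">12345</a>67890</div>');
$div = $doc->getElementsByTagName('div')->item(0);

var_dump(linkDensity($div)); // float(0.5): 5 of the 10 characters are link text
```

A high value suggests a navigation block or menu rather than article text.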

getTitle() public method

Get article title element.

public getTitle ( ) : DOMElement
Returns	DOMElement

getWeight() public method

Get an element's relative weight.

public getWeight ( DOMElement $e ) : integer
$e	DOMElement
Returns	integer

getWordCount() public method

Get the number of words in the given text, assuming words are separated by spaces. The input string should be normalized.

public getWordCount ( string $text ) : integer
$text	string
Returns	integer
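
Why the input should be normalized becomes clear from a space-counting sketch: doubled spaces would inflate the result. Both helpers below are hypothetical illustrations, not the class's code.

```php
<?php
// Hypothetical helper: counts literal commas.
function commaCount(string $text): int
{
    return substr_count($text, ',');
}

// Hypothetical helper: assumes $text is trimmed and uses single spaces between words.
function wordCount(string $text): int
{
    return $text === '' ? 0 : substr_count($text, ' ') + 1;
}

$text = 'First, second, and third';
var_dump(commaCount($text)); // int(2)
var_dump(wordCount($text));  // int(4)
```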

grabArticle() protected method

Using a variety of metrics (content score, classname, element types), find the content that is most likely to be the stuff a user wants to read. Then return it wrapped up in a div.

protected grabArticle ( DOMElement $page = null ) : DOMElement | boolean
$page	DOMElement
Returns	DOMElement | boolean

init() public method

Runs readability.

Workflow:
1. Prep the document by removing script tags, CSS, etc.
2. Build readability's DOM tree.
3. Grab the article content from the current DOM tree.
4. Replace the current DOM tree with the new one.
5. Read peacefully.

public init ( ) : boolean
Returns	boolean	true if we found content, false otherwise
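
A typical call sequence, using only the constructor, `init()`, `getTitle()` and `getContent()` as documented on this page, might look like the sketch below. It assumes the class is autoloadable (for example through Composer); `extractArticle()` is a hypothetical wrapper that returns null when the library is missing or no content is found.

```php
<?php
// Sketch only: assumes Readability\Readability can be autoloaded.
function extractArticle(string $html, ?string $url = null): ?array
{
    if (!class_exists('Readability\Readability')) {
        return null; // library not installed
    }

    $readability = new Readability\Readability($html, $url);
    if (!$readability->init()) {
        return null; // init() reported that no content was found
    }

    return [
        'title'   => $readability->getTitle()->textContent,
        'content' => $readability->getContent()->textContent,
    ];
}

$result = extractArticle('<html><body><p>Too short.</p></body></html>');
var_dump($result);
```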

initializeNode() protected method

Initialize a node with the readability object. Also checks the className/id for special names to add to its score.

protected initializeNode ( DOMElement $node )
$node	DOMElement

killBreaks() public method

Remove extraneous break tags from a node.

public killBreaks ( DOMElement $node )
$node	DOMElement
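
One way to picture "extraneous break tags" is a run of consecutive `<br>` elements collapsed into one. The sketch below does this on an HTML string with a regular expression. Both the pattern and the `collapseBreaks()` helper are assumptions for illustration; the actual method takes a DOMElement.

```php
<?php
// Hypothetical helper: replaces two or more consecutive <br> tags with a single one.
function collapseBreaks(string $html): string
{
    return preg_replace('!(<br\s*/?>\s*){2,}!i', '<br />', $html);
}

$collapsed = collapseBreaks('one<br><br/>  <br />two<br>three');
echo $collapsed, "\n"; // prints "one<br />two<br>three"
```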

postProcessContent() public method

Run any post-process modifications to article content as necessary.

public postProcessContent ( DOMElement $articleContent )
$articleContent	DOMElement

prepArticle() public method

Prepare the article node for display. Clean out any inline styles, iframes, forms, strip extraneous tags, etc.

public prepArticle ( DOMElement $articleContent )
$articleContent	DOMElement

prepDocument() protected method

Prepare the HTML document for readability to scrape it. This includes things like stripping JavaScript, CSS, and handling terrible markup.

protected prepDocument ( )

reinitBody() protected method

Will recreate previously deleted body property.

protected reinitBody ( )

removeFlag() public method

Remove a flag.

public removeFlag ( integer $flag )
$flag	integer

setLogger() public method

public setLogger ( Psr\Log\LoggerInterface $logger )
$logger	Psr\Log\LoggerInterface

weightAttribute() protected method

Get an element weight by attribute. Uses regular expressions to tell if this element looks good or bad.

protected weightAttribute ( DOMElement $element, string $attribute ) : integer
$element	DOMElement
$attribute	string
Returns	integer
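
The idea of scoring an attribute value with regular expressions can be sketched as below. The patterns and the score of 25 are illustrative assumptions, not the values stored in `$regexps`, and `attributeWeight()` is a hypothetical name.

```php
<?php
// Hypothetical patterns and scores, chosen for illustration only.
function attributeWeight(DOMElement $element, string $attribute): int
{
    $value = $element->getAttribute($attribute);
    if ($value === '') {
        return 0; // attribute missing or empty: neutral
    }

    $weight = 0;
    if (preg_match('/comment|footer|sidebar|sponsor/i', $value)) {
        $weight -= 25; // looks like boilerplate
    }
    if (preg_match('/article|body|content|entry|post/i', $value)) {
        $weight += 25; // looks like main content
    }

    return $weight;
}

$doc = new DOMDocument();
$doc->loadHTML('<div class="article-body"></div><div class="sidebar"></div>');
$divs = $doc->getElementsByTagName('div');

var_dump(attributeWeight($divs->item(0), 'class')); // int(25)
var_dump(attributeWeight($divs->item(1), 'class')); // int(-25)
```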

Property Details

$articleContent public property

public $articleContent

$articleTitle public property

public $articleTitle

$body protected property

protected $body

$bodyCache protected property

Cache the body HTML in case we need to re-use it later

protected $bodyCache

$convertLinksToFootnotes public property

public $convertLinksToFootnotes

$debug public property

No longer used; kept to avoid a BC break

public $debug

$dom public property

public $dom

$domainRegExp protected property

article domain regexp for calibration

protected $domainRegExp

$flags protected property

1 | 2 | 4; // Start with all processing flags set.

protected $flags

$html protected property

protected $html

$lightClean public property

preserves more content (experimental)

public $lightClean

$logger protected property

protected $logger

$original_html public property

public $original_html

$parser protected property

protected $parser

$post_filters protected property

output HTML filters

protected $post_filters

$pre_filters protected property

raw HTML filters

protected $pre_filters

$regexps public property

Defined up here so we don't instantiate them repeatedly in loops.

public $regexps

$revertForcedParagraphElements public property

public $revertForcedParagraphElements

$success protected property

indicates whether we were able to extract or not

protected $success

$tidied public property

public $tidied

$tidy_config public property

public $tidy_config

$url public property

optional - URL where HTML was retrieved

public $url

$useTidy protected property

protected $useTidy