PHP class Readability\Readability

Differences between the PHP port and the original
-------------------------------------------------

Arc90's Readability is designed to run in the browser. It works on the DOM tree (the parsed HTML) after the page's CSS styles have been applied and JavaScript code executed. This PHP port does not run inside a browser. We use PHP's ability to parse HTML to build our DOM tree, but we cannot rely on CSS or JavaScript support. As such, the results will not always match Arc90's Readability. (For example, if a web page contains CSS style rules or JavaScript code which hide certain HTML elements from display, Arc90's Readability will dismiss those from consideration but our PHP port, unable to understand CSS or JavaScript, will not know any better.)

Another significant difference is that the aim of Arc90's Readability is to re-present the main content block of a given web page so users can read it more easily in their browsers. Correct identification, cleanup, and separation of the content block is only a part of this process. This PHP port is only concerned with this part; it does not include code that relates to presentation in the browser. Arc90 already does that extremely well, and for PDF output there's FiveFilters.org's PDF Newspaper: http://fivefilters.org/pdf-newspaper/.

Finally, this class contains methods that might be useful for developers working on HTML document fragments. So without deviating too much from the original code (which I don't want to do because it makes debugging and updating more difficult), I've tried to make it a little more developer friendly. You should be able to use the methods here on existing DOMElement objects without passing an entire HTML document to be parsed.
Project: j0k3r/php-readability

Public properties

Property Type Description
$articleContent
$articleTitle
$convertLinksToFootnotes
$debug no longer used; kept to avoid a BC break
$dom
$lightClean preserves more content (experimental)
$original_html
$regexps Defined up here so we don't instantiate them repeatedly in loops.
$revertForcedParagraphElements
$tidied
$tidy_config
$url optional - URL where HTML was retrieved

Protected properties

Property Type Description
$body
$bodyCache Cache the body HTML in case we need to re-use it later
$domainRegExp article domain regexp for calibration
$flags Starts with all processing flags set (1 | 2 | 4).
$html
$logger
$parser
$post_filters output HTML filters
$pre_filters raw HTML filters
$success indicates whether we were able to extract or not
$useTidy

Public methods

Method Description
__construct ( $html, $url = null, $parser = 'libxml', $use_tidy = true ) Create instance of Readability.
addFlag ( integer $flag ) Add a flag.
addFootnotes ( DOMElement $articleContent ) For easier reading, convert this document to have footnotes at the bottom rather than inline links.
addPostFilter ( $filter, $replacer = '' ) Add post filter for raw output HTML processing.
addPreFilter ( $filter, $replacer = '' ) Add pre filter for raw input HTML processing.
clean ( DOMElement $e, string $tag ) Clean a node of all elements of type "tag".
cleanConditionally ( DOMElement $e, string $tag ) Clean an element of all tags of type "tag" if they look fishy.
cleanHeaders ( DOMElement $e ) Clean out spurious headers from an Element. Checks things like classnames and link density.
cleanStyles ( DOMElement $e ) Remove the style attribute on every $e and under.
flagIsActive ( integer $flag ) : boolean Check if the given flag is active.
getCommaCount ( string $text ) : integer Get the number of commas in a given text.
getContent ( ) : DOMElement Get article content element.
getInnerText ( DOMElement $e, boolean $normalizeSpaces = true, boolean $flattenLines = false ) : string Get the inner text of a node.
getLinkDensity ( DOMElement $e, string $excludeExternal = false ) : integer Get the density of links as a percentage of the content. This is the amount of text that is inside a link divided by the total text in the node.
getTitle ( ) : DOMElement Get article title element.
getWeight ( DOMElement $e ) : integer Get an element's relative weight.
getWordCount ( string $text ) : integer Get the number of words in a given text, assuming words are separated by a space.
init ( ) : boolean Runs readability.
killBreaks ( DOMElement $node ) Remove extraneous break tags from a node.
postProcessContent ( DOMElement $articleContent ) Run any post-process modifications to article content as necessary.
prepArticle ( DOMElement $articleContent ) Prepare the article node for display. Clean out any inline styles, iframes, forms, strip extraneous <p> tags, etc.
removeFlag ( integer $flag ) Remove a flag.
setLogger ( Psr\Log\LoggerInterface $logger )

Protected methods

Method Description
dbg ( $msg ) Debug.
dump_dbg ( ) Dump debug info.
getArticleTitle ( ) : DOMElement Get the article title as an H1.
grabArticle ( DOMElement $page = null ) : DOMElement | boolean Using a variety of metrics (content score, classname, element types), find the content that is most likely to be the stuff a user wants to read. Then return it wrapped up in a div.
initializeNode ( DOMElement $node ) Initialize a node with the readability object. Also checks the className/id for special names to add to its score.
prepDocument ( ) Prepare the HTML document for readability to scrape it.
reinitBody ( ) Will recreate previously deleted body property.
weightAttribute ( DOMElement $element, string $attribute ) : integer Get an element weight by attribute.

Private methods

Method Description
loadHtml ( ) Load HTML in a DOMDocument.

Method details

__construct() public method

Create instance of Readability.
public __construct ( $html, $url = null, $parser = 'libxml', $use_tidy = true )

addFlag() public method

Add a flag.
public addFlag ( integer $flag )
$flag integer

addFootnotes() public method

For easier reading, convert this document to have footnotes at the bottom rather than inline links.
See also: http://www.roughtype.com/archives/2010/05/experiments_in.php
public addFootnotes ( DOMElement $articleContent )
$articleContent DOMElement

addPostFilter() public method

Add post filter for raw output HTML processing.
public addPostFilter ( $filter, $replacer = '' )

addPreFilter() public method

Add pre filter for raw input HTML processing.
public addPreFilter ( $filter, $replacer = '' )
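
The ( $filter, $replacer = '' ) signature suggests that each filter pairs a pattern with a replacement string. The following is a minimal sketch of that idea, assuming the filters are PCRE patterns applied to the raw HTML with preg_replace(); this page does not confirm how the class applies them. The applyFilters() helper and the sample pattern are hypothetical and not part of the class.

```php
<?php
// Hypothetical helper: apply "pattern => replacement" filters to raw HTML.
// Assumption: each filter is a PCRE pattern; the class may work differently.
function applyFilters(string $html, array $filters): string
{
    foreach ($filters as $pattern => $replacement) {
        $result = preg_replace($pattern, $replacement, $html);
        // preg_replace() returns null on error; keep the input in that case.
        if ($result !== null) {
            $html = $result;
        }
    }

    return $html;
}

// Sample filter (an illustration only): drop inline <script> blocks.
$filters = ['!<script[^>]*>.*?</script>!is' => ''];
$cleaned = applyFilters('<p>Hello</p><script>alert(1)</script>', $filters);
echo $cleaned, PHP_EOL;
```

The same shape would serve for both pre filters (run on the input HTML) and post filters (run on the output HTML); only the point at which they run differs.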

clean() public method

Clean a node of all elements of type "tag" (unless it's a YouTube/Vimeo video; people love movies). Updated 2012-09-18 to preserve YouTube/Vimeo iframes.
public clean ( DOMElement $e, string $tag )
$e DOMElement
$tag string

cleanConditionally() public method

Clean an element of all tags of type "tag" if they look fishy. "Fishy" is an algorithm based on content length, classnames, link density, number of images & embeds, etc.
public cleanConditionally ( DOMElement $e, string $tag )
$e DOMElement
$tag string

cleanHeaders() public method

Clean out spurious headers from an Element. Checks things like classnames and link density.
public cleanHeaders ( DOMElement $e )
$e DOMElement

cleanStyles() public method

Remove the style attribute on every $e and under.
public cleanStyles ( DOMElement $e )
$e DOMElement
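
As an illustration of what "on every $e and under" means, here is a small sketch that removes the style attribute from an element and all of its descendants using PHP's DOM extension. The removeStyleAttributes() function is a hypothetical stand-in, not the class's own implementation.

```php
<?php
// Sketch: strip the style attribute from $e and from every descendant element.
function removeStyleAttributes(DOMElement $e): void
{
    $e->removeAttribute('style');

    // getElementsByTagName('*') matches all descendant elements of $e.
    foreach ($e->getElementsByTagName('*') as $child) {
        $child->removeAttribute('style');
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<div style="color:red"><p style="margin:0">Hi</p></div>');
$div = $doc->getElementsByTagName('div')->item(0);

removeStyleAttributes($div);
echo $doc->saveHTML($div), PHP_EOL;
```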

dbg() protected method

Debug.
Deprecated: use $this->logger->debug() instead
protected dbg ( $msg )

dump_dbg() protected method

Dump debug info.
Deprecated: since Monolog gathers the logs, we don't need it
protected dump_dbg ( )

flagIsActive() public method

Check if the given flag is active.
public flagIsActive ( integer $flag ) : boolean
$flag integer
Returns boolean
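
The documented initial value of $flags (1 | 2 | 4) indicates that flags are stored as a bitmask. A minimal sketch of how adding, removing, and checking such flags typically works is shown below. The FlagSet class and the FLAG_* names are hypothetical; this page does not list the class's real flag constants.

```php
<?php
// Hypothetical flag names; the page only documents the initial value 1 | 2 | 4.
const FLAG_ONE = 1;
const FLAG_TWO = 2;
const FLAG_FOUR = 4;

final class FlagSet
{
    /** @var int Bitmask, starting with all flags set. */
    private $flags = FLAG_ONE | FLAG_TWO | FLAG_FOUR;

    public function add(int $flag): void
    {
        $this->flags |= $flag;   // switch the bit on
    }

    public function remove(int $flag): void
    {
        $this->flags &= ~$flag;  // switch the bit off
    }

    public function isActive(int $flag): bool
    {
        return ($this->flags & $flag) > 0;
    }
}

$set = new FlagSet();
$set->remove(FLAG_TWO);
var_dump($set->isActive(FLAG_TWO)); // bool(false)
var_dump($set->isActive(FLAG_FOUR)); // bool(true)
```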

getArticleTitle() protected method

Get the article title as an H1.
protected getArticleTitle ( ) : DOMElement
Returns DOMElement

getCommaCount() public method

Get the number of commas in a given text.
public getCommaCount ( string $text ) : integer
$text string
Returns integer

getContent() public method

Get article content element.
public getContent ( ) : DOMElement
Returns DOMElement

getInnerText() public method

Get the inner text of a node. This also strips out any excess whitespace to be found.
public getInnerText ( DOMElement $e, boolean $normalizeSpaces = true, boolean $flattenLines = false ) : string
$e DOMElement
$normalizeSpaces boolean (default: true)
$flattenLines boolean (default: false)
Returns string

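getLinkDensity() public method

Get the density of links as a percentage of the content. This is the amount of text that is inside a link divided by the total text in the node. Can exclude external references to differentiate between simple text and menus/infoblocks.
public getLinkDensity ( DOMElement $e, string $excludeExternal = false ) : integer
$e DOMElement
$excludeExternal string
Returns integer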
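
To make the ratio concrete, here is a rough sketch of a link-density calculation built on PHP's DOM extension: the length of the text inside a elements divided by the length of all text in the node. The linkDensity() function is hypothetical, returns a fraction between 0.0 and 1.0 rather than the documented integer, and ignores the $excludeExternal option, so treat it as an illustration of the idea only.

```php
<?php
// Sketch: share of a node's text that sits inside <a> elements (0.0 to 1.0).
// Lengths are measured in bytes via strlen(), which is fine for ASCII input.
function linkDensity(DOMElement $e): float
{
    $totalLength = strlen(trim($e->textContent));
    if ($totalLength === 0) {
        return 0.0; // no text at all: avoid division by zero
    }

    $linkLength = 0;
    foreach ($e->getElementsByTagName('a') as $link) {
        $linkLength += strlen(trim($link->textContent));
    }

    return $linkLength / $totalLength;
}

$doc = new DOMDocument();
$doc->loadHTML('<div>abcd<a href="#">efgh</a></div>');
$div = $doc->getElementsByTagName('div')->item(0);

var_dump(linkDensity($div)); // float(0.5): 4 of the 8 characters are link text
```

A node made mostly of links (a menu, for instance) scores close to 1.0, while ordinary article text scores close to 0.0.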

getTitle() public method

Get article title element.
public getTitle ( ) : DOMElement
Returns DOMElement

getWeight() public method

Get an element's relative weight.
public getWeight ( DOMElement $e ) : integer
$e DOMElement
Returns integer

getWordCount() public method

Get the number of words in a given text, assuming words are separated by a space. Input string should be normalized.
public getWordCount ( string $text ) : integer
$text string
Returns integer
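
The note about normalized input matters because counting by spaces only works when words are separated by exactly one space. The sketch below shows one plausible way to count words and commas under that assumption; wordCount() and commaCount() are hypothetical helpers, not the class's own code.

```php
<?php
// Sketch: count commas in a text.
function commaCount(string $text): int
{
    return substr_count($text, ',');
}

// Sketch: count words, assuming the text is normalized so that words are
// separated by single spaces.
function wordCount(string $text): int
{
    $text = trim($text);
    if ($text === '') {
        return 0;
    }

    return substr_count($text, ' ') + 1;
}

var_dump(wordCount('one two three')); // int(3)
var_dump(commaCount('a, b, c'));      // int(2)
```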

grabArticle() protected method

Using a variety of metrics (content score, classname, element types), find the content that is most likely to be the stuff a user wants to read. Then return it wrapped up in a div.
protected grabArticle ( DOMElement $page = null ) : DOMElement | boolean
$page DOMElement
Returns DOMElement | boolean

init() public method

Runs readability.

Workflow:
1. Prep the document by removing script tags, CSS, etc.
2. Build readability's DOM tree.
3. Grab the article content from the current DOM tree.
4. Replace the current DOM tree with the new one.
5. Read peacefully.

public init ( ) : boolean
Returns boolean true if we found content, false otherwise

initializeNode() protected method

Initialize a node with the readability object. Also checks the className/id for special names to add to its score.
protected initializeNode ( DOMElement $node )
$node DOMElement

killBreaks() public method

Remove extraneous break tags from a node.
public killBreaks ( DOMElement $node )
$node DOMElement

postProcessContent() public method

Run any post-process modifications to article content as necessary.
public postProcessContent ( DOMElement $articleContent )
$articleContent DOMElement

prepArticle() public method

Prepare the article node for display. Clean out any inline styles, iframes, forms, strip extraneous <p> tags, etc.
public prepArticle ( DOMElement $articleContent )
$articleContent DOMElement

prepDocument() protected method

Prepare the HTML document for readability to scrape it. This includes things like stripping JavaScript, CSS, and handling terrible markup.
protected prepDocument ( )

reinitBody() protected method

Will recreate previously deleted body property.
protected reinitBody ( )

removeFlag() public method

Remove a flag.
public removeFlag ( integer $flag )
$flag integer

setLogger() public method

public setLogger ( Psr\Log\LoggerInterface $logger )
$logger Psr\Log\LoggerInterface

weightAttribute() protected method

Get an element weight by attribute. Uses regular expressions to tell if this element looks good or bad.
protected weightAttribute ( DOMElement $element, string $attribute ) : integer
$element DOMElement
$attribute string
Returns integer
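
The idea of scoring an attribute with regular expressions can be sketched as follows. The patterns and the weight of 25 are placeholders chosen for illustration; the class keeps its real patterns in $regexps, which this page does not list. The attributeWeight() function is likewise hypothetical.

```php
<?php
// Placeholder patterns and weight: illustrative only, not the class's values.
const POSITIVE_PATTERN = '/article|content|main|post|text/i';
const NEGATIVE_PATTERN = '/comment|footer|sidebar|sponsor|widget/i';
const ATTRIBUTE_WEIGHT = 25;

// Sketch: score an attribute value (for example class or id) up or down
// depending on which pattern it matches.
function attributeWeight(DOMElement $element, string $attribute): int
{
    if (!$element->hasAttribute($attribute)) {
        return 0;
    }

    $value = $element->getAttribute($attribute);
    $weight = 0;

    if (preg_match(NEGATIVE_PATTERN, $value) === 1) {
        $weight -= ATTRIBUTE_WEIGHT;
    }
    if (preg_match(POSITIVE_PATTERN, $value) === 1) {
        $weight += ATTRIBUTE_WEIGHT;
    }

    return $weight;
}

$doc = new DOMDocument();
$good = $doc->createElement('div');
$good->setAttribute('class', 'article-body');
$bad = $doc->createElement('div');
$bad->setAttribute('class', 'sidebar');

var_dump(attributeWeight($good, 'class')); // int(25)
var_dump(attributeWeight($bad, 'class'));  // int(-25)
var_dump(attributeWeight($good, 'id'));    // int(0): no id attribute
```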

Property details

$articleContent public property

public $articleContent

$articleTitle public property

public $articleTitle

$body protected property

protected $body

$bodyCache protected property

Cache the body HTML in case we need to re-use it later
protected $bodyCache

$convertLinksToFootnotes public property

public $convertLinksToFootnotes

$debug public property

no longer used; kept to avoid a BC break
public $debug

$dom public property

public $dom

$domainRegExp protected property

article domain regexp for calibration
protected $domainRegExp

$flags protected property

Starts with all processing flags set (1 | 2 | 4).
protected $flags

$html protected property

protected $html

$lightClean public property

preserves more content (experimental)
public $lightClean

$logger protected property

protected $logger

$original_html public property

public $original_html

$parser protected property

protected $parser

$post_filters protected property

output HTML filters
protected $post_filters

$pre_filters protected property

raw HTML filters
protected $pre_filters

$regexps public property

Defined up here so we don't instantiate them repeatedly in loops.
public $regexps

$revertForcedParagraphElements public property

public $revertForcedParagraphElements

$success protected property

indicates whether we were able to extract or not
protected $success

$tidied public property

public $tidied

$tidy_config public property

public $tidy_config

$url public property

optional - URL where HTML was retrieved
public $url

$useTidy protected property

protected $useTidy