PHP Class Goose\Modules\Formatters\OutputFormatter

Inheritance: extends Goose\Modules\AbstractModule, implements Goose\Modules\ModuleInterface, use trait Goose\Traits\ArticleMutatorTrait, use trait Goose\Traits\NodeGravityTrait, use trait Goose\Traits\NodeCommonTrait
Project: scotteh/php-goose

Protected Properties

Property Type Description
$CLEANUP_IGNORE_SELECTOR string
$SIBLING_BASE_LINE_SCORE double

Public Methods

Method Description
run ( Goose\Article $article )

Private Methods

Method Description
addSiblings ( DOMWrap\Element $topNode )
cleanupHtml ( ) : string Scrape the node content and return the html
convertLinksToText ( DOMWrap\Element $topNode ) Cleans up and converts any nodes that should be considered text into text
convertToHtml ( DOMWrap\Element $topNode ) : string
convertToText ( DOMWrap\Element $topNode ) : string Takes an element and turns the P tags into \n\n
getBaselineScoreForSiblings ( DOMWrap\Element $topNode ) : integer Computes a normalized baseline score for sibling paragraphs. Long articles can contain many paragraphs, so measuring the base score against the total text score of those paragraphs would be unfair. Instead, the score is normalized to the average paragraph score within the top node. For example, if 10 paragraphs have a total score of 1000, each averages 100, so 100 is the base.
getFormattedText ( ) : string Removes all unnecessary elements and formats the selected text nodes
getSiblingContent ( DOMWrap\Element $currentSibling, integer $baselineScoreForSiblingParagraphs ) : DOMWrap\Element[] Adds any siblings that may have a decent score to this node
getTagCleanedText ( DOMWrap\Element $item ) : string
isNodeScoreThreshholdMet ( DOMWrap\Element $topNode, DOMWrap\Element $node ) : boolean
isTableTagAndNoParagraphsExist ( DOMWrap\Element $topNode ) : boolean
postExtractionCleanup ( ) Remove any divs that look like non-content, clusters of links, or paras with no gusto
removeNodesWithNegativeScores ( DOMWrap\Element $topNode ) if there are elements inside our top node that have a negative gravity score, let's give em the boot
removeParagraphsWithFewWords ( DOMWrap\Element $topNode ) Removes paragraphs that have fewer than x words, which would indicate that the paragraph is some sort of link
removeSmallParagraphs ( DOMWrap\Element $topNode )
replaceTagsWithText ( DOMWrap\Element $topNode ) Replaces common formatting tags with just their text so that no stray formatting issues remain
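
The normalization described for getBaselineScoreForSiblings can be illustrated with a minimal sketch. The function name `averageParagraphScore` and the plain array of scores are assumptions made for illustration; the library's private method works on a DOMWrap\Element and may apply further weighting that this page does not describe.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of php-goose): returns the average score
 * per paragraph, so that long articles are not penalized for having many
 * paragraphs.
 *
 * @param float[] $paragraphScores one score per paragraph in the top node
 */
function averageParagraphScore(array $paragraphScores): float
{
    // Guard against division by zero when no paragraphs were scored.
    if (count($paragraphScores) === 0) {
        return 0.0;
    }

    return array_sum($paragraphScores) / count($paragraphScores);
}

// The example from the description: 10 paragraphs with a total score of 1000.
$scores = array_fill(0, 10, 100.0);
echo averageParagraphScore($scores), PHP_EOL;
```

With the scores from the description, the average (and therefore the baseline in this sketch) is 100 rather than the total of 1000.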

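The behavior described for convertToText (turning P tags into \n\n) can be sketched with PHP's built-in DOM extension. This is only an approximation under stated assumptions: the function name `paragraphsToText` is hypothetical, it takes an HTML string rather than a DOMWrap\Element, and it ignores any text outside P tags, which the library method may handle differently.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of php-goose): joins the text of every
 * <p> element in an HTML fragment with a blank line between paragraphs.
 */
function paragraphsToText(string $html): string
{
    // DOMDocument::loadHTML() rejects an empty string, so return early.
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();
    // Silence libxml notices about fragments that lack <html>/<body>.
    $doc->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

    $parts = [];
    foreach ($doc->getElementsByTagName('p') as $paragraph) {
        $text = trim($paragraph->textContent);
        if ($text !== '') {
            $parts[] = $text; // skip empty paragraphs
        }
    }

    // Paragraph boundaries become a double newline.
    return implode("\n\n", $parts);
}

echo paragraphsToText('<p>First paragraph.</p><p>Second paragraph.</p>'), PHP_EOL;
```
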
Method Details

run() public method

public run ( Goose\Article $article )
$article Goose\Article

Property Details

$CLEANUP_IGNORE_SELECTOR protected static property

protected static string $CLEANUP_IGNORE_SELECTOR
return string

$SIBLING_BASE_LINE_SCORE protected static property

protected static double $SIBLING_BASE_LINE_SCORE
return double