Method |
Description |
|
addSiblings ( DOMWrap\Element $topNode ) |
|
|
cleanupHtml ( ) : string |
Scrapes the node content and returns the HTML |
|
convertLinksToText ( DOMWrap\Element $topNode ) |
Cleans up and converts any nodes that should be considered text into text |
|
convertToHtml ( DOMWrap\Element $topNode ) : string |
|
|
convertToText ( DOMWrap\Element $topNode ) : string |
Takes an element and turns the P tags into \n\n |
|
getBaselineScoreForSiblings ( DOMWrap\Element $topNode ) : integer |
Long articles can contain many paragraphs, so calculating the base score against the total text score
of those paragraphs would be unfair. Instead, the score is normalized to the average score of the paragraphs
within the top node. For example, if 10 paragraphs have a total score of 1000, each averages 100, so 100 should be our base. |
|
getFormattedText ( ) : string |
Removes all unnecessary elements and formats the selected text nodes |
|
getSiblingContent ( DOMWrap\Element $currentSibling, integer $baselineScoreForSiblingParagraphs ) : DOMWrap\Element[] |
Adds any siblings that may have a decent score to this node |
|
getTagCleanedText ( DOMWrap\Element $item ) : string |
|
|
isNodeScoreThreshholdMet ( DOMWrap\Element $topNode, DOMWrap\Element $node ) : boolean |
|
|
isTableTagAndNoParagraphsExist ( DOMWrap\Element $topNode ) : boolean |
|
|
postExtractionCleanup ( ) |
Removes any divs that look like non-content, clusters of links, or paragraphs with no gusto |
|
removeNodesWithNegativeScores ( DOMWrap\Element $topNode ) |
Removes any elements inside the top node that have a negative gravity score |
|
removeParagraphsWithFewWords ( DOMWrap\Element $topNode ) |
Removes paragraphs that have fewer than x words, which would indicate that they are some sort of link |
|
removeSmallParagraphs ( DOMWrap\Element $topNode ) |
|
|
replaceTagsWithText ( DOMWrap\Element $topNode ) |
Replaces common formatting tags with just their text so we don't have any crazy formatting issues |
|
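
The averaging described for `getBaselineScoreForSiblings` can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name `averageParagraphScore`, the plain array of integer scores, and the `$fallback` parameter are all assumptions made for the example.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns the average paragraph score, which serves as
 * the baseline that sibling paragraphs are compared against.
 *
 * @param int[] $paragraphScores One score per paragraph in the top node (assumed input shape).
 * @param int   $fallback        Value returned when there are no paragraphs (assumed behaviour).
 */
function averageParagraphScore(array $paragraphScores, int $fallback): int
{
    $count = count($paragraphScores);
    if ($count === 0) {
        // Nothing to average, so avoid dividing by zero.
        return $fallback;
    }

    // Normalize by the number of paragraphs instead of using the raw total.
    return (int) round(array_sum($paragraphScores) / $count);
}

// Ten paragraphs with a total score of 1000 average out to 100 each.
$scores = array_fill(0, 10, 100);
echo averageParagraphScore($scores, 0), PHP_EOL; // prints 100
```

Using the average rather than the total keeps the baseline comparable between short and long articles, which is the fairness concern raised in the method description.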