PHP Class Arachnid\Crawler

This class crawls all unique internal links found on a given website, up to a specified maximum page depth. The library is based on an original blog post by Zeid Rashwani. Josh Lockhart adapted the blog post's code (with permission) for Composer and Packagist and updated the syntax to conform to the PSR-2 coding standard.
Open project: codeguy/arachnid

Protected Properties

Property Type Description
$baseUrl string The base URL from which the crawler begins crawling
$links array Array of links (and related data) found by the crawler
$maxDepth integer The max depth the crawler will crawl

Public Methods

Method Description
__construct ( string $baseUrl, integer $maxDepth = 3 ) Constructor
getLinks ( ) : array Get links (and related data) found by the crawler
traverse ( string $url = null ) Initiate the crawl
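A minimal usage sketch based only on the three public signatures listed above. It assumes the codeguy/arachnid package is installed through Composer and that the autoloader is already loaded; `crawlSite` is a hypothetical wrapper, not part of the library.

```php
<?php
// Assumption: vendor/autoload.php has been required by the calling script
// and the Arachnid\Crawler class is available.

/**
 * Hypothetical helper: crawl a site and return whatever getLinks() reports.
 */
function crawlSite(string $baseUrl, int $maxDepth = 3): array
{
    $crawler = new \Arachnid\Crawler($baseUrl, $maxDepth);
    $crawler->traverse();        // $url is optional and defaults to null
    return $crawler->getLinks(); // array of links and related data
}

// Example call (performs real HTTP requests, so it is left commented out):
// print_r(crawlSite('http://www.example.com', 2));
```

The exact shape of the array returned by getLinks() is not documented on this page, so inspect it with print_r() before relying on specific keys.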

Protected Methods

Method Description
checkIfCrawlable ( string $uri ) : boolean Is a given URL crawlable?
checkIfExternal ( string $url ) : boolean Is URL external?
extractLinksInfo ( Crawler $crawler, string $url ) : array Extract link information from a URL
extractTitleInfo ( Crawler $crawler, string $url ) Extract title information from a URL
getPathFromUrl ( type $url ) : type Extract the relative path from a URL string
getScrapClient ( ) : Client Create and configure the Goutte client used for scraping
normalizeLink ( $uri ) : string Normalize link (remove hash, etc.)
traverseChildren ( array $childLinks, integer $depth ) Crawl child links
traverseSingle ( string $url, integer $depth ) Crawl single URL
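normalizeLink is described above only as "remove hash, etc.". The following sketch shows the hash-removal part in isolation. `stripFragment` is a hypothetical stand-in, and the library's own method may apply further normalization that is not documented here.

```php
<?php
/**
 * Hypothetical stand-in for a link normalizer: drops the "#fragment" part
 * so that /about and /about#team count as the same page.
 */
function stripFragment(string $uri): string
{
    $hashPos = strpos($uri, '#');

    // No fragment present: return the URI unchanged.
    if ($hashPos === false) {
        return $uri;
    }

    return substr($uri, 0, $hashPos);
}
```

Removing fragments before comparing URLs is one way to keep the set of crawled links unique.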

Method Details

__construct() public method

Constructor
public __construct ( string $baseUrl, integer $maxDepth = 3 )
$baseUrl string
$maxDepth integer

checkIfCrawlable() protected method

Is a given URL crawlable?
protected checkIfCrawlable ( string $uri ) : boolean
$uri string
Returns boolean
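This page does not say which URIs count as crawlable. As an illustration only, a check of this kind often rejects empty links, in-page anchors, and non-HTTP schemes. The function name `looksCrawlable` and the list of rejected prefixes are assumptions, not the library's rules.

```php
<?php
/**
 * Hypothetical crawlability check. The rejected prefixes are an assumed
 * list chosen for illustration.
 */
function looksCrawlable(string $uri): bool
{
    $uri = trim($uri);

    // Empty links and pure in-page anchors lead nowhere new.
    if ($uri === '' || $uri[0] === '#') {
        return false;
    }

    // Links that do not point at a fetchable page.
    foreach (['mailto:', 'tel:', 'javascript:'] as $prefix) {
        if (stripos($uri, $prefix) === 0) {
            return false;
        }
    }

    return true;
}
```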

checkIfExternal() protected method

Is URL external?
protected checkIfExternal ( string $url ) : boolean
$url string An absolute URL (with scheme)
Returns boolean
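One plausible way to decide whether an absolute URL is external is to compare its host with the host of the base URL. The sketch below does this with PHP's built-in parse_url(). `isExternalTo` is a hypothetical function, and the library may use a different comparison.

```php
<?php
/**
 * Hypothetical external-link check: a URL is external when its host
 * differs from the host of the base URL (case-insensitive).
 */
function isExternalTo(string $url, string $baseUrl): bool
{
    $host     = parse_url($url, PHP_URL_HOST);
    $baseHost = parse_url($baseUrl, PHP_URL_HOST);

    // Both arguments are expected to be absolute URLs with a scheme.
    if (!is_string($host) || !is_string($baseHost)) {
        throw new InvalidArgumentException('Both URLs must be absolute.');
    }

    return strcasecmp($host, $baseHost) !== 0;
}
```

Note that this simple comparison treats www.example.com and example.com as different hosts.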

extractLinksInfo() protected method

Extract link information from a URL
protected extractLinksInfo ( Crawler $crawler, string $url ) : array
$crawler Symfony\Component\DomCrawler\Crawler
$url string
Returns array
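The library performs this step with a Symfony DomCrawler instance. To show the general idea without that dependency, the sketch below swaps in PHP's built-in DOM extension. The function name `extractLinks` and the shape of the returned array are assumptions, not the library's format.

```php
<?php
/**
 * Hypothetical link extraction using PHP's DOM extension (not DomCrawler).
 * Returns an array keyed by href, each entry holding the anchor text.
 */
function extractLinks(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href === '') {
            continue; // <a> without an href is not a link
        }
        $links[$href] = ['text' => trim($anchor->textContent)];
    }

    return $links;
}
```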

extractTitleInfo() protected method

Extract title information from a URL
protected extractTitleInfo ( Crawler $crawler, string $url )
$crawler Symfony\Component\DomCrawler\Crawler
$url string

getPathFromUrl() protected method

Extract the relative path from a URL string
protected getPathFromUrl ( type $url ) : type
$url type
Returns type
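The parameter and return types are left as the placeholder `type` in the source documentation. Assuming both are strings, extracting a path could look like the sketch below, which relies on PHP's parse_url(). `pathOf` is a hypothetical name, and whether the library keeps the query string is not stated here.

```php
<?php
/**
 * Hypothetical path extraction: returns the path component of a URL,
 * falling back to "/" when the URL has no path.
 */
function pathOf(string $url): string
{
    $path = parse_url($url, PHP_URL_PATH);

    return (is_string($path) && $path !== '') ? $path : '/';
}
```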

getScrapClient() protected method

Create and configure the Goutte client used for scraping
protected getScrapClient ( ) : Client
Returns Goutte\Client

traverse() public method

Initiate the crawl
public traverse ( string $url = null )
$url string

traverseChildren() protected method

Crawl child links
protected traverseChildren ( array $childLinks, integer $depth )
$childLinks array
$depth integer

traverseSingle() protected method

Crawl single URL
protected traverseSingle ( string $url, integer $depth )
$url string
$depth integer
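The traverseSingle and traverseChildren signatures suggest a recursive, depth-limited crawl. The sketch below illustrates that idea on an in-memory link graph instead of real HTTP requests. The function `crawlGraph`, the graph format, and the convention that depth counts up from 0 are all assumptions made for illustration.

```php
<?php
/**
 * Hypothetical depth-limited traversal.
 *
 * @param array  $graph    page => list of links found on that page
 *                         (stands in for fetching and parsing pages)
 * @param string $start    page where the crawl begins
 * @param int    $maxDepth deepest level that is still visited
 * @return array visited page => depth at which it was first reached
 */
function crawlGraph(array $graph, string $start, int $maxDepth): array
{
    $visited = [];

    $visit = function (string $url, int $depth) use (&$visit, &$visited, $graph, $maxDepth): void {
        // Stop when the page is too deep or has been seen already.
        if ($depth > $maxDepth || isset($visited[$url])) {
            return;
        }

        $visited[$url] = $depth;

        // Crawl the children one level deeper.
        foreach ($graph[$url] ?? [] as $child) {
            $visit($child, $depth + 1);
        }
    };

    $visit($start, 0);

    return $visited;
}
```

Tracking visited pages is what keeps each link unique in the result, and the depth check bounds the crawl even when pages link back to each other.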

Property Details

$baseUrl protected property

The base URL from which the crawler begins crawling
protected string $baseUrl
Returns string

$maxDepth protected property

The max depth the crawler will crawl
protected int $maxDepth
Returns integer