PHP Class Arachnid\Crawler

This class crawls all unique internal links found on a given website, up to a specified maximum page depth. The library is based on an original blog post by Zeid Rashwani; Josh Lockhart adapted the post's code (with permission) for Composer and Packagist and updated the syntax to conform to the PSR-2 coding standard.

Protected Properties

Property Type Description
$baseUrl string The base URL from which the crawler begins crawling
$links array Array of links (and related data) found by the crawler
$maxDepth integer The max depth the crawler will crawl

Public Methods

Method Description
__construct ( string $baseUrl, integer $maxDepth = 3 ) Constructor
getLinks ( ) : array Get links (and related data) found by the crawler
traverse ( string $url = null ) Initiate the crawl
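The public API above suggests a simple three-step workflow: construct, traverse, inspect. A minimal usage sketch (the Composer package name is not shown on this page, and the crawl itself requires network access, so this is illustrative only):

```php
<?php
require 'vendor/autoload.php'; // install the Arachnid library via Composer first

use Arachnid\Crawler;

// Crawl starting from the base URL, up to 2 levels deep.
$crawler = new Crawler('http://example.com', 2);
$crawler->traverse();

// Inspect the links (and related data) collected by the crawler.
$links = $crawler->getLinks();
print_r(array_keys($links));
```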

Protected Methods

Method Description
checkIfCrawlable ( string $uri ) : boolean Is a given URL crawlable?
checkIfExternal ( string $url ) : boolean Is URL external?
extractLinksInfo ( Crawler $crawler, string $url ) : array Extract link information from a URL
extractTitleInfo ( Crawler $crawler, string $url ) Extract title information from a URL
getPathFromUrl ( string $url ) : string Extract the relative path from a URL string
getScrapClient ( ) : Client Create and configure the Goutte client used for scraping
normalizeLink ( $uri ) : string Normalize a link (remove hash, etc.)
traverseChildren ( array $childLinks, integer $depth ) Crawl child links
traverseSingle ( string $url, integer $depth ) Crawl single URL
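Several of the protected helpers above are small URL utilities. For instance, normalizeLink() removes the fragment ("hash") so that links to different anchors on the same page are counted as one unique link. A hypothetical sketch of that behavior (not the library's actual code):

```php
<?php
// Hypothetical sketch of normalizeLink(): drop the fragment so that
// /page, /page#top and /page#bottom all collapse to the same link.
function normalizeLink(string $uri): string
{
    // Everything before the first '#' is the canonical link.
    return explode('#', $uri, 2)[0];
}

echo normalizeLink('/docs/page#section-3'); // /docs/page
```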

Method Details

__construct() public method

Constructor
public __construct ( string $baseUrl, integer $maxDepth = 3 )
$baseUrl string
$maxDepth integer

checkIfCrawlable() protected method

Is a given URL crawlable?
protected checkIfCrawlable ( string $uri ) : boolean
$uri string
Returns boolean
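The library's implementation isn't shown on this page, but a typical crawlability check filters out empty links, same-page fragments, and non-HTTP schemes. A hypothetical sketch of that idea:

```php
<?php
// Hypothetical sketch of checkIfCrawlable(): reject links a crawler
// cannot or should not follow. Not the library's actual code.
function checkIfCrawlable(string $uri): bool
{
    if ($uri === '') {
        return false; // empty href
    }
    // Patterns for links that do not lead to crawlable pages.
    $stopPatterns = [
        '@^javascript:@', // inline JS handlers
        '@^#@',           // same-page fragment
        '@^mailto:@',     // email links
        '@^tel:@',        // phone links
    ];
    foreach ($stopPatterns as $pattern) {
        if (preg_match($pattern, $uri) === 1) {
            return false;
        }
    }
    return true;
}

var_dump(checkIfCrawlable('/about'));          // bool(true)
var_dump(checkIfCrawlable('#section-2'));      // bool(false)
```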

checkIfExternal() protected method

Is URL external?
protected checkIfExternal ( string $url ) : boolean
$url string An absolute URL (with scheme)
Returns boolean
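Since the parameter is documented as an absolute URL with a scheme, an external check amounts to comparing the URL's host against the base URL's host. A hypothetical sketch (not the library's actual code):

```php
<?php
// Hypothetical sketch of checkIfExternal(): compare the host of an
// absolute URL against the crawler's base host.
function checkIfExternal(string $url, string $baseUrl): bool
{
    $baseHost = parse_url($baseUrl, PHP_URL_HOST);
    $host     = parse_url($url, PHP_URL_HOST);
    // Relative URLs have no host component and are internal by definition.
    if ($host === null) {
        return false;
    }
    // Host names are case-insensitive.
    return strcasecmp($host, $baseHost) !== 0;
}

var_dump(checkIfExternal('http://example.com/a', 'http://example.com')); // bool(false)
var_dump(checkIfExternal('http://other.org/a', 'http://example.com'));   // bool(true)
```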

extractLinksInfo() protected method

Extract link information from a URL
protected extractLinksInfo ( Crawler $crawler, string $url ) : array
$crawler Symfony\Component\DomCrawler\Crawler
$url string
Returns array

extractTitleInfo() protected method

Extract title information from a URL
protected extractTitleInfo ( Crawler $crawler, string $url )
$crawler Symfony\Component\DomCrawler\Crawler
$url string

getPathFromUrl() protected method

Extract the relative path from a URL string
protected getPathFromUrl ( string $url ) : string
$url string
Returns string
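The original docblock left the types unspecified, but the method's purpose is clear: reduce a URL to its relative path. A hypothetical sketch of one way to do it, assuming the crawler's base URL is available for comparison:

```php
<?php
// Hypothetical sketch of getPathFromUrl(): reduce an absolute URL on
// the base domain to its relative path. Not the library's actual code.
function getPathFromUrl(string $url, string $baseUrl): string
{
    // URLs under the base URL: strip the base prefix.
    if (strpos($url, $baseUrl) === 0) {
        $path = substr($url, strlen($baseUrl));
        return $path === '' ? '/' : $path;
    }
    // Otherwise fall back to the path component of the URL.
    return parse_url($url, PHP_URL_PATH) ?? $url;
}

echo getPathFromUrl('http://example.com/blog/post', 'http://example.com'); // /blog/post
```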

getScrapClient() protected method

Create and configure the Goutte client used for scraping
protected getScrapClient ( ) : Client
Returns Goutte\Client

traverse() public method

Initiate the crawl
public traverse ( string $url = null )
$url string

traverseChildren() protected method

Crawl child links
protected traverseChildren ( array $childLinks, integer $depth )
$childLinks array
$depth integer

traverseSingle() protected method

Crawl single URL
protected traverseSingle ( string $url, integer $depth )
$url string
$depth integer

Property Details

$baseUrl protected property

The base URL from which the crawler begins crawling
protected string $baseUrl
Returns string

$maxDepth protected property

The max depth the crawler will crawl
protected int $maxDepth
Returns integer