PHPRO.ORG

Get Links With DOM

Do not use REGEX to parse HTML

Perhaps the biggest mistake people make when trying to get URLs or link text from a web page is doing it with regular expressions. The job can be done with regular expressions; however, there is a high overhead in having preg loop over the entire document many times. The correct way, the faster way, and the infinitely cooler way is to use DOM.

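By using DOM in the getLinks function, it is simple to create an array containing all the links on a web page as keys and the link text as values. This array can then be looped over like any other array to create a list, or manipulated in any way desired. Because the URLs are the array keys, a URL that appears more than once on the page is stored only once, with the text of its last occurrence.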

Note that error suppression is used when loading the HTML. This is to suppress warnings about invalid HTML entities that are not defined in the DOCTYPE. Of course, in a production environment, the display of errors would be disabled and error reporting set to none.


<?php

function getLinks($link)
{
    /*** return array ***/
    $ret = array();

    /*** a new dom object ***/
    $dom = new DOMDocument;

    /*** fetch the page, give up quietly if nothing comes back ***/
    $html = @file_get_contents($link);
    if ($html === false || $html === '')
    {
        return $ret;
    }

    /*** get the HTML (suppress errors) ***/
    @$dom->loadHTML($html);

    /*** remove silly white space ***/
    $dom->preserveWhiteSpace = false;

    /*** get the links from the HTML ***/
    $links = $dom->getElementsByTagName('a');

    /*** loop over the links ***/
    foreach ($links as $tag)
    {
        /*** the first child node holds the link text, if there is one ***/
        $first = $tag->childNodes->item(0);
        $ret[$tag->getAttribute('href')] = $first ? $first->nodeValue : '';
    }

    return $ret;
}
?>
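
The error suppression noted above can also be handled without the @ operator. The following is a minimal sketch, assuming the HTML is already in a string; the function name loadHtmlCollectingErrors is hypothetical and not part of the original getLinks code. It uses libxml_use_internal_errors() so that parser complaints are collected for inspection instead of being emitted as warnings.

```php
<?php
/**
 * Hypothetical helper (not from the original article): load HTML into a
 * DOMDocument and collect libxml parse errors instead of silencing them.
 */
function loadHtmlCollectingErrors($html)
{
    // Send libxml errors to an internal buffer and remember the old setting
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    // Gather whatever the parser complained about
    $messages = array();
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
    }

    // Empty the buffer and restore the previous setting
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return array($dom, $messages);
}

// Example input is an assumption, chosen only for illustration
list($dom, $messages) = loadHtmlCollectingErrors('<p>Fish &amp; <a href="/chips">chips</a></p>');
echo $dom->getElementsByTagName('a')->length . " link(s), " . count($messages) . " parser message(s)\n";
```

The collected messages can be logged or discarded as needed, which keeps genuine problems visible while still tolerating sloppy markup.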

A similar approach would be to use XPath, which achieves the same results. Either way, using the DOM is going to prove far more efficient than REGEX.
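
As a rough illustration of the XPath approach, here is a minimal sketch. The function name getLinksXPath is hypothetical, and it takes an HTML string rather than a URL so that it can be run without network access. It also differs from getLinks in two ways that are choices made for this sketch: the query selects only anchors that carry an href attribute, and textContent is used so that text nested inside other tags is included.

```php
<?php
/**
 * Hypothetical XPath variant of getLinks (not from the original article).
 * Takes an HTML string instead of a URL.
 */
function getLinksXPath($html)
{
    /*** return array ***/
    $ret = array();

    /*** load the HTML (suppress errors, as in getLinks) ***/
    $dom = new DOMDocument;
    @$dom->loadHTML($html);

    /*** an XPath object bound to the document ***/
    $xpath = new DOMXPath($dom);

    /*** select only the anchors that have an href attribute ***/
    foreach ($xpath->query('//a[@href]') as $tag) {
        /*** textContent gathers the text of all descendant nodes ***/
        $ret[$tag->getAttribute('href')] = trim($tag->textContent);
    }

    return $ret;
}

// Sample markup is an assumption, chosen only for illustration
$html = '<p><a href="/docs.php">Documentation</a> <a name="top">no href</a> <a href="/support"><b>Help</b></a></p>';
print_r(getLinksXPath($html));
```

The named anchor without an href is skipped by the query, so only /docs.php and /support end up in the array.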

Example Usage


<?php

/*** a link to search ***/
$link = "http://php.net";

/*** get the links ***/
$urls = getLinks($link);

/*** check for results ***/
if (sizeof($urls) > 0)
{
    foreach ($urls as $key => $value)
    {
        echo $key . ' - ' . $value . '<br />';
    }
}
else
{
    echo "No links found at $link";
}
?>

Demonstration

/ -
/downloads - Downloads
/docs.php - Documentation
/get-involved - Get Involved
/support - Help
/manual/en/getting-started.php - Getting Started
/manual/en/introduction.php - Introduction
/manual/en/tutorial.php - A simple tutorial
/manual/en/langref.php - Language Reference
/manual/en/language.basic-syntax.php - Basic syntax
/manual/en/language.types.php - Types
/manual/en/language.variables.php - Variables
/manual/en/language.constants.php - Constants
/manual/en/language.expressions.php - Expressions
/manual/en/language.operators.php - Operators
/manual/en/language.control-structures.php - Control Structures
/manual/en/language.functions.php - Functions
/manual/en/language.oop5.php - Classes and Objects
/manual/en/language.namespaces.php - Namespaces
/manual/en/language.errors.php - Errors
/manual/en/language.exceptions.php - Exceptions
/manual/en/language.generators.php - Generators
/manual/en/language.references.php - References Explained
/manual/en/reserved.variables.php - Predefined Variables
/manual/en/reserved.exceptions.php - Predefined Exceptions
/manual/en/reserved.interfaces.php - Predefined Interfaces and Classes
/manual/en/context.php - Context options and parameters
/manual/en/wrappers.php - Supported Protocols and Wrappers
/manual/en/security.php - Security
/manual/en/security.intro.php - Introduction
/manual/en/security.general.php - General considerations
/manual/en/security.cgi-bin.php - Installed as CGI binary
/manual/en/security.apache.php - Installed as an Apache module
/manual/en/security.sessions.php - Session Security
/manual/en/security.filesystem.php - Filesystem Security
/manual/en/security.database.php - Database Security
/manual/en/security.errors.php - Error Reporting
/manual/en/security.globals.php - Using Register Globals
/manual/en/security.variables.php - User Submitted Data
/manual/en/security.magicquotes.php - Magic Quotes
/manual/en/security.hiding.php - Hiding PHP
/manual/en/security.current.php - Keeping Current
/manual/en/features.php - Features
/manual/en/features.http-auth.php - HTTP authentication with PHP
/manual/en/features.cookies.php - Cookies
/manual/en/features.sessions.php - Sessions
/manual/en/features.xforms.php - Dealing with XForms
/manual/en/features.file-upload.php - Handling file uploads
/manual/en/features.remote-files.php - Using remote files
/manual/en/features.connection-handling.php - Connection handling
/manual/en/features.persistent-connections.php - Persistent Database Connections
/manual/en/features.safe-mode.php - Safe Mode
/manual/en/features.commandline.php - Command line usage
/manual/en/features.gc.php - Garbage Collection
/manual/en/features.dtrace.php - DTrace Dynamic Tracing
/manual/en/funcref.php - Function Reference
/manual/en/refs.basic.php.php - Affecting PHP's Behaviour
/manual/en/refs.utilspec.audio.php - Audio Formats Manipulation
/manual/en/refs.remote.auth.php - Authentication Services
/manual/en/refs.utilspec.cmdline.php - Command Line Specific Extensions
/manual/en/refs.compression.php - Compression and Archive Extensions
/manual/en/refs.creditcard.php - Credit Card Processing
/manual/en/refs.crypto.php - Cryptography Extensions
/manual/en/refs.database.php - Database Extensions
/manual/en/refs.calendar.php - Date and Time Related Extensions
/manual/en/refs.fileprocess.file.php - File System Related Extensions
/manual/en/refs.international.php - Human Language and Character Encoding Support
/manual/en/refs.utilspec.image.php - Image Processing and Generation
/manual/en/refs.remote.mail.php - Mail Related Extensions
/manual/en/refs.math.php - Mathematical Extensions
/manual/en/refs.utilspec.nontext.php - Non-Text MIME Output
/manual/en/refs.fileprocess.process.php - Process Control Extensions
/manual/en/refs.basic.other.php - Other Basic Extensions
/manual/en/refs.remote.other.php - Other Services
/manual/en/refs.search.php - Search Engine Extensions
/manual/en/refs.utilspec.server.php - Server Specific Extensions
/manual/en/refs.basic.session.php - Session Extensions
/manual/en/refs.basic.text.php - Text Processing
/manual/en/refs.basic.vartype.php - Variable and Type Related Extensions
/manual/en/refs.webservice.php - Web Services
/manual/en/refs.utilspec.windows.php - Windows Only Extensions
/manual/en/refs.xml.php - XML Manipulation
/manual/en/refs.ui.php - GUI Extensions
/downloads.php#v5.6.39 - 5.6.39
/ChangeLog-5.php#5.6.39 - Release Notes
/migration56 - Upgrading
/downloads.php#v7.0.33 - 7.0.33
/ChangeLog-7.php#7.0.33 - Release Notes
/migration70 - Upgrading
/downloads.php#v7.1.25 - 7.1.25
/ChangeLog-7.php#7.1.25 - Release Notes
/migration71 - Upgrading
/downloads.php#v7.2.13 - 7.2.13
/ChangeLog-7.php#7.2.13 - Release Notes
/migration72 - Upgrading
/downloads.php#v7.3.0 - 7.3.0
/ChangeLog-7.php#7.3.0 - Release Notes
/migration73 - Upgrading
http://php.net/archive/2018.php#id2018-12-06-5 - PHP 7.0.33 Released
http://www.php.net/downloads.php - downloads page
http://windows.php.net/download/ - windows.php.net/download/
http://www.php.net/ChangeLog-7.php#7.0.33 - ChangeLog
http://php.net/supported-versions.php - PHP version support timelines
http://php.net/archive/2018.php#id2018-12-06-4 - PHP 7.1.25 Released
http://www.php.net/ChangeLog-7.php#7.1.25 - ChangeLog
http://php.net/archive/2018.php#id2018-12-06-3 - PHP 7.2.13 Released
http://www.php.net/ChangeLog-7.php#7.2.13 - ChangeLog
http://php.net/archive/2018.php#id2018-12-06-2 - PHP 5.6.39 Released
http://www.php.net/ChangeLog-5.php#5.6.39 - ChangeLog
http://php.net/archive/2018.php#id2018-12-06-1 - PHP 7.3.0 Released
http://php.net/manual/migration73.new-features.php#migration73.new-features.core.heredoc - Flexible Heredoc and Nowdoc Syntax
http://php.net/manual/migration73.other-changes.php#migration73.other-changes.pcre - PCRE2 Migration
http://php.net/manual/migration73.new-features.php#migration73.new-features.mbstring - Multiple MBString Improvements
http://php.net/manual/migration73.new-features.php#migration73.new-features.ldap - LDAP Controls Support
http://php.net/manual/migration73.new-features.php#migration73.new-features.fpm - Improved FPM Logging
http://php.net/manual/migration73.windows-support.php#migration73.windows-support.core.file-descriptors - Windows File Deletion Improvements
http://php.net/manual/migration73.deprecated.php - Several Deprecations
http://www.php.net/downloads - downloads
http://windows.php.net/download - PHP for Windows
http://www.php.net/ChangeLog-7.php#7.3.0 - ChangeLog
http://php.net/manual/en/migration73.php - migration guide
http://php.net/archive/2018.php#id2018-11-22-1 - PHP 7.3.0RC6 Released
https://wiki.php.net/todo/php73 - PHP Wiki
https://downloads.php.net/~cmb/ - download page
https://windows.php.net/qa/ - windows.php.net/qa/
https://bugs.php.net - bug reporting system
https://github.com/php/php-src/blob/php-7.3.0RC6/NEWS - NEWS
https://github.com/php/php-src/blob/php-7.3.0RC6/UPGRADING - UPGRADING
https://github.com/php/php-src/blob/php-7.3.0RC6/UPGRADING.INTERNALS - UPGRADING.INTERNALS
https://gist.github.com/cmb69/6d9574612d0fb78b8549e42ec096a5a6 - the manifest
https://qa.php.net/ - the QA site
http://php.net/archive/2018.php#id2018-11-08-3 - PHP 7.1.24 Released
http://www.php.net/ChangeLog-7.php#7.1.24 - ChangeLog
http://php.net/archive/2018.php#id2018-11-08-1 - PHP 7.3.0RC5 Released
https://github.com/php/php-src/blob/php-7.3.0RC5/NEWS - NEWS
https://github.com/php/php-src/blob/php-7.3.0RC5/UPGRADING - UPGRADING
https://github.com/php/php-src/blob/php-7.3.0RC5/UPGRADING.INTERNALS - UPGRADING.INTERNALS
https://gist.github.com/cmb69/a14634afdd52b7f69d65d2bd5a79ac99 - the manifest
http://php.net/archive/2018.php#id2018-10-25-1 - PHP 7.3.0RC4 Released
https://github.com/php/php-src/blob/php-7.3.0RC4/NEWS - NEWS
https://github.com/php/php-src/blob/php-7.3.0RC4/UPGRADING - UPGRADING
https://github.com/php/php-src/blob/php-7.3.0RC4/UPGRADING.INTERNALS - UPGRADING.INTERNALS
https://gist.github.com/cmb69/594d9a18290f1b019b2ba68a098413c6 - the manifest
http://php.net/archive/2018.php#id2018-10-11-1 - PHP 7.3.0RC3 Released
https://github.com/php/php-src/blob/php-7.3.0RC3/NEWS - NEWS
https://github.com/php/php-src/blob/php-7.3.0RC3/UPGRADING - UPGRADING
https://github.com/php/php-src/blob/php-7.3.0RC3/UPGRADING.INTERNALS - UPGRADING.INTERNALS
https://gist.github.com/cmb69/3b521933b5524c92e880fc96559a5f5c - the manifest
http://php.net/archive/2018.php#id2018-09-28-1 - PHP 7.3.0RC2 Released
https://github.com/php/php-src/blob/php-7.3.0RC2/NEWS - NEWS
https://github.com/php/php-src/blob/php-7.3.0RC2/UPGRADING - UPGRADING
https://github.com/php/php-src/blob/php-7.3.0RC2/UPGRADING.INTERNALS - UPGRADING.INTERNALS
https://gist.github.com/cmb69/ffe9373d127254a19e73e73251e4ff7d - the manifest
http://php.net/archive/2018.php#id2018-09-13-2 - PHP 7.3.0RC1 Released
https://github.com/php/php-src/blob/php-7.3.0RC1/NEWS - NEWS
https://github.com/php/php-src/blob/php-7.3.0RC1/UPGRADING - UPGRADING
https://github.com/php/php-src/blob/php-7.3.0RC1/UPGRADING.INTERNALS - UPGRADING.INTERNALS
https://gist.github.com/cmb69/224ae1ef28b1e3f2e0a62a4ab50966e4 - the manifest
http://php.net/archive/2018.php#id2018-08-30-1 - PHP 7.3.0.beta3 Released
https://github.com/php/php-src/blob/php-7.3.0beta3/NEWS - NEWS
https://github.com/php/php-src/blob/php-7.3.0beta3/UPGRADING - UPGRADING
https://github.com/php/php-src/blob/php-7.3.0beta3/UPGRADING.INTERNALS - UPGRADING.INTERNALS
https://gist.github.com/cmb69/aeef8c8877a451ba6fce6f990dd3860b - the manifest
http://php.net/archive/2018.php#id2018-08-16-1 - PHP 7.3.0.beta2 Released
https://github.com/php/php-src/blob/php-7.3.0beta2/NEWS - NEWS
https://github.com/php/php-src/blob/php-7.3.0beta2/UPGRADING - UPGRADING
https://github.com/php/php-src/blob/php-7.3.0beta2/UPGRADING.INTERNALS - UPGRADING.INTERNALS
https://gist.github.com/cmb69/4bfd2f4d54ebc01cd37ba3dc86f1f814 - the manifest
http://php.net/archive/2018.php#id2018-08-02-1 - PHP 7.3.0.beta1 Released
https://github.com/php/php-src/blob/php-7.3.0beta1/NEWS - NEWS
https://github.com/php/php-src/blob/php-7.3.0beta1/UPGRADING - UPGRADING
https://gist.github.com/cmb69/e666c3f1622321f868de9282bee67e43 - the manifest
http://php.net/archive/2018.php#id2018-07-19-1 - PHP 7.3.0alpha4 Released
https://github.com/php/php-src/blob/php-7.3.0alpha4/NEWS - NEWS
https://github.com/php/php-src/blob/php-7.3.0alpha4/UPGRADING - UPGRADING
https://gist.github.com/cmb69/b30366855341382046687ce7adb20f69 - the manifest
http://php.net/archive/2018.php#id2018-07-05-1 - PHP 7.3.0 alpha 3 Released
http://bugs.php.net - bug reporting system
https://github.com/php/php-src/blob/php-7.3.0alpha3/NEWS - NEWS
https://github.com/php/php-src/blob/php-7.3.0alpha3/UPGRADING - UPGRADING
https://gist.github.com/cmb69/e2e76ac0072474b019b0c9f1aef249f1 - the manifest
http://php.net/archive/2018.php#id2018-06-21-1 - PHP 7.3.0 alpha 2 Released
https://github.com/php/php-src/blob/php-7.3.0alpha2/NEWS - NEWS
https://github.com/php/php-src/blob/php-7.3.0alpha2/UPGRADING - UPGRADING
https://gist.github.com/cmb69/2c54d0972b296a905062f52c0852e7cb - the manifest
http://php.net/archive/2018.php#id2018-06-07-1 - PHP 7.3.0 alpha 1 Released
https://downloads.php.net/~stas/ - download page
https://github.com/php/php-src/blob/php-7.3.0alpha1/NEWS - NEWS
https://github.com/php/php-src/blob/php-7.3.0alpha1/UPGRADING - UPGRADING
https://gist.github.com/smalyshev/b0994d4dd138007237911429702ee040 - the manifest
http://php.net/archive/2018.php#id2018-02-01-1 - PHP 7.2.2 Released
http://www.php.net/ChangeLog-7.php#7.2.2 - ChangeLog
http://php.net/archive/2017.php#id2017-10-12-1 - PHP 7.2.0 Release Candidate 4 Released
https://bugs.php.net/ - bug tracking system
https://github.com/php/php-src/blob/php-7.2.0RC4/NEWS - NEWS
https://github.com/php/php-src/blob/php-7.2.0RC4/UPGRADING - UPGRADING
https://downloads.php.net/~remi/ - download
http://windows.php.net/qa/ - windows.php.net/qa/
https://wiki.php.net/todo/php72 - our wiki
http://php.net/archive/2017.php#id2017-09-28-2 - PHP 7.2.0 Release Candidate 3 Released
https://github.com/php/php-src/blob/php-7.2.0RC3/NEWS - NEWS
https://github.com/php/php-src/blob/php-7.2.0RC3/UPGRADING - UPGRADING
http://php.net/archive/2017.php#id2017-08-31-1 - PHP 7.2.0 Release Candidate 1 Released
https://github.com/php/php-src/blob/php-7.2.0RC1/NEWS - NEWS
https://github.com/php/php-src/blob/php-7.2.0RC1/UPGRADING - UPGRADING
http://php.net/archive/2017.php#id2017-08-17-1 - PHP 7.2.0 Beta 3 Released
https://github.com/php/php-src/blob/php-7.2.0beta3/NEWS - NEWS
https://github.com/php/php-src/blob/php-7.2.0beta3/UPGRADING - UPGRADING
http://php.net/archive/2017.php#id2017-07-06-2 - PHP 7.2.0 Alpha 3 Released
https://github.com/php/php-src/blob/php-7.2.0alpha3/NEWS - NEWS
https://github.com/php/php-src/blob/php-7.2.0alpha3/UPGRADING - UPGRADING
https://wiki.php.net/todo/php72#timetable - wiki
/archive/ - Older News Entries
/conferences - Conferences calling for papers
http://php.net/conferences/index.php#id2018-12-15-1 - SunshinePHP 2019
http://php.net/conferences/index.php#id2018-11-20-2 - Dutch PHP Conference 2019
http://php.net/conferences/index.php#id2018-10-12-2 - International PHP Conference 2019 - Spring Edition
http://php.net/conferences/index.php#id2018-12-10-1 - PHPKonf Istanbul PHP Conference 2019
http://php.net/conferences/index.php#id2018-11-20-1 - Dutch PHP Conference - CfP is open!
/cal.php - User Group Events
/thanks.php - Special Thanks
https://twitter.com/official_php -
/copyright.php - Copyright © 2001-2018 The PHP Group
/my.php - My PHP.net
/contact.php - Contact
/sites.php - Other PHP.net sites
/mirrors.php - Mirror sites
/privacy.php - Privacy policy
javascript:; -