Parsing HTML in PHP using Native Classes

If you have ever manipulated the DOM (Document Object Model) on the front end, you already know the basics of parsing HTML with JavaScript. All of that usually happens on the client side. But what if we want to process HTML on the server? The use cases there are broader than DOM manipulation alone. In this post, let us look at some useful PHP classes that enable us to process HTML on the server.

What is Parsing & What are its Uses?

Parsing (in this case) is the process of extracting or modifying useful information from an HTML or XML string. A parser gives us easy ways to query raw data instead of using regex.

Suppose you want to get all the links on a web page. PHP DOM parsing classes can help you.

The Table of Contents you see above is another simple application of PHP DOM parsing classes. The plugin extracts all the headings from the page, sorts them, creates a new element, and inserts it back into the page content.

Important DOM classes in PHP

There are around nineteen DOM-related classes in PHP. Some of the important ones are:

  • DOMDocument (extends DOMNode class)
  • DOMNode
  • DOMNodeList
  • DOMXPath
  • DOMElement (extends DOMNode class)

DOMDocument, Nodes & Elements

DOMDocument is the main class: it takes in HTML and gives us an object to interact with. It can load HTML or XML from a string or a file. The class defines several methods, such as getElementById(), that resemble their JavaScript counterparts.

$dom = new DOMDocument();

//examples

//methods to load HTML
$dom->loadHTML($html_string);
$dom->loadHTMLFile('path/to/htmlfile.html');

//methods to load XML
$dom->load('path/to/xmlfile.xml');
$dom->loadXML($xml_string);

$documentElement = $dom->documentElement;
// a DOMElement object for the root element (<html> for an HTML document)
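Real-world pages often contain markup that the underlying libxml parser complains about. Here is a minimal sketch of collecting those complaints instead of letting them surface as PHP warnings. The sample markup and variable names are my own assumptions, and whether a given tag triggers a message depends on the libxml version installed.

```php
<?php
// Hypothetical sample markup, standing in for a real page.
$html_string = '<html><body><section><p>Hello</p></section></body></html>';

// Ask libxml to store parse errors internally rather than emit PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html_string);

// Read whatever the parser reported, then clear the buffer and restore the old setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    echo trim($error->message) . ' (line ' . $error->line . ")\n";
}

$paragraph = $dom->getElementsByTagName('p')->item(0);
echo $paragraph->textContent . "\n";
```

The advantage over silencing errors is that the messages stay available if you ever need to debug a page that parses strangely.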

In this post, we will focus on manipulating HTML rather than XML.

Nodes

The DOM built from HTML is a tree of individual nodes. A node can be an element, text, a comment, an attribute, and so on. DOMNode is the base class from which all the node classes inherit.
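To make the different node types concrete, here is a small sketch that walks the children of one element and labels each by its nodeType. The markup is a made-up sample, not taken from the page we parse later.

```php
<?php
// Hypothetical sample: a text node, an element and a comment inside one div.
$html = '<div id="box">Hello <b>world</b><!-- a note --></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$box = $dom->getElementsByTagName('div')->item(0);

$kinds = [];
foreach ($box->childNodes as $child) {
    switch ($child->nodeType) {
        case XML_ELEMENT_NODE:
            $kinds[] = 'element <' . $child->nodeName . '>';
            break;
        case XML_TEXT_NODE:
            $kinds[] = 'text "' . $child->nodeValue . '"';
            break;
        case XML_COMMENT_NODE:
            $kinds[] = 'comment';
            break;
        default:
            $kinds[] = 'other';
    }
}

print_r($kinds);
```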

Elements

The DOMElement class extends DOMNode and represents the elements in your HTML markup. A DOMElement object can be any element, such as an image, div, span, or table.
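Since childNodes can also contain text nodes (including plain whitespace), it is often handy to keep only the elements. A minimal sketch, using a made-up list as sample markup:

```php
<?php
// Hypothetical sample markup with whitespace between the list items.
$html = "<ul>\n  <li>One</li>\n  <li>Two</li>\n</ul>";

$dom = new DOMDocument();
$dom->loadHTML($html);

$list = $dom->getElementsByTagName('ul')->item(0);

$items = [];
foreach ($list->childNodes as $child) {
    // Skip text and comment nodes; only DOMElement has a tagName.
    if ($child instanceof DOMElement) {
        $items[] = $child->tagName . ': ' . $child->textContent;
    }
}

print_r($items);
```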

Practical Examples

Without going further into theory, let us dive into some practical examples. First of all, we need some HTML data. For that, let us use one of the posts on this blog, about image optimization.

We will do the following jobs with our sample HTML:

  • Select an element by ID
  • Get elements by tag name
  • Find elements by class
  • Find all links in a page
  • Insert an HTML element
  • Delete an element
  • Work with attributes

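Here is the cURL request:

header('Content-Type: text/plain'); // the examples below echo plain text, not JSON
$url = "https://www.coralnodes.com/best-wordpress-image-optimization-plugins/";

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects

$res = curl_exec($ch);
if ($res === false) {
    exit('cURL error: ' . curl_error($ch));
}

curl_close($ch);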

The variable $res now contains the whole HTML of the web page.

Selecting by ID

If you look at our sample page, you can see that it contains two tables. Suppose I want to find the number of rows in the first table. Using Chrome DevTools, I found that the required table has the ID tablepress-3.

$dom = new DOMDocument();
@$dom->loadHTML($res); // @ silences the warnings libxml raises on real-world markup

$table = $dom->getElementById('tablepress-3'); // DOMElement
$child_elements = $table->getElementsByTagName('tr'); // DOMNodeList
$row_count = $child_elements->length - 1; // exclude the header row

echo "No. of rows in the table is " . $row_count;

The above code gives the output:

No. of rows in the table is 10
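One thing to watch out for: getElementById() returns null when no element has the given ID, and calling a method on null is a fatal error. Here is a minimal sketch of a guarded version. The helper name count_body_rows() and the small sample table are my own inventions for illustration.

```php
<?php
// Hypothetical stand-in for the downloaded page.
$res = '<table id="tablepress-3">
  <tr><th>Plugin</th><th>Saving</th></tr>
  <tr><td>A</td><td>10%</td></tr>
  <tr><td>B</td><td>20%</td></tr>
</table>';

// Hypothetical helper: number of rows excluding the header, or null if the table is missing.
function count_body_rows(DOMDocument $dom, string $id): ?int
{
    $table = $dom->getElementById($id);
    if ($table === null) {
        return null; // no element with that ID
    }
    // Subtract one for the header row.
    return $table->getElementsByTagName('tr')->length - 1;
}

$dom = new DOMDocument();
$dom->loadHTML($res);

$found = count_body_rows($dom, 'tablepress-3');
$missing = count_body_rows($dom, 'no-such-table');

var_dump($found, $missing);
```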

Selecting a Tag by Its Name

Both the DOMDocument and DOMElement classes have the method getElementsByTagName() which allows us to select elements using the name of the tag. For example, if we have to get all the h2 headings from a page, we can use this function.

$dom = new DOMDocument();
@$dom->loadHTML($res);

$h2s = $dom->getElementsByTagName('h2');
foreach( $h2s as $h2 ) {
    echo $h2->textContent . "\n";
}

The result:

Test Images
Results after Compression
ShortPixel
reSmush.it
Imagify
TinyPNG Compress JPEG & PNG Images
Kraken.IO
EWWW Image Optimizer
WP Smush
Do you actually need a Plugin to Optimize Images?
Consclusion
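The same heading list is the starting point for the table of contents idea mentioned at the beginning. Here is a rough sketch of how that could work; it is not the actual plugin code. The sample markup is invented, and it assumes every h2 already has an id attribute to link to.

```php
<?php
// Hypothetical sample markup; assumes each heading already carries an id.
$html = '<h2 id="intro">Intro</h2><p>Text</p><h2 id="usage">Usage</h2><p>More</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Build a <ul> with one linked <li> per heading.
$toc = $dom->createElement('ul');
foreach ($dom->getElementsByTagName('h2') as $h2) {
    $link = $dom->createElement('a');
    $link->setAttribute('href', '#' . $h2->getAttribute('id'));
    $link->appendChild($dom->createTextNode($h2->textContent));

    $item = $dom->createElement('li');
    $item->appendChild($link);
    $toc->appendChild($item);
}

// Put the list at the very top of the body.
$body = $dom->getElementsByTagName('body')->item(0);
$body->insertBefore($toc, $body->firstChild);

echo $dom->saveHTML($body);
```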

Find elements with a particular class

In JavaScript, the querySelectorAll() method makes it easy to select elements with a CSS selector. In PHP, it is not that straightforward. Instead, the DOMXPath class helps us query and traverse the DOM tree.

Example: Select all the tables with the class tablepress.

$dom = new DOMDocument();
@$dom->loadHTML($res);

$xpath = new DOMXPath($dom);
$tables = $xpath->query("//table[contains(@class,'tablepress')]");
$count = $tables->length;

echo "No. of tables " . $count;

Just like getElementsByTagName(), the query() method of DOMXPath returns a DOMNodeList. It takes an XPath expression as its argument. XPath is versatile enough to express almost any kind of query.

If you are new to XPath, this cheatsheet from Devhints.io lists many CSS and JS selectors alongside their corresponding XPath expressions. It will help you find the right expression for the query you want to perform.
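A caveat about the contains(@class, 'tablepress') expression used above: it is a plain substring test, so it would also match a class such as tablepress-legacy. A common XPath 1.0 idiom pads the class attribute with spaces to match whole class names only. The sketch below compares the two on some made-up markup.

```php
<?php
// Hypothetical sample: two tables carry the class, one only contains it as a substring.
$html = '<table class="tablepress"><tr><td>1</td></tr></table>
<table class="wide tablepress striped"><tr><td>2</td></tr></table>
<table class="tablepress-legacy"><tr><td>3</td></tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Substring match: also catches "tablepress-legacy".
$loose = $xpath->query("//table[contains(@class, 'tablepress')]");

// Whole-word match: pad the attribute with spaces and look for " tablepress ".
$exact = $xpath->query(
    "//table[contains(concat(' ', normalize-space(@class), ' '), ' tablepress ')]"
);

echo "loose: " . $loose->length . "\n";
echo "exact: " . $exact->length . "\n";
```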

Extract links from a page

Parsing opens a number of opportunities. Extracting the links from a web-page is one such use. That’s how crawlers crawl the world wide web.

Suppose I want to find all the external links to a particular website on a web page. In our sample page, I would like to find all the outbound links from the blog post to the wordpress.org website. This is how I did it:

$dom = new DOMDocument();
@$dom->loadHTML($res);

$links = $dom->getElementsByTagName('a');
$urls = [];
foreach($links as $link) {
    $url = $link->getAttribute('href');
    $parsed_url = parse_url($url);
    if( isset($parsed_url['host']) && $parsed_url['host'] === 'wordpress.org' ) {
        $urls[] = $url;
    }
}
var_dump($urls);
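The strict comparison above only matches the exact host wordpress.org. If you also want subdomains such as developer.wordpress.org, or hosts written with different capitalisation, a slightly looser check is needed. The following is a sketch of one way to do that. The helper is_host_or_subdomain() and the sample links are hypothetical, and this deliberately widens the original match rather than reproducing it.

```php
<?php
// Hypothetical sample markup with a mix of link types.
$html = '<p>
  <a href="https://wordpress.org/plugins/">Plugins</a>
  <a href="https://WordPress.org/themes/">Themes</a>
  <a href="https://developer.wordpress.org/">Docs</a>
  <a href="https://notwordpress.org/">Lookalike</a>
  <a href="/about/">Relative link</a>
  <a>No href at all</a>
</p>';

// Hypothetical helper: true if $host is $domain itself or one of its subdomains.
function is_host_or_subdomain(string $host, string $domain): bool
{
    $host = strtolower($host);
    $suffix = '.' . $domain;
    return $host === $domain || substr($host, -strlen($suffix)) === $suffix;
}

$dom = new DOMDocument();
$dom->loadHTML($html);

$urls = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    $url = $link->getAttribute('href'); // empty string when the attribute is missing
    $host = parse_url($url, PHP_URL_HOST); // null for relative or empty URLs
    if (is_string($host) && is_host_or_subdomain($host, 'wordpress.org')) {
        $urls[] = $url;
    }
}

print_r($urls);
```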

Modifying & Saving HTML

So far, we have seen how to extract or select the required data from HTML. Now, let us see how to modify it by adding or deleting elements and attributes.

Inserting new HTML element into the document

In this example, we will see how to add an image with a link after the first paragraph. This is how you insert banner ads between posts.

// $html holds the markup to modify, e.g. the page fetched earlier
$dom = new DOMDocument();
@$dom->loadHTML($html);

$ps = $dom->getElementsByTagName('p');
$first_para = $ps->item(0);

$html_to_add = '<div><a href="#"><img src="image.jpeg"/></a></div>';
$dom_to_add = new DOMDocument();
@$dom_to_add->loadHTML($html_to_add);
// loadHTML() wraps the fragment in <html><body>, so pick out the <div> itself
$new_element = $dom_to_add->getElementsByTagName('div')->item(0);

$imported_element = $dom->importNode($new_element, true); // true = deep copy
$first_para->parentNode->insertBefore($imported_element, $first_para->nextSibling);

$output = $dom->saveHTML();
echo $output;

Note that the saveHTML() method returns the manipulated HTML as a string.
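For small snippets like this banner, an alternative to parsing a second document is to build the nodes directly with createElement() and setAttribute(). Here is a minimal sketch of that approach on some made-up markup. It also passes a node to saveHTML() so that only the body is printed.

```php
<?php
// Hypothetical sample markup.
$html = '<p>First paragraph</p><p>Second paragraph</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$first_para = $dom->getElementsByTagName('p')->item(0);

// Build <div><a href="#"><img src="image.jpeg"></a></div> node by node.
$img = $dom->createElement('img');
$img->setAttribute('src', 'image.jpeg');

$link = $dom->createElement('a');
$link->setAttribute('href', '#');
$link->appendChild($img);

$wrapper = $dom->createElement('div');
$wrapper->appendChild($link);

// Insert right after the first paragraph.
$first_para->parentNode->insertBefore($wrapper, $first_para->nextSibling);

// Passing a node to saveHTML() serialises only that subtree.
$body = $dom->getElementsByTagName('body')->item(0);
echo $dom->saveHTML($body);
```

Because the nodes are created by the same document, no importNode() call is needed.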

Deleting an element from the document

To delete an element from our HTML, we can use the removeChild() method, which DOMElement inherits from DOMNode.

$html = '<p>This is our first paragraph</p>
<div class="del">Delete this</div>
<p>This is our second paragraph</p>
<p>This is our third paragraph</p>
<div class="del">Delete this too</div>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
echo $dom->saveHTML();

$xpath = new DOMXPath($dom);
$elems = $xpath->query("//div[@class='del']");

foreach( $elems as $elem ) {
    $elem->parentNode->removeChild($elem);
}
echo '<br><br>-------after deletion--------<br><br>';
echo $dom->saveHTML();

Here we run an XPath query to find all the div elements with the class del, then remove each node from the document by iterating over the DOMNodeList with a foreach loop. The rendered output:

This is our first paragraph
Delete this
This is our second paragraph
This is our third paragraph
Delete this too

-------after deletion--------

This is our first paragraph
This is our second paragraph
This is our third paragraph
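A word of caution if you remove nodes found with getElementsByTagName() rather than an XPath query: that method returns a live list, which changes while you are deleting from it, so a plain foreach can skip elements. As far as I know, XPath results are a fixed snapshot, which is why the loop above works. A simple way around the problem is to copy the list into an array first, as in this sketch with made-up markup:

```php
<?php
// Hypothetical sample markup.
$html = '<p>Keep</p><div>One</div><div>Two</div><div>Three</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Take a snapshot of the live list before changing the document.
$divs = iterator_to_array($dom->getElementsByTagName('div'));

foreach ($divs as $div) {
    $div->parentNode->removeChild($div);
}

$remaining = $dom->getElementsByTagName('div')->length;
echo "divs left: " . $remaining . "\n";
```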

Manipulating Attributes

Classes and IDs are not the only attributes we can access in PHP DOM. The DOMElement class has several methods to get, set, or remove attributes on an element. They look similar to their JavaScript counterparts, so you will find them easy to understand.

  • getAttribute($attribute_name) – get the value of an attribute
  • setAttribute($attribute_name, $attribute_value) – set the value of an attribute
  • hasAttribute($attribute_name) – check whether an element has a certain attribute; returns true or false
  • removeAttribute($attribute_name) – remove an attribute from an element

$html = '<span class="myclass" data-action="show">Content</span>';
$dom = new DOMDocument();
@$dom->loadHTML($html);
$elem = $dom->getElementsByTagName('span')->item(0);

if( $elem->hasAttribute('data-action') ) {
    echo 'attribute value is "' . $elem->getAttribute('data-action') . '"';
    $elem->setAttribute('data-action', 'hide');
    echo '<br>updated attribute value is "' . $elem->getAttribute('data-action') . '"';
}
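Removing an attribute works the same way. The sketch below, on made-up markup, removes data-action and also swaps one class name for another. PHP's DOMElement has no classList like JavaScript, so the class attribute is treated as a space-separated string here. That is just one possible approach.

```php
<?php
// Hypothetical sample markup.
$html = '<span class="myclass hidden" data-action="show">Content</span>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$elem = $dom->getElementsByTagName('span')->item(0);

// Drop the data attribute entirely.
$elem->removeAttribute('data-action');

// Replace the class "hidden" with "visible".
$classes = preg_split('/\s+/', trim($elem->getAttribute('class')), -1, PREG_SPLIT_NO_EMPTY);
$classes = array_values(array_diff($classes, ['hidden']));
$classes[] = 'visible';
$elem->setAttribute('class', implode(' ', $classes));

// List whatever attributes are left on the element.
foreach ($elem->attributes as $attr) {
    echo $attr->name . '="' . $attr->value . '"' . "\n";
}
```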

Conclusion

So far, we have looked at some of the important DOM APIs in PHP. I hope this helps you get started with parsing HTML and XML data. If anything is unclear, do ask in the comments.
