Tracking the anchor text for the incoming links in Google Tag Manager

Introduction

It’s been a long time since I took care of this blog’s “Analytics” (in the blacksmith’s house, a wooden knife). I noticed it would be cool to have info about the anchor text that referring sites use to link to me.

So I’m sharing the solution I built today to capture the anchor text used on the referring URLs and send that info back to Google Tag Manager. From there we’ll be able to send an event to APP+WEB or to any other place we want 🙂



How it works


Execution Flow Chart

The flow chart on the right shows how the execution flow works. We’ll have 2 main pieces:

– One GTM CUSTOM HTML Tag
– One PHP File

The first one is responsible for the main logic and for making a fetch request to the second one, which reads the visitor’s referrer page and scrapes it to try to find the anchor text of the link the user clicked.

We’re using extensive logic to avoid any kind of false positives/duplicate hits, for example when a user navigates back on a mobile phone or swipes. We don’t want to count these “page reloads” as landings, even though they may still hold valid referrer info.
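
To make the landing check easier to follow, here is a minimal Python sketch of the decision the tag takes before calling the endpoint. The function names `hostname` and `is_real_landing` are made up for illustration; the value 0 is the `performance.navigation.type` reported for a regular navigation, and the substring comparison of hostnames mirrors what the GTM code further down does.

```python
from urllib.parse import urlparse

NAVIGATE = 0  # performance.navigation.type for a regular navigation (not reload/back)


def hostname(url):
    """Return the lowercase hostname of a URL without a leading "www."."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_real_landing(navigation_type, referrer, current_url):
    """Decide whether a page view should be treated as a landing from another site."""
    # Reloads and back/forward navigations may still carry a referrer: skip them
    if navigation_type != NAVIGATE:
        return False
    # The referrer must be a usable http(s) URL
    referrer_host = hostname(referrer)
    if urlparse(referrer).scheme not in ("http", "https") or not referrer_host:
        return False
    # Internal navigation: our own hostname shows up in the referrer hostname
    return hostname(current_url) not in referrer_host
```

For example, `is_real_landing(0, "https://www.other.com/post", "https://www.example.com/")` is true, while the same call with a navigation type of 1 (a reload) is false.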

SERVER SIDE CODE

PHP Snippet Code

First we need to upload the following PHP snippet to any server supporting PHP 7.x (the short array syntax it uses needs at least PHP 5.4).

This code can be improved a lot, for example by adding a timeout in case the page is not reachable. If someone asks, I may add more sanity checks to the script.
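
As an idea of what such a timeout could look like, here is a small Python sketch. It is not a patch for the PHP script: the function name `fetch_html` and the 5 second default are assumptions for illustration. It only fetches http(s) URLs and returns `None` instead of failing when the page is too slow or unreachable.

```python
from urllib.parse import urlparse
from urllib.request import Request, urlopen


def fetch_html(url, timeout=5.0):
    """Return the page body as text, or None if it cannot be fetched in time."""
    # Refuse anything that is not a plain web URL (file://, ftp://, ...)
    if urlparse(url).scheme not in ("http", "https"):
        return None
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        # timeout is in seconds and covers connecting and each blocking read
        with urlopen(request, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except (OSError, ValueError, LookupError):
        # URLError, HTTPError and socket timeouts are all OSError subclasses
        return None
```

The caller can then return an empty list of links whenever `fetch_html` gives back `None`.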

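<?php
// David Vallejo (@thyngster)
// 2020-04-14
// Needs PHP7.X

if(!isset($_GET["url"])){
        die("missing url parameter");
}

$links = [];
$url = $_GET["url"];
// The request comes from our landing page, so HTTP_REFERER holds our own domain.
// Only http(s) URLs are fetched, so the endpoint can't be used to read local files.
if(isset($_SERVER["HTTP_REFERER"]) && preg_match('/^https?:\/\//i', $url)){
        $current_domain = str_replace("www.","", (string) parse_url($_SERVER["HTTP_REFERER"], PHP_URL_HOST));
        // @ silences the warning raised when the referring page is not reachable
        $referrer_link_html_content = @file_get_contents($url);
        if(!empty($referrer_link_html_content) && $current_domain !== ""){
                // Real pages are rarely valid HTML: keep parser warnings out of the JSON output
                libxml_use_internal_errors(true);
                $doc = new DOMDocument();
                $doc->loadHTML($referrer_link_html_content);
                libxml_clear_errors();

                $rows = $doc->getElementsByTagName('a');
                foreach ($rows as $row)
                {
                        if($row instanceof DOMElement){
                                // Case-insensitive search for our domain in the href
                                if(stripos($row->getAttribute('href'), $current_domain) !== false){
                                        $links[] = [
                                                "url" => $row->getAttribute('href'),
                                                "anchor_text" => $row->textContent
                                        ];
                                }
                        }
                }
        }
}
header('Content-type: application/json; charset=UTF-8');
header("Access-Control-Allow-Origin: *");
echo json_encode($links, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
exit;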

Python Snippet code

I know this code is not the best, since I’m not a Python coder, but it gives an overall idea of how to run this with Python.

It should be used like this:

python anchor.py REFERRER_LINK LINKTOSEARCH

#!/usr/bin/env python3
# use: python anchor.py REFERRER_LINK LINKTOSEARCH
# Needs Python 3 plus the requests and beautifulsoup4 packages
import json
import sys
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

links = []

# First argument: the referring page to scrape
if len(sys.argv) > 1:
    url = sys.argv[1]
else:
    print("REFERRER_LINK argument is missing")
    sys.exit(1)

# Second argument: a URL on our own site, only its domain is searched for
if len(sys.argv) > 2:
    link_to_search = sys.argv[2]
else:
    print("LINKTOSEARCH argument is missing")
    sys.exit(1)

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Keep every link whose href contains our domain (without "www.")
domain = urlparse(link_to_search).netloc.replace("www.", "")
for ahref in soup.select('a[href*="' + domain + '"]'):
    links.append({
        "url": ahref.attrs["href"],
        "anchor_text": ahref.text
    })

print(json.dumps(links, sort_keys=True, indent=4, separators=(',', ': ')))

GTM Custom HTML Code

NOTE: Remember that the following code needs to be added to GTM wrapped between <script></script> tags!

Also remember to update the endPointUrl value to the URL where we’ve uploaded the PHP script.

  (function(){
    try{
      var endPointUrl = 'https://domain.com/getLinkInfo.php';
      // We don't want this to run on page reloads or navigations. Just on Real Landings
      if (window.performance && window.performance.navigation && window.performance.navigation.type === 0) {
          var referrer = document.referrer;
          var current_url = document.location.href;

          var grab_hostname_from_url = function(url) {
              var h;
              var a = document.createElement("a");
              a.href = url;
              h = a.hostname.replace('www.', '');
              return h;
          }
          // Only continue if the current referrer is set to a valid URL
          if (referrer.match(/^(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)+[\w\-\._~:/?#[\]@!\$&'\(\)\*\+,;=.]+$/)) {
              // Only continue if the referrer domain is different from the current domain
              if (grab_hostname_from_url(referrer).indexOf(grab_hostname_from_url(current_url)) === -1) {
                  // Encode the referrer so its own query string doesn't break the url parameter
                  fetch(endPointUrl + '?url=' + encodeURIComponent(referrer)).then(function(response) {
                      return response.json();
                  }).then(function(json) {
                      json.forEach(function(link) {
                          if (current_url.indexOf(link.url)>-1) {
                              window.dataLayer.push({
                                  event: 'incoming-link',
                                  linked_url: link.url,
                                  landing_url: document.location.href,
                                  referring_url: referrer,
                                  // The endpoint returns the text under the anchor_text key
                                  anchor_text: link.anchor_text
                              });
                          }

                      })
                  });
              }
          }
      }
      
    }catch(e){}   
  })();

Now we’re only one step away from having this working: we need to set up a firing trigger for our tag. Ideally this should be the All Pages trigger, so it fires as soon as possible.

Reported Data Info

| dataLayer Key | dataLayer Value |
| --- | --- |
| event | incoming-link |
| linked_url | Current Link in the Referral Page |
| landing_url | Current URL |
| referring_url | Full Referrer Info |
| anchor_text | The Anchor Text on the referrer page linking to your site |
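
To show how these keys relate to the endpoint response, here is a rough Python sketch of the mapping the tag performs. The helper name `build_events` is hypothetical, and skipping empty hrefs and trimming the anchor text are extra assumptions that the tag itself does not make.

```python
def build_events(links, landing_url, referring_url):
    """Turn the endpoint response into one dataLayer-style dict per matching link."""
    events = []
    for link in links:
        url = link.get("url", "")
        # Same check as the tag: the scraped href must be part of the landing URL.
        # Empty hrefs are skipped here (an assumption, the tag does not do this).
        if url and url in landing_url:
            events.append({
                "event": "incoming-link",
                "linked_url": url,
                "landing_url": landing_url,
                "referring_url": referring_url,
                # Whitespace trimming is an extra step added for this sketch
                "anchor_text": link.get("anchor_text", "").strip(),
            })
    return events
```

Each returned dict has the same shape as the object pushed to the dataLayer.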

Caveats

Please note that this solution relies on the current document.referrer, so don’t expect it to work for all referrals. Some sites strip the full referrer info, like Google SERPs do, and some browsers may trim the referrer down to the origin for privacy reasons.
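
One possible way to spot these trimmed referrers before scraping is to check whether anything is left after the origin. This is only a sketch: `is_origin_only` is a made-up helper, and treating an empty path with no query string as "trimmed" is an assumption, since a homepage can legitimately link to us too.

```python
from urllib.parse import urlparse


def is_origin_only(referrer):
    """Return True when the referrer holds just scheme and host, with no path or query."""
    parts = urlparse(referrer)
    return bool(parts.netloc) and parts.path in ("", "/") and not parts.query
```

When it returns true, the endpoint would be scraping the referring site's homepage, so the anchor text is less likely to be found.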

It may also happen that the referring URL links to us in more than one place. In this case the scraping endpoint will return all the matching links and anchor texts. From that point on, it’s up to you how you report it in Google Analytics or any other tool 😀

In any case, this should work for most common referral traffic.
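
As one option among many, the returned anchor texts could be collapsed into a list of unique values before reporting. The Python sketch below assumes the response format of the endpoint above; the helper name `unique_anchor_texts` and the case-insensitive comparison are choices made for this example.

```python
def unique_anchor_texts(links):
    """Return the distinct anchor texts, in order, ignoring case and extra whitespace."""
    seen = set()
    texts = []
    for link in links:
        # Collapse runs of whitespace and line breaks into single spaces
        text = " ".join(link.get("anchor_text", "").split())
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            texts.append(text)
    return texts
```

The result can be joined into a single value, for example with `" | ".join(...)`, or reported as one event per anchor text.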

