David Vallejo - Web Analyst

Tracking the anchor text for the incoming links in Google Tag Manager


Introduction

It's been a long time since I last took care of this blog's "Analytics" (in the blacksmith's house, a wooden knife), and I noticed it would be cool to have info about the anchor text that sites linking to mine are using.

So I'm sharing the solution I built today to capture the anchor text used on the referring URLs and send it back to Google Tag Manager. From there we'll be able to send an event to APP+WEB or to any other place we want :)



How it works


Execution Flow Chart

The flow chart shows how the execution flow works. We'll have 2 main pieces:

- One GTM CUSTOM HTML Tag
- One PHP File

The first one is responsible for the main logic and for making an asynchronous HTTP request (fetch) to the second one, which reads the visitor's referrer page and scrapes it to try to find the anchor text that the user clicked.

We're using extensive logic to avoid any kind of false positives/duplicate hits, for example when a user navigates back on a mobile phone or swipes. We don't want to count these "page reloads" as landings even though they may still hold valid referrer info.
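To make the landing check easier to reason about, here is a minimal Python sketch of the same decision the GTM tag makes in the browser. The function names (hostname, is_real_landing) are made up for illustration, and navigation_type simply mirrors window.performance.navigation.type (0 = normal navigation, 1 = reload, 2 = back/forward). Note that the sketch compares hostnames for equality, while the tag further down uses a substring check.

```python
from urllib.parse import urlparse


def hostname(url):
    """Return the lowercase hostname of a URL without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_real_landing(navigation_type, referrer, current_url):
    """Decide whether a page view should be treated as a landing from another site."""
    # Reloads (1) and back/forward navigations (2) may still carry a referrer,
    # but they are not new landings.
    if navigation_type != 0:
        return False
    ref_host = hostname(referrer)
    # No usable referrer: nothing to scrape.
    if not ref_host:
        return False
    # Internal navigation: the referrer is our own site.
    return ref_host != hostname(current_url)
```

Only when all three conditions pass does it make sense to call the scraping endpoint.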

SERVER SIDE CODE

PHP Snippet Code

First we need to upload the following PHP snippet to any server supporting PHP 7.x (because of the use of array literals).

This code can be improved a lot, for example by adding a timeout in case the page is not reachable. If someone asks, I may add more sanity checks to the script.


<?php
// David Vallejo (@thyngster)
// 2020-04-14
// Needs PHP7.X

if (!isset($_GET["url"])) {
    die("missing url parameter");
}

$links = [];
if (isset($_SERVER["HTTP_REFERER"])) {
    // Page that linked to us (sent by the GTM tag as ?url=...)
    $url = $_GET["url"];
    $referrer_link_html_content = file_get_contents($url);

    // Domain of the page calling this endpoint (our own site), without "www."
    $current_domain = str_replace("www.", "", parse_url($_SERVER["HTTP_REFERER"], PHP_URL_HOST));

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; keep parser warnings out of the JSON output
    libxml_use_internal_errors(true);
    $doc->loadHTML($referrer_link_html_content);

    $rows = $doc->getElementsByTagName('a');
    foreach ($rows as $row) {
        if ($row instanceof DOMElement) {
            // preg_quote() so the dots in the domain are matched literally
            preg_match_all('/' . preg_quote($current_domain, '/') . '/i', $row->getAttribute('href'), $matches, PREG_OFFSET_CAPTURE);
            if (count($matches[0]) > 0) {
                $links[] = [
                    "url" => $row->getAttribute('href'),
                    "anchor_text" => $row->textContent
                ];
            }
        }
    }
}
header('Content-type: application/json; charset=UTF-8');
header("Access-Control-Allow-Origin: *");
echo json_encode($links, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
exit;
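The timeout and sanity checks mentioned before the PHP snippet could look roughly like this. It's only a sketch in Python using the standard library, not what the endpoint above does: the fetch_html name and the 5 second default are my own assumptions. It refuses anything that is not http(s) (so nobody can point the endpoint at a local file) and gives up quietly if the page does not answer in time.

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=5.0):
    """Download a page, returning its HTML or None if it can't be fetched safely."""
    # Sanity check: only plain web URLs, never file:// or other schemes.
    if not url.lower().startswith(("http://", "https://")):
        return None
    try:
        # timeout is in seconds; an unreachable page no longer blocks the endpoint.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        # URLError and socket timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return None
```

When fetch_html returns None the endpoint can just answer with an empty list, so the tag has nothing to push.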

Python Snippet code

I know this code is not the best since I'm not a Python coder, but it gives an overall idea of how to run this with Python.

It should be used like:

python anchor.py REFERRER_LINK LINKTOSEARCH


#!/usr/bin/env python3
# use: python anchor.py REFERRER_LINK LINKTOSEARCH
import json
import sys
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

links = []

# First argument: the referring page to download and scrape
if len(sys.argv) > 1:
    url = sys.argv[1]
else:
    print("URL argument is missing")
    sys.exit()

# Second argument: a URL on our own site; its hostname is what we search for in the links
if len(sys.argv) > 2:
    referrer = sys.argv[2]
else:
    print("REFERRER argument is missing")
    sys.exit()

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

# Keep every <a> whose href contains our hostname (without "www.")
for ahref in soup.select('a[href*="' + urlparse(referrer).netloc.replace("www.", "") + '"]'):
    links.append({
        "url": ahref.attrs["href"],
        "anchor_text": ahref.text
    })

print(json.dumps(links, sort_keys=True, indent=4, separators=(',', ': ')))
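If you want to see what the scraping step returns without hitting the network or installing anything, here is a small sketch that does the same matching with Python's built-in html.parser. The AnchorCollector class and find_anchors function are names I made up for this example, and the HTML string and example.com domain are just sample data.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects href and text for every <a> whose href contains a given domain."""

    def __init__(self, domain):
        super().__init__()
        self.domain = domain.lower()
        self.links = []
        self._href = None   # href of the matching <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if self.domain in href.lower():
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append({
                "url": self._href,
                "anchor_text": "".join(self._text).strip(),
            })
            self._href = None


def find_anchors(html, domain):
    parser = AnchorCollector(domain)
    parser.feed(html)
    parser.close()
    return parser.links


sample = ('<p>Read <a href="https://www.example.com/post/">this <b>great</b> guide</a> '
          'or <a href="https://other.org/">that</a>.</p>')
print(find_anchors(sample, "example.com"))
```

Only the first link matches, and its anchor text includes the text nested inside the bold tag.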

GTM Custom HTML Code

NOTE Remember that the following code needs to be added to GTM wrapped between <script></script> tags!

Also remember that we need to update the endPointUrl value to the domain where we've uploaded the PHP script.


  (function(){
    try{
      var endPointUrl = 'https://domain.com/getLinkInfo.php';
      // We don't want this to run on page reloads or navigations. Just on Real Landings
      if (window.performance && window.performance.navigation && window.performance.navigation.type === 0) {
          var referrer = document.referrer;
          var current_url = document.location.href;

          // Returns the hostname of a URL without the "www." prefix
          var grab_hostname_from_url = function(url) {
              var h;
              var a = document.createElement("a");
              a.href = url;
              h = a.hostname.replace('www.', '');
              return h;
          }
          // Only continue if the current referrer is set to a valid URL
          if (referrer.match(/^(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)+[\w\-\._~:/?#[\]@!\$&'\(\)\*\+,;=.]+$/)) {
              // current referrer domain != current_domain
              if (grab_hostname_from_url(referrer).indexOf(grab_hostname_from_url(current_url)) === -1) {
                  // Encode the referrer so its own query string survives the trip
                  fetch(endPointUrl + '?url=' + encodeURIComponent(referrer)).then(function(response) {
                      return response.json();
                  }).then(function(json) {
                      json.forEach(function(link) {
                          // Only report links that point to the page we landed on
                          if (current_url.indexOf(link.url) > -1) {
                              window.dataLayer.push({
                                  event: 'incoming-link',
                                  linked_url: link.url,
                                  landing_url: document.location.href,
                                  referring_url: referrer,
                                  // The endpoint returns the text under the "anchor_text" key
                                  anchor_text: link.anchor_text
                              });
                          }
                      })
                  });
              }
          }
      }
    }catch(e){}
  })();

Now we're only one step away from having this working: we need to set up a firing trigger for our tag. Ideally this should be the All Pages trigger so it fires as soon as possible.

Reported Data Info

dataLayer Key | dataLayer Value
event | incoming-link
linked_url | Current Link in the Referral Page
landing_url | Current URL
referring_url | Full Referrer Info
anchor_text | The Anchor Text on the referrer page linking to your site

Caveats

Please note that this solution relies on the current document.referrer, so don't expect it to work for all referrals: some sites strip the full referrer info, like Google SERPs do, and some browsers may trim the referrer down to the origin for privacy reasons.
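One way to spot the trimmed case is to check whether the referrer carries anything beyond the origin. This is just a sketch in Python with a made-up function name (referrer_is_origin_only); the same check could be done in the tag before calling the endpoint, since scraping a homepage is unlikely to find the link the visitor actually clicked.

```python
from urllib.parse import urlparse


def referrer_is_origin_only(referrer):
    """True when the referrer has a host but no path or query beyond '/'."""
    parts = urlparse(referrer)
    return bool(parts.netloc) and parts.path in ("", "/") and not parts.query


print(referrer_is_origin_only("https://www.google.com/"))
print(referrer_is_origin_only("https://blog.example.org/posts/gtm-tips?x=1"))
```

The first call prints True (origin only) and the second prints False (full page URL available).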

It may also happen that the referring URL links to us in more than one place. In this case the scraping endpoint will return all the matching links and anchor texts. From that point on, it's up to you how you report it in Google Analytics or any other tool :D

In any case this should work for most common referral traffic.
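As one possible way to handle several matches, the sketch below collapses them into a single value by joining the distinct anchor texts of the links that point to the landing page. The summarise_anchors name and the " | " separator are my own choices, and the input has the same shape the endpoint returns (a list of objects with url and anchor_text).

```python
def summarise_anchors(links, landing_url, separator=" | "):
    """Join the distinct anchor texts of all links that point to landing_url."""
    seen = []
    for link in links:
        url = link.get("url", "")
        # Collapse runs of whitespace so "Great  guide" and "Great guide" count once.
        text = " ".join(link.get("anchor_text", "").split())
        # Same substring check the GTM tag uses to match a link to the landing page.
        if url and url in landing_url and text and text not in seen:
            seen.append(text)
    return separator.join(seen)


links = [
    {"url": "https://example.com/post/", "anchor_text": "Great  guide"},
    {"url": "https://example.com/post/", "anchor_text": "Great guide"},
    {"url": "https://example.com/post/", "anchor_text": "read more"},
    {"url": "https://example.com/other/", "anchor_text": "elsewhere"},
]
print(summarise_anchors(links, "https://example.com/post/?utm_source=x"))
# prints: Great guide | read more
```

Sending one joined value keeps it to a single event per landing; pushing one event per link, as the tag does, is just as valid if you prefer to count every link.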
