SEO meets GA: Tracking search bot visits with the Measurement Protocol

Lately I’ve been attending some talks (and having some chats) about log parsing from the SEO perspective (from @David Sottimano at the Untagged Conference, and with Lino Uruñuela over dinner), and I’ve decided to publish a WordPress plugin that I started working on some years ago. For work reasons it ended up in my “I’ll do it” drawer and I never got back to it.

First of all, I need to point out that this is a BETA PLUGIN, so please be careful about using it on a high-traffic or production site. I’ve been running it on this site for 4 days without any problems, but that doesn’t mean it’s free of bugs. For now, let’s consider this plugin a proof of concept.

The main task of the plugin is to register search bot visits to our WordPress site in Google Analytics, using the Measurement Protocol.

The plugin’s workflow is simple: it checks whether the visiting User Agent matches any known crawler and, if it does, sends a pageview to a Google Analytics property. Please keep in mind that it’s recommended to use a new property, since we’re going to use a lot of custom dimensions to track some extra info besides the visited pages =)
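The plugin itself is written in PHP and relies on uap-core for detection, so the following JavaScript is only a minimal sketch of the idea, not the plugin’s code. The two-entry crawler list, the `detectBot` and `buildPageviewPayload` names, and the use of custom dimension index 1 for the bot name are all assumptions made for illustration; `v`, `tid`, `cid`, `t`, `dh`, `dp` and `cdN` are standard Measurement Protocol parameters.

```javascript
// Hypothetical, heavily shortened crawler list (the real plugin uses uap-core)
var KNOWN_BOTS = [
  { name: 'Googlebot', pattern: /Googlebot/i },
  { name: 'Bingbot', pattern: /bingbot/i }
];

// Returns the bot name for a user agent string, or null for regular visitors
function detectBot(userAgent) {
  for (var i = 0; i < KNOWN_BOTS.length; i++) {
    if (KNOWN_BOTS[i].pattern.test(userAgent)) {
      return KNOWN_BOTS[i].name;
    }
  }
  return null;
}

// Builds the Measurement Protocol payload for a pageview hit,
// or returns null when the visitor is not a known crawler
function buildPageviewPayload(hit) {
  var botName = detectBot(hit.userAgent);
  if (!botName) {
    return null;
  }
  var params = {
    v: '1',               // protocol version
    tid: hit.propertyId,  // Google Analytics property ID
    cid: hit.clientId,    // client ID
    t: 'pageview',        // hit type
    dh: hit.hostname,     // document hostname
    dp: hit.path,         // document path
    cd1: botName          // custom dimension 1 (index is an assumption)
  };
  return Object.keys(params).map(function (key) {
    return key + '=' + encodeURIComponent(params[key]);
  }).join('&');
}
```

The resulting string would then be sent server-side to the Measurement Protocol collection endpoint (https://www.google-analytics.com/collect).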

I used to have my own User Agent parser, but I ended up using another well-established (and surely more reliable) library. When something works there’s no need to reinvent the wheel :). So this plugin uses the PHP library for the uap-core project.

Let’s see a simple flow chart about what the plugin does:

I’m sure this was easy enough to understand. But we don’t only want to check which pages were visited by a search bot; we’re going further and we’ll be tracking the following:

  • The Bot Name (Googlebot, Bingbot)
  • The Bot Version (Desktop, Smartphone, Feature Phone)
  • The Response Code Status (200, 404)
  • The page generation time (in ms)
  • Total memory used to render the HTML (in MB)
  • Total queries needed to return the HTML to the bot (an integer with the total number of MySQL queries).
  • A userID for the bot (based on the IP long value of the current bot).
  • A clientID (a UUIDv4 string based on the bot IP address, which will allow us to check how often that same bot returns to our site and to track the specific pages crawled by a specific bot in each session).
  • The real bot user agent, in order to debug and improve our detection engine.

So now we’ll be able to answer the following questions:

  • Which bots visit my content
  • Which content was viewed by each different bot
  • When this content was crawled for the first time
  • Which 404 pages are being crawled, and by which search bots
  • How often Googlebot or any other search bot visits my domain or a specific piece of content
  • How many different bots (IP addresses) have visited my site, and how often they come back
  • Which pages each bot crawled in each session
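To make the userID and clientID ideas above more concrete, here is a small JavaScript sketch of how both could be derived from a bot’s IPv4 address. It is only an illustration: the plugin is PHP and its actual derivation is not shown in this post, so `ipToLong`, `clientIdFromIp` and the FNV-1a based hashing are assumptions, chosen simply to get a stable, UUIDv4-shaped string for the same IP.

```javascript
// Converts a dotted IPv4 address into its integer ("IP long") value
function ipToLong(ip) {
  var parts = ip.split('.');
  if (parts.length !== 4) {
    return null;
  }
  var result = 0;
  for (var i = 0; i < 4; i++) {
    var octet = parseInt(parts[i], 10);
    if (isNaN(octet) || octet < 0 || octet > 255) {
      return null;
    }
    result = result * 256 + octet;
  }
  return result;
}

// Small non-cryptographic 32-bit hash (FNV-1a), seeded to get several words
function fnv1a(str, seed) {
  var hash = (2166136261 ^ seed) >>> 0;
  for (var i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 16777619) >>> 0;
  }
  return hash;
}

// Left-pads a 32-bit number to 8 hex characters
function hex8(n) {
  return ('00000000' + n.toString(16)).slice(-8);
}

// Builds a deterministic, UUIDv4-shaped client ID from the IP address
function clientIdFromIp(ip) {
  var h = hex8(fnv1a(ip, 1)) + hex8(fnv1a(ip, 2)) + hex8(fnv1a(ip, 3)) + hex8(fnv1a(ip, 4));
  // Force the version nibble to 4 and the variant nibble to 8, 9, a or b
  var variant = ((parseInt(h.charAt(16), 16) & 0x3) | 0x8).toString(16);
  return h.slice(0, 8) + '-' + h.slice(8, 12) + '-4' + h.slice(13, 16) +
    '-' + variant + h.slice(17, 20) + '-' + h.slice(20, 32);
}
```

Because the same IP always yields the same pair of values, returning crawlers would show up as returning users in the reports.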

And you’ll surely find answers to many more questions: since we’re using Google Analytics to track those visits, we’ll be able to cross any of the dimensions as needed.

Another cool thing about tracking bot crawls with the Measurement Protocol is that we’ll be able to watch how our site is being crawled in the real-time reports! 🙂

Setup

You’ll just need to download the plugin zip file from the following url, drop it in your WordPress plugins folder, and configure the Google Analytics Property ID you want to send your data to.

Used Custom Dimensions

You may be wondering why we have the same bot-related dimensions duplicated with different scopes. This is because, as I explained before, we’re using the bot IP address to build the clientID and the userID, and it may happen that Google uses the same IP for different bots (like Desktop or Feature Phone). This way we keep the hit-level info too, in case the user-scoped data gets overwritten 🙂

Another thing we may want to do is set the session timeout to 4 hours in our Google Analytics session settings. Bot crawls don’t behave the way a user navigates a site, and we may be getting 2 page hits per hour, so the default 30-minute timeout makes no sense at all.

Let’s now see how the reports will look in Google Analytics 🙂

Consumed content by bots with an hourly breakdown

 

Total sessions and pageviews by search bot

 

Pages that returned a 404 and which bot was crawling them

Which pages a certain bot crawled (User Explorer Report)

You can get the plugin from the following GitHub repository:
https://github.com/thyngster/wp-seo-ga

If you are unable to run the plugin, please drop me a comment on this post or open an issue on GitHub and I’ll try to take a look at it.

Any suggestions/improvements will be very welcome too 🙂

Cross-Domain tracking with clean urls

I’ve been told by a lot of clients that the way Google Analytics cross-domain tracking works is “ugly”, referring to the linker param attached to the URL.

I must admit it’s not elegant to have that long hash in the URL, though it won’t affect the page functionality. On the other hand, there isn’t any other way to pass the current client ID from the Universal Analytics cookie to the destination domain without dealing with server-side hacks (we can’t read POST data in JS, yet).

Browsers have the History API, which holds the current user navigation history, allows us to manipulate it, and is widely supported by browsers:

history api support by browser

If you have ever dealt with an Ajax-based website, I’m sure you have noticed that even if the page does not reload, the URL changes.

The history API does allow us to play with the current user session history, for example:

window.history.length

The above line will return the number of elements in the session history; if you have browsed 4 pages in the current session, it’ll return 4.

window.history.back()

Will return the user back to the previous page in the session.

But we’re going to focus on the pushState and replaceState methods. pushState adds a new entry to the history record, replaceState modifies the current one, and both allow us to change the current page URL without reloading the page.

I bet you’re guessing that we’re going to strip out the _ga parameter with those functions, and you’re right. This won’t be harmful for the cross-domain tracking, since we’re going to do it after the Google Analytics object has been created. It won’t affect our implementation, and we’ll end up showing the user a cleaned-up URL after Google Analytics has done all its cross-domain tracking magic.

We’ll be using “replaceState” in this example, so that users clicking the back button aren’t sent to the same page again. This method just changes the URL but WON’T add a new entry to the session history.

To achieve this hack, we’ll be using the hitCallback for our Pageview Tag on Google Tag Manager.

First, we are going to need a variable that takes care of reading the current URL, cleaning it up, and manipulating the browser’s URL using the History API.

I’m calling it “remove _ga from url pushState”, feel free to name it at your convenience:

function(){
  return function(){
      // Only act when the _ga linker parameter is present in the querystring
      if (document.location.search.match(/(^\?|&)_ga=/)) {
          var new_url;
          // Split the querystring into its key=value pairs and drop the _ga one.
          // The pairs are kept exactly as they are, so nothing gets re-encoded.
          var pairs = document.location.search.replace(/^\?/, '').split('&').filter(function(pair) {
              return pair !== '' && pair.split('=')[0] !== '_ga';
          });
          // Let's rebuild the querystring with the _ga parameter removed
          var rebuilt_querystring = pairs.join('&');
          // If _ga was the only parameter, we don't want a trailing "?"
          if (pairs.length === 0) {
              new_url = location.pathname + location.hash;
          } else {
              new_url = location.pathname + '?' + rebuilt_querystring + location.hash;
          }
          // Use replaceState to update the current page URL
          window.history.replaceState({}, document.title, new_url);
      }
    }
}

Now we only need to add this new variable as the hitCallback value for our pageview tag:

So this is what is going to happen now:

1. Google Analytics Object will be created
2. It will process the linker parameter, overriding the current landing domain clientId value as long as the linkerParam value is legit
3. After that the current page URL will be changed for the same URL but with the _ga parameters stripped out.
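The URL clean-up step can also be expressed as a small, standalone function, which makes it easy to try outside of GTM. This is just a sketch under the assumption that the browser supports the `URL` and `URLSearchParams` APIs (older browsers such as Internet Explorer don’t); `stripLinkerParam` is a hypothetical name. Note that `URLSearchParams` re-serializes the remaining parameters, so their encoding may be normalized (for example `%20` becoming `+`).

```javascript
// Returns path + querystring + hash for the given URL, without the _ga parameter
function stripLinkerParam(href) {
  var url = new URL(href);
  // Removes every occurrence of the _ga parameter
  url.searchParams.delete('_ga');
  // url.search is an empty string when no parameters are left
  return url.pathname + url.search + url.hash;
}
```

In the browser, the result could be fed to `window.history.replaceState({}, document.title, stripLinkerParam(document.location.href))`.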

Bringing back utm_nooverride functionality to Universal Analytics

Universal Analytics removed the utm_nooverride=1 functionality. We can still define a list of referral domains to be treated as direct visits in our property configuration, but what about when we can’t control the source domains? For example, for emailings, or for some display campaign that we don’t want to override our users’ original attribution.

We’re going to use Google Tag Manager to bring this functionality back to our implementations.

First we need a variable that reads whether there is a querystring parameter named utm_nooverride, and its value.

Ok, this variable will hold the value “1” when the utm_nooverride parameter is present. Now we’re going to use it to force the “dr” (document referrer) parameter only in that situation.
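In GTM the querystring variable is normally configured through the UI (a URL variable with the Query component), so no code is needed for it. For reference, the logic behind both variables can be sketched in plain JavaScript as follows; `getQueryParam` and `pickReferrer` are hypothetical names used only for this illustration.

```javascript
// Returns the decoded value of a querystring parameter, or undefined if absent
function getQueryParam(search, name) {
  var pairs = search.replace(/^\?/, '').split('&');
  for (var i = 0; i < pairs.length; i++) {
    var idx = pairs[i].indexOf('=');
    var key = idx === -1 ? pairs[i] : pairs[i].slice(0, idx);
    if (key === name) {
      return idx === -1 ? '' : decodeURIComponent(pairs[i].slice(idx + 1));
    }
  }
  return undefined;
}

// Picks the referrer to report: our own origin when utm_nooverride=1 is present
function pickReferrer(search, origin, referrer) {
  return getQueryParam(search, 'utm_nooverride') === '1' ? origin : referrer;
}
```

In the browser this would be called as `pickReferrer(document.location.search, document.location.origin, document.referrer)`.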

For that we’re going to need an extra Custom JavaScript variable with the following code on it:

Let’s be lazy!, you can copy this little piece of code below:

function(){
  if({{QS - utm_nooverride}}=="1"){
      return document.location.origin;
  }else{
      return document.referrer;
  }
}

We’re almost set. Now we want to force our pageview tag to use this newly created variable for the “referrer” field.

We’re done! Now, if the utm_nooverride parameter is present on the landing page, Google Tag Manager will send the current domain name as the referrer, forcing that new visit to be treated as direct traffic.

UPDATE: I don’t recall if the override took precedence over campaign parameters. If you know about it, please drop a comment :). Otherwise I’ll be checking it in the next few days.

#Tip – Finding out if a key-value has been already pushed into the dataLayer

Sometimes we may need to know whether some info has already been pushed into Google Tag Manager’s dataLayer in order to take some action. For example, we may need to know if some custom event is already in the dataLayer to prevent something from being fired twice, or we may need to check if some value is already in place.

The following snippet will give us that info, looping through all the dataLayer pushes and returning -1 if the key-value was not found, or the index of the dataLayer push that matches.

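var isValueInDatalayer = function($key, $val) {
    var gtm = window.google_tag_manager;
    var dlname, name, i, dl;
    // Nothing to check if GTM is not loaded on the page
    if (typeof (gtm) == "undefined") {
        return -1;
    }
    // Autodiscover the dataLayer variable name used by the container
    for (name in gtm) {
        if (gtm[name] && typeof (gtm[name]) == "object" && gtm[name].gtmDom)
            dlname = name;
    }
    dl = window[dlname] || [];
    // Loop through the pushes by index, so a number is returned
    for (i = 0; i < dl.length; i++) {
        if (!dl[i] || typeof (dl[i][$key]) == "undefined") {
            continue;
        }
        // With no value given, any push holding the key is a match
        if (typeof ($val) == "undefined" || dl[i][$key] == $val) {
            return i;
        }
    }
    return -1;
};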

Let’s see how we can use it. We’re going to imagine that we are on a vanilla GTM container with no dataLayer pushes from the page:


isValueInDatalayer('event');

This call above will return 0, since it will match the first event key pushed.


isValueInDatalayer('event','gtm.load');

The call above will return 2, since the gtm.load event was found at position 2 within the dataLayer pushes array (note that it starts at 0).


isValueInDatalayer('non-existent-key');
isValueInDatalayer('event','non-existent-event-value');

The last two examples will return -1, since they won’t match the current data in our dataLayer.


If you’re using a custom dataLayer variable name, there’s no need for you to modify the code since it’ll autodiscover your current dataLayer variable name.
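As a usage idea for the “prevent something from being fired twice” case mentioned earlier, here is a hedged sketch of a guard built on the same lookup. To keep it self-contained and testable outside of GTM, it takes the dataLayer array as an argument instead of autodiscovering it; `findInDataLayer` and `pushEventOnce` are hypothetical helper names, not part of GTM.

```javascript
// Same lookup idea as above, but working on any array of pushes
function findInDataLayer(dataLayer, key, val) {
  for (var i = 0; i < dataLayer.length; i++) {
    var push = dataLayer[i];
    if (!push || typeof push[key] === 'undefined') {
      continue;
    }
    if (typeof val === 'undefined' || push[key] === val) {
      return i;
    }
  }
  return -1;
}

// Pushes the event only if it is not in the dataLayer yet.
// Returns true when the push happened, false when it was skipped.
function pushEventOnce(dataLayer, eventName) {
  if (findInDataLayer(dataLayer, 'event', eventName) !== -1) {
    return false;
  }
  dataLayer.push({ event: eventName });
  return true;
}
```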

Let me know any improvement or ideas for this useful function on the comments 🙂

How to keep your returning user’s legacy data when switching domain name

When we’re switching a site’s domain name, we usually follow some basic steps so the migration doesn’t end up being a mess. One of those steps is usually 301-redirecting our old domain content to the new one, but we rarely think about how this will affect our current Google Analytics data.

The Universal Analytics cookie is tied to the domain hostname, so if we switch domains a new cookie will be created along with a new client ID, meaning that all the visits we redirect will end up as new visitors. This means we’ll be losing ALL our previous attribution/history data for returning visitors. Doh!

This time, we’ll try to mitigate this problem using Google Tag Manager and some Mod Rewrite (htaccess) magic.

We’ll be using Apache’s Rewrite module to read the current user’s “_ga” cookie and pass it along with our redirection. Then, from GTM, we’ll force the clientId in our tracker in order to keep our old users’ clientId on our new domain 🙂

Below you can find our .htaccess. As you can see, we check for the _ga cookie value, and then we redirect the user to the new domain with a new parameter named “_mga”, which holds the _ga cookie value and the timestamp.

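RewriteEngine On
RewriteBase /
# If the visitor has a _ga cookie, pass its value and the current time along
RewriteCond %{HTTP_COOKIE} _ga=([^;]+) [NC]
RewriteRule ^(.*)$ http://www.new-domain.com/$1?_mga=%1.%{TIME} [R=301,QSA,L]
# No _ga cookie: plain redirect (in .htaccess the matched path has no leading slash)
RewriteRule ^(.*)$ http://www.new-domain.com/$1 [R=301,QSA,L]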

You may be asking yourself why we are adding the current timestamp (%{TIME}) as a parameter value. The reason is pretty simple: we don’t want someone sharing that URL with someone else and ending up with a lot of users sharing the same clientId, do we?

We’ll use that value later in Google Tag Manager to check whether the redirection was generated less than 120 seconds ago; if not, we won’t return any value. This is how the native Universal Analytics cross-domain feature works too!

if ({{_mga.timestamp}} > 120)
    return;

If the difference between the _mga timestamp and the current user timestamp is higher than “120”, it means the link was generated more than 2 minutes ago, so we don’t want to push any clientId back in this case.

The current format for the %{TIME} value from mod_rewrite is the following:

{{YEAR}}{{MONTH}}{{DAY}}{{HOUR}}{{MINUTE}}{{SECOND}}

And it will likely be using the UTC timezone (this depends on how the server clock is configured, so it’s worth double-checking). This is important since the check will be made client-side, and we’re gonna need to compare against the current user time in UTC, not in the client’s timezone.
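The timestamp check can be sketched as a small standalone function. This is only an illustration that assumes the server stamp really is in UTC; `mgaAgeInSeconds` is a hypothetical name. It parses the YYYYMMDDHHMMSS stamp with `Date.UTC` and returns how many seconds old it is relative to a given “now”.

```javascript
// Returns the age in seconds of a YYYYMMDDHHMMSS stamp (assumed to be UTC),
// or null when the stamp does not have the expected format
function mgaAgeInSeconds(stamp, nowMs) {
  var m = /^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/.exec(stamp);
  if (!m) {
    return null;
  }
  // Months are zero-based in JavaScript dates
  var createdMs = Date.UTC(+m[1], +m[2] - 1, +m[3], +m[4], +m[5], +m[6]);
  return Math.round((nowMs - createdMs) / 1000);
}

// True when the link was generated less than 120 seconds ago
function isFreshLink(stamp, nowMs) {
  var age = mgaAgeInSeconds(stamp, nowMs);
  return age !== null && age < 120;
}
```

In the browser, `nowMs` would simply be `new Date().getTime()`, which is already timezone independent.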

GTM Configuration

On the Google Tag Manager side, we’ll need one variable that takes care of grabbing the _mga value, and from there we’ll get the clientId and the link generation time.

Then we’ll check the current user browser’s UTC timestamp to see if this link was generated less than 120 seconds ago, to know whether we should return any value.

Grab the variable code below:

function(){
    // Let's grab our custom linker value from the querystring
    var _mga_match = document.location.search.match(/_mga=([^&]*)/);

    // No _mga parameter: there's nothing to restore
    if (!_mga_match) {
        return;
    }

    // The link generation time is the last dot-separated chunk
    var _mga_linker_value = _mga_match[1].split('.').pop();

    // Let's convert the YYYYMMDDHHMMSS date (UTC) to a timestamp in seconds
    var _mga_timestamp_utc = Math.round(Date.UTC(_mga_linker_value.slice(0, 4), _mga_linker_value.slice(4, 6) - 1, _mga_linker_value.slice(6, 8), _mga_linker_value.slice(8, 10), _mga_linker_value.slice(10, 12), _mga_linker_value.slice(12, 14)) / 1000);

    // This is the current browser UTC time, in seconds
    var _browser_timestamp_utc = Math.round(new Date().getTime() / 1000);

    // This is going to be the total seconds diff, between linker creation time and current user's browser time
    var _linking_offset_in_sec = _browser_timestamp_utc - _mga_timestamp_utc;

    // Let's force the clientId value ONLY if the time difference is less than 2 minutes
    if(_linking_offset_in_sec<120){
        var _client_id = _mga_match[1].match(/GA1\.[0-9]\.([0-9]+\.[0-9]+)/);
        if (_client_id) {
            return _client_id[1];
        }
    }
}

Now we only need to use the value returned by this variable as the “clientId” value on our tracker, this way:

Of course, this doesn’t only apply to Google Analytics but to any other cookie value you want to keep: just modify the code to grab whatever other cookie value you may need.

Google Tag Manager event tracking using data attribute elements

In the last #tip we talked about how to debug/QA our data attributes, and now we’re going to learn how to natively track events/social interactions within Google Tag Manager.

We’ll base our tracking on Google Analytics Events and Social Interactions. Of course this can be extended to any other tool just by changing the data attributes, but hey, this is about learning, not about handing out a copy-and-paste solution.

Let’s start by saying that data-* attributes are standard HTML5 markup that we can use to drive our page functionality, instead of relying on classes or ids.
A data attribute is intended to store values that are meaningful to the page or application and that don’t fit in any other appropriate attribute.

In our case, the data we’re storing is the hit type that we’ll be firing. In our example it could be an “event” or a “social interaction”. For this we’re setting a data attribute named “wa-hittype”, and this attribute will hold the hit to be fired, in our case “event” or “social”.

We’ll be using some other data attributes to define our event’s category, action, label, value and non-interactional switch. Please take a look at the following table for more details:

Data attribute            Description
data-wa-hittype           Type of hit we want to fire on the user's click
data-wa-event-category    The category value for our event
data-wa-event-action      The action value for our event
data-wa-event-label       Optional. The label for our event
data-wa-event-value       Optional. The value for our event, if any
data-wa-event-nonint      Optional. Is the event non-interactional?

Let’s check an example:

<a
 data-wa-hittype="event"
 data-wa-event-category="Ecommerce"
 data-wa-event-action="Add To Cart"
 data-wa-event-label="SKU0001"
 data-wa-event-value="12.00"
 href="#"
>Add To Cart</a>

So we have a data attribute that will allow us to know when to fire a tag based on a CSS selector, and we also have all the info needed to populate our event.

The next step is to configure some variables to read these values when the user clicks on the element.

So now, when the user clicks on some element, we’ll have all the data needed for our event in those new variables. Let’s work on the trigger that will make our tag fire.

We’re using the built-in {{Click Element}} variable and some magic with a CSS selector.

There we are. Now we just need to set up our event tag, add our variables to the tag fields, and set the trigger on this new event tag.

Now every time you need to track a new click on some page element, you’ll just need to ask the developers to add some “standard” data markup to the right element. Even if you do something wrong, the variables will take care of fixing the values where possible (like an event value expecting an integer instead of a string) or setting a proper boolean value for the event’s non-interactional switch.
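The variable configuration is not reproduced in the text, so the following is only a sketch of what the reading and “fixing” logic described above could look like. `readEventData` is a hypothetical helper that takes any element exposing `getAttribute` (in GTM that would be {{Click Element}}), and the exact normalization rules are assumptions.

```javascript
// Reads the data-wa-* attributes from an element and normalizes their values
function readEventData(el) {
  // Returns undefined instead of null for missing attributes
  function attr(name) {
    var value = el.getAttribute('data-wa-' + name);
    return value === null ? undefined : value;
  }
  // Event values must be integers; anything non-numeric is dropped
  var value = parseInt(attr('event-value'), 10);
  // Accept "true" or "1" as a non-interactional flag
  var nonint = String(attr('event-nonint')).toLowerCase();
  return {
    hitType: attr('hittype'),
    category: attr('event-category'),
    action: attr('event-action'),
    label: attr('event-label'),
    value: isNaN(value) ? undefined : value,
    nonInteraction: nonint === 'true' || nonint === '1'
  };
}
```

With the “Add To Cart” link from the example above, the value “12.00” would be sent as the integer 12 and the non-interaction flag would default to false.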

Any suggestion or improvement to this tracking method is welcome 🙂

P.S. Yeah! I know I talked about tracking Social Interactions too, but I’m pretty sure that you’ll be able to figure it out. Think of it as a good moment to learn how to do things instead of just copying and pasting and hoping it will work.

GAUPET Release: Google Analytics User Permissions Explorer Tool

Some months ago I asked some friends to test a new tool I was working on, and last week I released something close to an open alpha. Today, after polishing some details and a UI redesign that is 100% mobile compatible, I’m announcing the GAUPET release.

At first I named it GA Governance Tool, but after an interesting chat with the “osom” Yehoshua Coren, I (we) decided to change the tool’s name to something closer to what it is, and here it is: GAUPET, which stands for Google Analytics User Permissions Explorer Tool. (Yep, you’re right, I didn’t rack my brain on this one.)

You can find it at the following link: GAUPET

This will allow you to easily manage and pivot all your Google Analytics users and permissions in order to have a clear view of your current accounts’ User Governance status.

GAUPET will allow you to gather all your account user emails and permissions and draw them into an interactive pivot table. It will even allow you to merge users from different accounts within the same report (thanks go to Peter O’neill for this and other nice suggestions that will come in the future).

The tool comes with some predefined reports, but you will be able to pivot any data the way you need. Just drag and drop the fields, that’s it!

The included fields are:

  • Email Address
  • Email Domain
  • Access Level
  • Account ID
  • Account Name
  • Account Access Rights
  • Account Permissions
  • Property ID
  • Property Name
  • Property Access Rights
  • Property Permissions
  • View ID
  • View Name
  • View Access Rights
  • View Permissions

Let’s take a look at a sample of the report for users with view access:

I’m offering this tool for free, and I’m hosting it for free, which means that it’s offered “as is”. Still, you’ll have a feedback section on the page to report bugs or ask for new features, and I’ll try to make updates in my free time.

Extra thanks fly to Damion Brown, Ani Lopez, Simo Ahava, Natzir Turrado, Doug Hall and Brian Clifton for their comments and testing. #tip Each of them is worth a follow 🙂

#Tip – How to quickly debug/qa data attributes

Over the years I’ve learned that using CSS selectors to track user actions is really great, but sadly I’ve also learned that it’s really dangerous.

It’s true that we won’t need to ask the IT team to add some dataLayer or ga pushes to the page, therefore saving a lot of precious time, but on the other hand, any single page update or test may break our tracking.

Now I try to use data attributes wherever possible, since those are more likely to be kept through layout updates.

Checking elements for data attributes can be a tedious task, so I’m going to show you a little piece of code that I hope will make your life easier if you base some of your implementations on data attributes.

This little snippet is where the magic happens:

(function() {
    
    var elements = [].slice.call(document.querySelectorAll('*')).filter(function(el) {
        if (typeof (el.dataset) != "undefined")
            return Object.keys(el.dataset).length != 0;
    });
    
    var data = [];
    var i = elements.length;
    
    while (i--) {
        var el = JSON.parse(JSON.stringify(elements[i].dataset));
        data.push(el);
        el["_element_type"] = elements[i].nodeName;
    }
    console.table(data);

})();

As an example I’m going to show you the output for Google Tag Manager‘s Homepage.

This has been a great time saver for me. Hope you find it useful too 🙂

Universal Analytics Plugin Online Hackathon – Dual tracking

I’ve been thinking about doing a Google Analytics related hackathon for a long time. Some months ago, I started looking at how Universal Analytics plugins work, and I decided that coding a plugin to send all the data to a secondary property would be a really nice example.

For years now, I’ve been sharing a lot of code that I’ve worked on, and some tracking ideas too, but I still don’t consider myself a developer. If I must say it, I really think I suck at programming, even if I can do some stuff myself.

So here I am, trying to organize an online Universal Analytics Hackathon. I hope this can turn into a great chance to learn from other people and understand how plugins work!!!

Of course you may be asking what a “Hackathon” is (don’t be shy about asking). Let’s quote Wikipedia:

A hackathon (also known as a hack day, hackfest or codefest) is an event in which computer programmers and others involved in software development and hardware development, including graphic designers, interface designers and project managers, collaborate intensively on software projects. Occasionally, there is a hardware component as well. Hackathons typically last between a day and a week. Some hackathons are intended simply for educational or social purposes, although in many cases the goal is to create usable software. Hackathons tend to have a specific focus, which can include the programming language used, the operating system, an application, an API, or the subject and the demographic group of the programmers. In other cases, there is no restriction on the type of software being created.

GitHub Repository:

https://github.com/thyngster/universal-analytics-dual-tracking-plugin

For now I’ve pushed some “core” code to the repository that “already” works.

How to load the plugin:

ga('create', 'UA-286304-123', 'auto');
ga('require', 'dualtracking', 'http://www.yourdomain.com/js/dualtracking.js', {
    property: 'UA-123123123213-11',
    debug: true,
    transport: 'image'
});
ga('dualtracking:doDualTracking');
ga('send', 'pageview');

Some stuff you need to keep in mind when loading a plugin in Google Analytics:

  • The plugin needs to be hosted within your domain
  • It needs to be “initialized” AFTER the “create” method call and BEFORE the “pageview” method.
  • If for some reason the plugin crashes it may affect your data collection, please don’t use this in production before it has been fully tested.
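For anyone who hasn’t looked at analytics.js plugins before, here is a rough sketch of how a dual tracking plugin could be structured. This is not the code from the repository: `retargetPayload` is a hypothetical helper, the method name just mirrors the loading snippet above, and only the image transport is shown. It relies on the documented analytics.js pieces: `ga('provide', ...)` to register the plugin, and the tracker’s `sendHitTask` and `hitPayload` to get hold of each hit.

```javascript
// Swaps the tid (property ID) inside an already built hit payload
function retargetPayload(payload, propertyId) {
  return payload.replace(/(^|&)tid=[^&]*/, '$1tid=' + encodeURIComponent(propertyId));
}

// Plugin constructor: analytics.js calls it with the tracker and the options object
function DualTracking(tracker, config) {
  this.tracker = tracker;
  this.property = config.property;
}

// Wraps the original sendHitTask so every hit is also sent to the second property
DualTracking.prototype.doDualTracking = function () {
  var property = this.property;
  var originalSendHitTask = this.tracker.get('sendHitTask');
  this.tracker.set('sendHitTask', function (model) {
    // Send the hit to the main property as usual
    originalSendHitTask(model);
    // Then send a copy to the secondary property using an image request
    var payload = retargetPayload(model.get('hitPayload'), property);
    new Image().src = 'https://www.google-analytics.com/collect?' + payload;
  });
};

// Register the plugin once analytics.js is available on the page
if (typeof window !== 'undefined') {
  var gaFn = window[window.GoogleAnalyticsObject || 'ga'];
  if (typeof gaFn === 'function') {
    gaFn('provide', 'dualtracking', DualTracking);
  }
}
```

Wrapping `sendHitTask` keeps the main hit untouched, which is why a crash inside the wrapper is the main risk to the data collection mentioned above.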

Still it needs to be improved, for example:

  1. We don’t want to use global variables
  2. Payload size check, and based on the results send a POST or GET request
  3. Add XHR transport method
  4. Code cleanup/Best practises
  5. Plugin option to send a local copy for the hits
  6. Better debug messages
  7. Name convention improvement
  8. Any other idea?

Anyone is welcome to push code, add ideas, or give testing feedback through the GitHub repository or the comments on this blog post.

Keep your dataLayer integrity safe using Custom JavaScripts in Google Tag Manager

In JavaScript, copying an object into another variable is not as easy as doing var myVar = myObjectVar; and you should be really careful when working with your dataLayer info in your Custom HTML tags and your Custom JavaScript variables.

Let’s try to explain this in the best way I can. When you do that, you’re not copying the current object data to a new variable; instead you’re pointing your new variable to the original object.

What does this mean? That if you change a value in your new variable, that change will be reflected in the original one. Let’s see an example:

var OriginalData = {'a': 1, 'b':2};
var CopiedData = OriginalData;

CopiedData.a = "MEC!";

console.log("Old 'a' Value: ", OriginalData.a);

Before trying it in your browser console, could you please think about what “OriginalData.a” value will be printed to the console? If you’re thinking of a “1” value, I’m sorry to say that you’re wrong: it prints “MEC!”.

You may be wondering why “OriginalData.a” has changed if we only modified the value of our “CopiedData” object.

In programming, data can be passed in 2 ways:

Call-by-value: This means that the data from the original variable will be copied/cloned into the new variable.

Call-by-reference or Pass-by-reference: This means that the new variable will be a pointer/reference to the original variable. So if we want to print CopiedData.a, instead of returning its own value, it will fetch the value from OriginalData.a (where CopiedData.a POINTS TO).

How data is passed is specific to each programming language, but let’s take a look at how JavaScript does it. Basically, any variable type except objects will be passed by value. If we repeat the example above, but use an integer instead of an object, we’ll get a different behaviour.

var OriginalData = 1;
var CopiedData = OriginalData;

CopiedData = "MEC!";

console.log("Original Value: ", OriginalData);
console.log("Copied Value: ", CopiedData);

As you can see if the variable to be “cloned” is not an object, it will be “passed by value“.

So we need to keep in mind that we may be overwriting the original object values. When working with GTM variables, this may mean updating the original dataLayer values.

At the time of writing, there isn’t a widely supported built-in way to do a deep copy of an object in JavaScript. As we’re mostly referring to data, we can just stringify our object and then parse it again (never use eval() for converting).

So when trying to make a copy of some object from the dataLayer (for example, when working on an Enhanced Ecommerce implementation and using variables to feed our hits), I would recommend doing it this way:

var ecommerce = JSON.parse(JSON.stringify({{ecommerce}}));

This will only work for objects not including functions/date values/etc. Just plain data. But for now it will keep our dataLayer integrity safe.
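A quick sketch to see both the benefit and the limitation in action. The `cloneData` helper name and the sample object are made up for the example.

```javascript
// Deep-copies plain data by serializing it to JSON and parsing it back
function cloneData(obj) {
  return JSON.parse(JSON.stringify(obj));
}

var original = {
  products: [{ id: 'SKU0001', price: 12 }],
  createdAt: new Date(0),
  format: function () { return 'not plain data'; }
};

var copy = cloneData(original);

// Changing the copy does not touch the original, even for nested objects
copy.products[0].price = 99;
console.log(original.products[0].price); // still the original price

// Non-data values do not survive the round trip:
// the Date comes back as a string and the function is dropped
console.log(typeof copy.createdAt, typeof copy.format);
```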

Just googling a bit, you’ll find some functions around to make a full deep copy of an object, but we’re just working with data, so we’re not going to cover that at the moment.