
SEO meets GA: Tracking search bot visits with the Measurement Protocol


Lately I’ve been attending (and having) some talks about log parsing from an SEO perspective (one by @David Sottimano at the Untagged Conference, and one with Lino Uruñuela over dinner), so I’ve decided to publish a WordPress plugin that I started working on some years ago. For work reasons it ended up in my “I’ll do it” drawer and I never got back to it.

The first thing I need to point out is that this is a BETA PLUGIN, so please be careful about using it on a high-traffic or production site. I’ve been running it on this site for 4 days without any problems, but that doesn’t mean it’s free of bugs. For now, let’s consider this plugin a proof of concept.

The main task of the plugin is to record search bot visits to our WordPress site in Google Analytics, using the Measurement Protocol.

The plugin’s workflow is simple: it checks whether the current visitor’s User Agent matches any known crawler and, if it does, sends a pageview to a Google Analytics property. Keep in mind that it’s recommended to use a new property, since we’re going to use a lot of custom dimensions to track some extra info besides the visited pages =)
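To make that flow concrete, here is a rough sketch in Python (the plugin itself is written in PHP, so this is only an illustration, not the plugin’s code). The tiny crawler regex, the function names and the choice of custom dimension 1 for the bot name are all assumptions made for the example; the endpoint and the v, t, tid, cid and dl parameters are the standard Measurement Protocol ones.

```python
import re
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical, deliberately tiny crawler list (an assumption for this sketch);
# a real setup would rely on a maintained pattern set such as uap-core.
KNOWN_BOTS = re.compile(r"(Googlebot|bingbot|AhrefsBot|YandexBot|Baiduspider)", re.IGNORECASE)

# Measurement Protocol (Universal Analytics) collection endpoint.
GA_ENDPOINT = "https://www.google-analytics.com/collect"


def detect_bot(user_agent):
    """Return the crawler name found in the User Agent, or None for a regular visitor."""
    match = KNOWN_BOTS.search(user_agent or "")
    return match.group(1) if match else None


def build_bot_hit(property_id, client_id, page_url, user_agent):
    """Build (but do not send) a pageview hit for a bot visit; None if the visitor is not a bot."""
    bot = detect_bot(user_agent)
    if bot is None:
        return None
    payload = {
        "v": "1",            # protocol version
        "t": "pageview",     # hit type
        "tid": property_id,  # Google Analytics Property ID
        "cid": client_id,    # client ID
        "dl": page_url,      # full URL of the crawled page
        "cd1": bot,          # custom dimension index chosen arbitrarily for this sketch
    }
    # The payload travels in the POST body, so the URL is always just the endpoint.
    return Request(GA_ENDPOINT, data=urlencode(payload).encode("utf-8"), method="POST")
```

Passing the returned request to urllib.request.urlopen would actually send the hit; it is left out here so the sketch can be run without touching the network.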

I used to have my own User Agent parser, but I ended up using a well-established (and surely more reliable) library. When something works there’s no need to reinvent the wheel :). So this plugin uses the PHP library for the uap-core project.

Let’s see a simple flow chart of what the plugin does:

I’m sure this was easy enough to understand. But we don’t only want to check which pages were visited by a search bot; we’re going further, and we’ll be tracking the following:

And you’ll surely find answers to a lot more questions: since we’re using Google Analytics to track those visits, we’ll be able to cross any of the dimensions as we need.

Another cool thing about tracking bot crawls with the Measurement Protocol is that we’ll be able to watch how our site is being crawled in the Real-Time reports! 🙂

Setup

You just need to download the plugin zip file (from the GitHub repository linked at the end of this post), drop it in your WordPress plugins folder, and configure the Google Analytics Property ID you want to send your data to.

Used Custom Dimensions

You may be wondering why we have the same bot-related dimensions duplicated with a different scope. This is because, as I explained before, we’re using the bot’s IP address to build a clientID and a userID, and it may happen that Google uses the same IP for different bots (like Desktop or Feature Phone). This way we still have the hit-level info in case the user-scope data gets overridden 🙂
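As an illustration of why two bots crawling from the same IP end up sharing an identity, here is a small Python sketch that derives both IDs from an IP address. The exact derivation the plugin uses isn’t shown in this post, so the name-based UUID for the clientID and the integer form of the address for the userID are assumptions made for the example, as is the function name.

```python
import ipaddress
import uuid


def ids_from_ip(ip):
    """Derive a stable (client_id, user_id) pair from an IP address.

    Assumed scheme for this sketch only: a name-based UUID (version 5)
    as the client ID and the integer value of the address as the user ID.
    """
    addr = ipaddress.ip_address(ip)  # raises ValueError for malformed input
    client_id = str(uuid.uuid5(uuid.NAMESPACE_URL, str(addr)))
    user_id = str(int(addr))
    return client_id, user_id


# Two different crawlers coming from the same (documentation-range) IP address
# get identical IDs, so a user-scoped "bot name" would be overwritten by the
# later hit, while the hit-scoped copy keeps the value for each pageview.
desktop_bot_ids = ids_from_ip("203.0.113.7")
mobile_bot_ids = ids_from_ip("203.0.113.7")
```

Because both IDs depend only on the IP address, the same crawler IP always maps to the same “user”, which is what makes it possible to follow one bot’s crawl across hits.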

Another thing we may want to do is set the session timeout to 4 hours within our Google Analytics session settings. Bots don’t crawl a site the same way a user navigates it, and we may be getting 2 page hits per hour, so the default 30-minute timeout makes no sense at all.

Let’s now see how the reports will look in Google Analytics 🙂

Consumed content by bots with an hourly breakdown

Total sessions and pageviews by search bot

Pages that returned a 404 and which bot was crawling them

Which pages a certain bot crawled (User Explorer report)

You can get the plugin from the following GitHub repository:
https://github.com/thyngster/wp-seo-ga

If you are unable to run the plugin, please drop me a comment on this post or open an issue on GitHub and I’ll try to take a look at it.

Any suggestions/improvements will be very welcome too 🙂

Published in web analytics

14 Comments

  1. I am getting the following warning for a standard WordPress install:

    Warning: require_once(vendor/autoload.php): failed to open stream: No such file or directory in

    • Did you check that all files were uploaded correctly? I tested it on a standalone Debian Apache + PHP setup and on a Plesk 13 configured server, and it worked on both.

      In any case, I’m trying to get more feedback from people using the script on shared hosting environments to make the plugin more fail-proof. I’d appreciate it if you could give more details about your hosting 🙂

  2. Love this. Thanks for sharing it. Could you make your custom dimensions available too? For those of us not quite as skilled at GA 🙂

  3. Hello David, I also get an error when activating the plugin:

    Parse error: syntax error, unexpected '[' in /var/www/web522/html/domain.de/wp-content/plugins/wp-seo-ga-master/wp-seo-ga.php on line 179

    • Olaf, could you please share the PHP version you’re using? I noticed that the plugin may also fail when using some cache plugins (working on that); if that’s the case, could you try disabling it for testing?

    • Am I right in guessing your PHP version is PHP <5.4?
      PHP 5.3 EOL was on 14 Aug 2014, and the last release dates from 2013.
      I recommend upgrading your PHP version 🙂

      Anyway, I pushed an update that should fix your problem.

  4. kope

    How can I use the workaround for a normal website? Without WordPress?

  5. Sam

    Hey,

    Sounds cool but it doesn’t seem to be working for me! I am using shared hosting on WPEngine. May be related to the pretty aggressive caching they do?

    • It’s surely because of that :/ Do you have any warning or error log that we can check to see what’s going on?

      • Sam

        I edited the log slightly to remove the domain, here is the error log I am getting:

        PHP Warning: file_get_contents(?v=1&t=pageview&dl=https%3A%2F%2Fwww.domain.co.uk%2Fproduct%2Fcongratulations-its-a-boy%2F&ul=&de=UTF-8&dt=Congratulations+It%26%23039%3Bs+A+Boy+%7C&cid=3613367a-fd8-817-664-3d70f8083c060ff&uid=2760155460&tid=UA-86863572-2&ds=wp-seo-ga&a=9056766&z=6790272&cd1=AhrefsBot&cd2=Spider&cd3=Desktop&cd4=Mozilla%2F5.0+%28compatible%3B+AhrefsBot%2F5.2%3B+%2Bhttp%3A%2F%2Fahrefs.com%2Frobot%2F%29&cd5=200&cd6=3613367a-fd8-817-664-3d70f8083c060ff&cd7=2760155460&cd8=70&cd9=1.852&cd10=11&cd11=AhrefsBot&cd12=Spider&cd13=Desktop&cd14=https&cd15=HTTP%2F1.0&cd16=ahrefs.com): failed to open stream: File name too long in /nas/content/live/[site]/wp-content/plugins/wp-seo-ga-master/wp-seo-ga.php on line 173

        Looks like something is wrong with this bit of code:
        $file = file_get_contents($hitPayload, false, $context);

          • Sam

            That fixed it, thanks!

            Also, just a suggestion. I much prefer when things like this are hidden away in the admin area as a sub-menu of the settings menu. For a plugin that would rarely get revisited once set up, it seems strange for it to have its own section in the main menu.

            Thanks again!

  6. I was not aware of this, David. On the other hand, I did something similar on a non-WP site to detect bots/crawlers. Nice approach.
