1ST JUNE 2020 / WEB ANALYTICS
Tracking the user's IP Autonomous System Number and Organization details to prevent spam
Around the end of 2019, Google Analytics dropped support for the Network Domain and Service Provider dimensions from its reports, making an official announcement about it in February.
These two dimensions were widely used to fight spam in Google Analytics, and there have been a lot of posts on this topic in the last few months. Simo Ahava wrote about how to collect the ISP data using a third-party service, in case you want to check it.
In this post we'll learn what an Autonomous System is and how we could use this info to try to fight spam. And the coolest part is that we'll be able to use a free database for this. Continue reading :)
There are some other services and commercial databases that will provide these details but, let's be honest, they come with some big handicaps:
- If you use a free service, you will hit the quota limit quickly
- If you have a high-traffic website, this is not going to be cheap
There are basically three different types of subscriptions: SaaS (they host the app and the database), DB (you host the database and the query system) and WebService.
I'm attaching a list of some of the providers available, in case you want to check them.
In any case, there are a lot of posts about this topic on the web, and I'm trying to give this issue a new solution.
MaxMind provides its GeoLite databases for free. These databases are updated weekly (on Tuesdays, to be exact) and they provide info about:
The main difference between these databases and the paid ones is how accurate they are and how often they get updated. This accuracy may be a problem when we need to target users based on their city, but that is not what we're after this time: we'll be looking at the ASN database.
In case you're wondering, ASN stands for Autonomous System Number. According to Wikipedia:
"An autonomous system (AS) is a collection of connected Internet Protocol (IP) routing prefixes under the control of one or more network operators on behalf of a single administrative entity or domain that presents a common, clearly defined routing policy to the internet."
Source: https://en.wikipedia.org/wiki/Autonomous_system_(Internet)
Put simply, you can think of Autonomous Systems as "big" routers at ISPs and datacenters that are in charge of announcing the IP addresses they hold, in order to let other ASes know how to reach those IP addresses (sorry for this inaccurate description, I'm trying to keep this simple).
Each ISP usually has its own ASN (it can have more than one). For example, one of Google's main ASNs is AS15169, registered to Google LLC, and this Autonomous System manages 9.5 million of Google's IPs.
This means that we can query any IP address and the ASN database will return the ASN it currently belongs to.
For example we may query Google DNS's IP address: 18.104.22.168 and the database will return the AS number and the organization name:
[autonomous_system_number] => 15169
[autonomous_system_organization] => GOOGLE
As another example, let's query this Fastly CDN IP address: 22.214.171.124
[autonomous_system_number] => 54113
[autonomous_system_organization] => FASTLY
Or let's query an IP from a dedicated server provider like Liquid Web:
[autonomous_system_number] => 32244
[autonomous_system_organization] => LIQUIDWEB
We could use the AS number and the organization name as a way to try to catch spam, since most spam traffic is likely to come from co-location / VPN providers that we could identify this way.
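A minimal sketch of that idea in PHP could look like the following. The function name `classify_asn`, the labels and the watchlist itself are hypothetical: the two AS numbers are simply the ones from the lookups above, used for illustration, not a recommendation of what to flag.

```php
<?php
// Hypothetical watchlist: AS numbers we want to flag, mapped to a label.
// These entries are illustrative only (taken from the example lookups).
const ASN_WATCHLIST = [
    32244 => 'hosting',
    54113 => 'cdn',
];

/**
 * Returns the watchlist label for a lookup record, or 'unlisted' when the
 * AS number is missing or not on the watchlist.
 * $record is assumed to be shaped like the lookup output shown above.
 */
function classify_asn(?array $record): string
{
    $asn = $record['autonomous_system_number'] ?? null;
    if (!is_int($asn)) {
        return 'unlisted';
    }
    return ASN_WATCHLIST[$asn] ?? 'unlisted';
}

echo classify_asn(['autonomous_system_number' => 32244]), "\n"; // hosting
echo classify_asn(['autonomous_system_number' => 15169]), "\n"; // unlisted
```

The returned label is the kind of low-cardinality value that is easy to filter or segment on later.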
Since it's a database, we'll need to set up a small endpoint on our domain in order to be able to query it. This implies some IT development but, on the other hand, it has some big wins:
- There will be NO query limits.
- The only cost of running this solution is the cost of developing the endpoint.
- We could have our website developer query this info server-side and push the data to the dataLayer, instead of needing an extra XHR request and having to delay the hits, YAY!
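The server-side push from the last point could be sketched in PHP (the language used for the endpoint later in this post) roughly as follows. The function name `render_datalayer_push` and the keys `asn` and `asOrganization` are assumptions made for illustration; use whatever names your tag manager setup expects.

```php
<?php
/**
 * Renders a <script> tag that pushes server-side data into the dataLayer.
 * The key names chosen by the caller are hypothetical.
 */
function render_datalayer_push(array $data): string
{
    // The JSON_HEX_* flags escape <, >, &, ' and " inside values,
    // so a value cannot close the surrounding <script> tag.
    $json = json_encode($data, JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT);
    if ($json === false) {
        $json = '{}'; // fall back to an empty object on encoding errors
    }
    return '<script>window.dataLayer = window.dataLayer || [];'
        . 'window.dataLayer.push(' . $json . ');</script>';
}

echo render_datalayer_push(['asn' => 15169, 'asOrganization' => 'GOOGLE']), "\n";
```

Printing this in the page template before the tag manager snippet makes the values available without any extra request from the browser.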
Now, on the other side of the road, there are some handicaps:
- The data is not as accurate as the network domain data in other databases
- Data freshness and accuracy won't be premium but, as we all know, GA's wasn't either.
Getting the ASN DB
As I've mentioned above, the GeoLite ASN database is free and you'll be able to get it after signing up for a free account at: https://dev.maxmind.com/geoip/geoip2/geolite2/
Another good point is that MaxMind already provides libraries for Perl and other languages to help with reading and querying their GeoLite databases, which helps with setting up our endpoint.
As usual, I'm providing an example for PHP, since it's one of the most widely used server-side languages and it's available on almost any hosting around the world.
If we don't have composer installed yet, that's gonna be our first step:
curl -sS https://getcomposer.org/installer | php
Next, we'll install the needed dependencies:
php composer.phar require geoip2/geoip2:~2.0
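require 'vendor/autoload.php';
use MaxMind\Db\Reader; // low-level reader, pulled in as a dependency of geoip2/geoip2

$ip_as_details = new Reader('geo/GeoLite2-ASN.mmdb');
$asn_details = $ip_as_details->get('126.96.36.199'); // array with the ASN record for this IP
// At this point we could build a JSON response and send it back to the browser.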
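Building that JSON could be sketched as below. This is only an illustration under some assumptions: the helper name `build_asn_response` and the output keys `asn` and `organization` are made up for this example, and the input is assumed to be the array returned by the lookup (or null when nothing was found).

```php
<?php
/**
 * Builds the JSON body for the endpoint from a lookup result.
 * $record is assumed to be the array returned by the reader,
 * or null when the IP is not in the database.
 * The output key names are hypothetical.
 */
function build_asn_response(?array $record): string
{
    $payload = [
        'asn'          => $record['autonomous_system_number'] ?? null,
        'organization' => $record['autonomous_system_organization'] ?? null,
    ];
    // Fall back to an empty object if encoding fails (e.g. invalid UTF-8).
    return json_encode($payload) ?: '{}';
}

// In the endpoint this would follow the lookup:
//   header('Content-Type: application/json');
//   echo build_asn_response($asn_details);
echo build_asn_response([
    'autonomous_system_number'       => 15169,
    'autonomous_system_organization' => 'GOOGLE',
]), "\n";
// {"asn":15169,"organization":"GOOGLE"}
```

Returning only these two fields keeps the response small and avoids echoing the visitor's IP address back in the payload.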
The last step will be passing this info back to Google Analytics using a custom dimension, so we can use it in our filters or segments.
Extra - Grabbing the network domain
I was about to publish the post when I decided to add a little extra: let's also learn how to track the "network domain".
Google Analytics was using the IP's PTR for the "network domain". Again, you may wonder what "PTR" is: it stands for "Pointer record", and it basically resolves an IP to an FQDN (fully qualified domain name). It's the inverse of an A DNS record.
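For IPv4, the name that gets queried for a PTR record is the IP with its octets reversed, followed by `.in-addr.arpa`, which is why those names show up in reverse lookup output. A small sketch of that transformation follows; the function name `ipv4_to_ptr_name` is made up for this example, and `192.0.2.1` is a documentation-only address.

```php
<?php
/**
 * Builds the reverse-lookup name for an IPv4 address: the octets are
 * reversed and ".in-addr.arpa" is appended. Returns null for anything
 * that is not a valid IPv4 address (IPv6 uses ip6.arpa and is not handled here).
 */
function ipv4_to_ptr_name(string $ip): ?string
{
    if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
        return null;
    }
    return implode('.', array_reverse(explode('.', $ip))) . '.in-addr.arpa';
}

echo ipv4_to_ptr_name('192.0.2.1'), "\n"; // 1.2.0.192.in-addr.arpa
```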
For example, we can do a reverse IP lookup on Google DNS's IP address and it will return "dns.google".
> set q=ptr
188.8.131.52.in-addr.arpa name = dns.google.
Or we may try with a Googlebot IP address, which most SEOs must be familiar with:
> set q=ptr
1.66.249.66.in-addr.arpa name = crawl-66-249-66-1.googlebot.com
As a last example, let's query the google.com IP address:
> set q=a
> set q=ptr
220.127.116.11.in-addr.arpa name = mad07s09-in-f14.1e100.net
If we want to have the network domain info back in our GA reports, we'll just need to parse the PTR hostname to grab only the root domain. In this last case it would be: 1e100.net.
I wouldn't advise tracking the full PTR hostname, for two reasons. First, most hostnames are a mix of the IP address plus the ISP domain, which would go against the GDPR (we cannot record the user's IP address). Second, it would create high cardinality, which won't help when analyzing the data.
Now, remember that we were building an endpoint in PHP to get the ASN details? Just a few more lines of code will allow us to have the network domain pushed into our dataLayer! :)
$ip_ptr = gethostbyaddr('18.104.22.168');
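One detail worth handling: `gethostbyaddr()` returns the IP address unchanged when the lookup fails, and `false` on malformed input, so the raw return value could leak an IP into our reports. A hedged sketch of a wrapper that guards against both cases (the name `lookup_ptr` is hypothetical):

```php
<?php
/**
 * Returns the PTR hostname for an IP, or null when the input is not a
 * valid IP address or no PTR record could be resolved.
 */
function lookup_ptr(string $ip): ?string
{
    if (filter_var($ip, FILTER_VALIDATE_IP) === false) {
        return null; // avoid passing malformed input to gethostbyaddr()
    }
    $host = gethostbyaddr($ip);
    // gethostbyaddr() returns the IP unchanged when the lookup fails.
    if ($host === false || $host === $ip) {
        return null;
    }
    // Normalize: lowercase and drop a trailing dot, if any.
    return rtrim(strtolower($host), '.');
}

var_dump(lookup_ptr('not-an-ip')); // NULL (rejected before any DNS query is made)
```

A null result can then simply be skipped, or sent as a fixed placeholder value, instead of the IP itself.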
Getting the root domain can be a painful task, due to all the new TLDs and the need to keep third-level suffixes in mind. In case you want to have this done easily, you can use the following PHP library, https://github.com/utopia-php/domains , which will let you grab the "registrable" domain name from a hostname:
require 'vendor/autoload.php';
use Utopia\Domains\Domain;

$domain = new Domain('demo.example.co.uk');
$domain->get(); // demo.example.co.uk
$domain->getTLD(); // uk
$domain->getSuffix(); // co.uk
$domain->getRegisterable(); // example.co.uk
$domain->getName(); // example
$domain->getSub(); // demo
$domain->isKnown(); // true
$domain->isICANN(); // true
$domain->isPrivate(); // false
$domain->isTest(); // false
I'm providing the example in PHP, but that doesn't mean you have to use it: this code/idea can be developed in almost any server-side language you may be using. As a last resort, you can run a small VM or VPS to have a PHP environment where you can host your endpoint :).