Securing Your DataLayer: Defending Against Pollution by External Tools and Bots to Preserve Data Integrity

As expected, over the last few months many vendors and providers have started adding Google Analytics 4 integrations. Many of them simply push events to the GTAG wrapper function, so you'll likely end up with some unwanted events in your reports.

It's not only about vendors: spammers also have an easy way to programmatically mess with your data, just by using the global objects.

We'll learn some implementation tricks to prevent other tools from polluting our GA4 data, and also how to ensure that nobody but us sends data to our dataLayer. As usual I'll be using examples for Google Tag Manager and Google Analytics 4, but the same logic can be applied to any other tool.

Protecting GTAG from Bots and Vendors pollution

To protect our setup from unsolicited events or pushes, we'll slightly modify our GTAG calls. The first modification is adding a guard check to the GTAG wrapper function.

<script async src="https://www.googletagmanager.com/gtag/js?id=G-THYNGSTER"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag() {
	// Guard: if the first argument is not our protection token, skip the call
	if (arguments[0] !== 'protectToken') return;

	// gtag expects an Arguments object in the dataLayer, so we push through a helper function
	function passArgumentsBack() {
		dataLayer.push(arguments)
	}
	// Remove the first argument (the token) and pass the rest back
	passArgumentsBack.apply(this, Array.prototype.slice.call(arguments, 1));
}
gtag('protectToken', 'js', new Date());
gtag('protectToken', 'config', 'G-THYNGSTER');
</script>

Now any gtag call whose first argument is not our "protectToken" will be blocked, so any vendor or bot that tries to push data to our namespace will be silently ignored.
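To see the guard in action outside a real page, here is a minimal sketch that uses a plain array as a stand-in for window.dataLayer (an assumption made only so the snippet runs anywhere). The event names are made up for illustration.

```javascript
// Stand-in for window.dataLayer so the sketch also runs outside a browser
var dataLayer = [];

function gtag() {
	// Drop any call whose first argument is not our token
	if (arguments[0] !== 'protectToken') return;
	function passArgumentsBack() {
		dataLayer.push(arguments);
	}
	passArgumentsBack.apply(this, Array.prototype.slice.call(arguments, 1));
}

// A vendor or bot calling gtag without the token: ignored
gtag('event', 'spam_event', { value: 1 });

// Our own call, carrying the token: forwarded without the token
gtag('protectToken', 'event', 'add_to_wishlist', { value: 10 });

console.log(dataLayer.length); // 1
console.log(dataLayer[0][1]);  // 'add_to_wishlist'
```

Only the second call reaches the array, and the token itself is stripped before the push.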

It may happen that you cannot modify gtag where it is created, maybe because it's hardcoded on the page or because someone else has already initialized it. Don't worry, you can run this code to override the current function.

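// Only override when a gtag function already exists on the page
if (window.gtag && {}.toString.call(window.gtag) === '[object Function]') {
	// Assign explicitly to window so the override applies whatever scope this code runs in
	window.gtag = function() {
		if (arguments[0] !== 'protectToken') return;
		function passArgumentsBack() {
			window.dataLayer.push(arguments)
		}
		passArgumentsBack.apply(this, Array.prototype.slice.call(arguments, 1));
	}
}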

Remember that from now on you need to pass your protectToken as the first argument of any call you want to push.

gtag('protectToken', ... )
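If repeating the token by hand feels error-prone, one option is a small helper that pre-fills it. This is only a sketch: `track` is a hypothetical name, and the plain `dataLayer` array is a stand-in for the real one. Keep such a helper in a private scope, otherwise vendors could simply call it too.

```javascript
var dataLayer = []; // stand-in for window.dataLayer

function gtag() {
	if (arguments[0] !== 'protectToken') return;
	function passArgumentsBack() {
		dataLayer.push(arguments);
	}
	passArgumentsBack.apply(this, Array.prototype.slice.call(arguments, 1));
}

// Hypothetical helper: the token is bound as the first argument,
// so callers cannot forget or mistype it
var track = gtag.bind(null, 'protectToken');

track('event', 'add_to_wishlist', { value: 10 });

console.log(dataLayer.length); // 1
```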

Protecting our dataLayer from Bots and Vendors pollution

I bet you may have already thought about just adding a custom event to all your pushes and then adding a blocking condition in GTM, and that's indeed a great idea. But this time we're not trying to block events being pushed; we're trying to keep our dataLayer from being polluted and messed up. We really want a clean and nice-looking dataLayer that is fully governed by us. If anyone wants to join the game, they should ask us first :).

Protecting the dataLayer is a bit more complicated (at least the Google Tag Manager one), because when GTM loads it replaces the array's original push method. If we mess around with it we may lose the reactivity or cause some other malfunction. For the same reason, we cannot add the modification when the dataLayer is initialized, because it will be lost when GTM overrides push.

The thing we need to do here is wait until dataLayer.push has been fully initialized and then add some method to intercept the calls being made to it.

In this example I'll be using a simple proxy pattern, but there are more proper (and at the same time more difficult to implement) workarounds, like working with setters and getters or using an ES6 Proxy. In any case this method is pretty straightforward and has very good cross-browser support.

I tried to focus on having understandable code rather than cool-looking code. We'll use a Promise-based solution to poll the dataLayer.push method until we detect that it has been initialized by Google Tag Manager, and then we'll add our proxy.
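For reference, this is roughly what the ES6 Proxy alternative could look like. It is a minimal sketch on a plain array: it does not involve GTM at all, and whether GTM tolerates a Proxy in place of its array is an open assumption you would need to test yourself. The names `rawDataLayer` and `guardedDataLayer` are hypothetical.

```javascript
// Plain array standing in for a dataLayer (no GTM involved in this sketch)
var rawDataLayer = [];

var guardedDataLayer = new Proxy(rawDataLayer, {
	get: function(target, prop, receiver) {
		// Everything except push behaves like the underlying array
		if (prop !== 'push') return Reflect.get(target, prop, receiver);
		// Hand out a guarded push instead of the native one
		return function(item) {
			if (!item || item.ptoken !== 'thyngster') return target.length;
			delete item.ptoken;
			return target.push(item);
		};
	}
});

guardedDataLayer.push({ event: 'evilVendor' });                           // dropped
guardedDataLayer.push({ event: 'add_to_wishlist', ptoken: 'thyngster' }); // kept

console.log(rawDataLayer.length); // 1
```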

<script>
(function() {
	var settings = {
		dataLayerName: 'dataLayer',
		pollingTime: 25,
		limit: 1000,
		protectKey: 'ptoken',
		protectValue: 'thyngster'
	}

	var waitForDataLayerReady = function(settings) {
		var count = 1;

		function waitFor(result) {
			if (result) {
				var proxiedDataLayerPush = window[settings.dataLayerName].push;
				window[settings.dataLayerName].push = function() {
					var data = arguments[0];
					// Allow pushes carrying our token, plus GTM's own system events (gtm.js, gtm.dom, ...)
					var hasToken = !!data && data[settings.protectKey] === settings.protectValue;
					var isGtmEvent = !!data && !!data.event && /^gtm\./.test(String(data.event));
					if (hasToken || isGtmEvent) {
						// Strip the token so it never shows up in the dataLayer
						if (data[settings.protectKey]) delete data[settings.protectKey];
						return proxiedDataLayerPush.apply(this, arguments);
					}
					// Anything else is silently dropped
				}
				return settings.dataLayerName
			}
			if (count >= settings.limit) {
				return null;
			}
			count++;
			return new Promise(function(resolve) {
				setTimeout(resolve, settings.pollingTime || 1000)
			}).then(function() {
				// GTM's own push is recognized by this marker in its source code
				var dl = window[settings.dataLayerName || 'dataLayer'];
				return !!(dl && dl.push && dl.push.toString().includes('SANDBOXED_JS_SEMAPHORE'));
			}).then(function(res) {
				// Return the next step so the chain resolves only when polling ends
				return waitFor(res);
			});
		}
		return waitFor();
	}

	waitForDataLayerReady(settings).then(function(result) {
		// result is the dataLayer name once the proxy is in place, or null on timeout
		if (!result && window.console) {
			console.warn('dataLayer protection not installed: GTM push was not detected in time');
		}
	});
})()
</script>

dataLayerName	This is our dataLayer variable name, normally `dataLayer`
pollingTime	The polling interval in milliseconds; the example checks every 25 ms (the code falls back to 1000 ms if it is not set)
limit	We don't really want to wait forever: limit * pollingTime is how long the watcher runs before it stops. In seconds, the total time the code will keep waiting for the dataLayer is secs = (limit * pollingTime) / 1000
protectKey	This is the key we need to add to our pushes; if it's not present the push won't go through
protectValue	And this is the expected protect token value

Settings Parameters Definition
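As a quick sanity check of the formula, here is the calculation for the values used in the example. Treat the result as approximate, since browser timers are not exact.

```javascript
var settings = { pollingTime: 25, limit: 1000 };

// secs = (limit * pollingTime) / 1000
var maxWaitSeconds = (settings.limit * settings.pollingTime) / 1000;

console.log(maxWaitSeconds); // 25
```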

On the other side, our pushes must contain the protect key and token so they are allowed into the dataLayer.

If you check the code carefully, you'll see we added a special rule that lets all events starting with gtm. (matched by /^gtm\./) skip the check, so GTM's own system pushes keep going into the dataLayer.

So now, if someone runs the first push below, it will be intercepted and will never reach our dataLayer, while the second one, which carries our token, goes through.

window.dataLayer.push({
    event: 'evilVendor',
    opted_in_groups: '1,2,3,4'
})

window.dataLayer.push({
    event: 'add_to_wishlist',
    ptoken: 'thyngster'
})
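The behaviour can be checked without GTM using a small simulation. This is only a sketch: the first `dataLayer.push` override is a stand-in for whatever GTM installs, and the guard is applied directly instead of after polling.

```javascript
var settings = { protectKey: 'ptoken', protectValue: 'thyngster' };

// Stand-in for a dataLayer whose push was already replaced (no real GTM here)
var dataLayer = [];
dataLayer.push = function(item) {
	return Array.prototype.push.call(this, item);
};

// Same guard idea as the snippet above
var proxiedPush = dataLayer.push;
dataLayer.push = function() {
	var data = arguments[0];
	var hasToken = !!data && data[settings.protectKey] === settings.protectValue;
	var isGtmEvent = !!data && /^gtm\./.test(String(data.event));
	if (!hasToken && !isGtmEvent) return; // dropped
	if (hasToken) delete data[settings.protectKey];
	return proxiedPush.apply(this, arguments);
};

dataLayer.push({ event: 'evilVendor', opted_in_groups: '1,2,3,4' });
dataLayer.push({ event: 'add_to_wishlist', ptoken: 'thyngster' });
dataLayer.push({ event: 'gtm.click' });

var events = dataLayer.map(function(d) { return d.event; });
console.log(events); // [ 'add_to_wishlist', 'gtm.click' ]
```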

From this point on, the logic can be extended as much as you want. For example, you may want to define a whitelist of allowed events rather than working with a token; it's up to your imagination.

This proxy pattern can be extended to almost any tool, meaning you could apply this concept to any other vendor or TMS. Please keep in mind that this is not trivial to add, so my advice is to rely on your dev team, or on an agency or contractor that can take proper care of implementing this kind of solution.
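A whitelist variant could look something like the sketch below. The `allowedEvents` list is hypothetical, and a plain array again stands in for the real dataLayer. In a real setup you would wrap the push only after GTM has initialized it, as in the polling code above.

```javascript
// Hypothetical list of event names we accept
var allowedEvents = ['add_to_wishlist', 'purchase'];

var dataLayer = []; // stand-in for window.dataLayer
var originalPush = dataLayer.push;

dataLayer.push = function() {
	var data = arguments[0];
	var eventName = data && data.event ? String(data.event) : '';
	// GTM system events always pass; everything else must be on the list
	var allowed = /^gtm\./.test(eventName) || allowedEvents.indexOf(eventName) !== -1;
	if (!allowed) return;
	return originalPush.apply(this, arguments);
};

dataLayer.push({ event: 'evilVendor' });      // dropped
dataLayer.push({ event: 'purchase' });        // kept
dataLayer.push({ event: 'gtm.dom' });         // kept
dataLayer.push({ some_key: 'no event name' }); // dropped

console.log(dataLayer.length); // 2
```

Note that this variant also drops pushes that have no event name at all, which may or may not be what you want.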