Sending personally identifiable information (PII) to Google Analytics is one of the things you should really avoid doing. For one, it’s against the terms of service of the platform, but also you will most likely be in violation of national, federal, or EU legislation drafted to protect the privacy of individuals online.
In this #GTMTips post, I’ll show you a way to make sure that any tags you configure with this solution will not contain strings that might be construed as PII. The tip is for Google Tag Manager, but with very few modifications it will work with Universal Analytics, too.
(UPDATE 8 September 2017: Check out Brian Clifton’s great extension of this solution: Remove PII from Google Analytics)
Tip 64: Remove PII from hits to Google Analytics
The solution hinges around customTask, which has fast become my favorite new feature in the analytics.js library. See the following articles to understand why I think so:
Anyway, to make the whole thing run, create the following Custom JavaScript variable:
function() {
  return function(model) {
    // Add the PII patterns into this array as objects
    var piiRegex = [{
      name: 'EMAIL',
      regex: /.{4}@.{4}/g
    },{
      name: 'HETU',
      regex: /\d{6}[A+-]\d{3}[0-9A-FHJ-NPR-Y]/gi
    }];

    var globalSendTaskName = '_' + model.get('trackingId') + '_sendHitTask';

    // Fetch reference to the original sendHitTask
    var originalSendTask = window[globalSendTaskName] = window[globalSendTaskName] || model.get('sendHitTask');

    var i, hitPayload, parts, val;

    // Overwrite sendHitTask with PII purger
    model.set('sendHitTask', function(sendModel) {
      hitPayload = sendModel.get('hitPayload').split('&');

      for (i = 0; i < hitPayload.length; i++) {
        parts = hitPayload[i].split('=');

        // Double-decode, to account for web server encode + analytics.js encode
        try {
          val = decodeURIComponent(decodeURIComponent(parts[1]));
        } catch(e) {
          val = decodeURIComponent(parts[1]);
        }

        piiRegex.forEach(function(pii) {
          val = val.replace(pii.regex, '[REDACTED ' + pii.name + ']');
        });

        parts[1] = encodeURIComponent(val);
        hitPayload[i] = parts.join('=');
      }

      sendModel.set('hitPayload', hitPayload.join('&'), true);
      originalSendTask(sendModel);
    });
  };
}
Once you add this variable to your Universal Analytics tags as the customTask field, any hits sent by these tags will be parsed by this variable, which replaces the instances of PII with the string [REDACTED pii_type].
At the beginning of the code snippet, you’ll see the configuration object piiRegex. It’s an array of object literals, where each object has two properties: name and regex. The first is what will be used in the replace string after “REDACTED”. So if name is “EMAIL”, you’ll see “[REDACTED EMAIL]” in your Google Analytics reports wherever PII data was removed.
The second property, regex, is where you’ll add the regular expression literal for whatever PII pattern you want to redact. In the example above, I have two patterns:
- /.{4}@.{4}/g - this matches every @ symbol together with the four characters before and the four characters after it. So if ANY part of the payload (URL, Custom Dimension, Event Label, etc.) has the @ symbol with at least four characters on either side, that part of the string will be obfuscated. Thus, simo.s.ahava@gmail.com becomes simo.s.a[REDACTED EMAIL]l.com.
- /\d{6}[A+-]\d{3}[0-9A-FHJ-NPR-Y]/gi - this is a reasonably good approximation of the Finnish personal identity code. It’s not perfect, because the final character of the code is a check character calculated from the preceding digits, so you can’t use simple pattern matches to only find valid codes. This regular expression will probably result in many false positives, especially if your GA hits include UUIDs or any type of alphanumeric hashes. But it’s still better than collecting this sensitive data.
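If the false positives from the second pattern turn out to be a problem, one option is to verify the check character before redacting. The sketch below is not part of the original solution. It assumes the check character is the remainder of the nine digits (birth date plus individual number) divided by 31, looked up in the alphabet shown in the code. The helper names hasValidHetuChecksum and redactValidHetu are made up for illustration.

```javascript
// The regex from the configuration finds candidates by shape only
var hetuRegex = /\d{6}[A+-]\d{3}[0-9A-FHJ-NPR-Y]/gi;

// Check characters in remainder order 0..30 (assumption: the published alphabet)
var CHECK_CHARS = '0123456789ABCDEFHJKLMNPRSTUVWXY';

// Hypothetical helper: true when the last character matches the computed checksum
function hasValidHetuChecksum(candidate) {
  // Birth date (6 digits) + individual number (3 digits), skipping the separator
  var digits = candidate.slice(0, 6) + candidate.slice(7, 10);
  var expected = CHECK_CHARS.charAt(parseInt(digits, 10) % 31);
  return candidate.charAt(10).toUpperCase() === expected;
}

// Hypothetical helper: redact only the candidates that pass the checksum
function redactValidHetu(text) {
  return text.replace(hetuRegex, function(match) {
    return hasValidHetuChecksum(match) ? '[REDACTED HETU]' : match;
  });
}

console.log(redactValidHetu('id=131052-308T')); // id=[REDACTED HETU]
console.log(redactValidHetu('id=131052-308A')); // id=131052-308A
```

The trade-off is that a mistyped identity code, which might still point to a real person, would slip through, and the date portion is not validated at all. The plain regular expression is the more cautious choice.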
You can add your own regular expression patterns as new objects of the array.
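As an illustration, here is what the array might look like with a third pattern added. The SSN entry is a hypothetical example of my own (strings shaped like a US Social Security number), not something from the original configuration, so adjust or replace it to suit your data.

```javascript
var piiRegex = [{
  name: 'EMAIL',
  regex: /.{4}@.{4}/g
},{
  name: 'HETU',
  regex: /\d{6}[A+-]\d{3}[0-9A-FHJ-NPR-Y]/gi
},{
  // Hypothetical addition: three digits, two digits, four digits, separated by hyphens
  name: 'SSN',
  regex: /\b\d{3}-\d{2}-\d{4}\b/g
}];

// Run every pattern against a sample value, the same way the customTask does
var result = 'Customer SSN: 123-45-6789';
piiRegex.forEach(function(pii) {
  result = result.replace(pii.regex, '[REDACTED ' + pii.name + ']');
});

console.log(result); // Customer SSN: [REDACTED SSN]
```

Keep the g flag on any pattern you add, so that every occurrence in a value is replaced and not just the first one.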
When you add this variable into the customTask field of a Universal Analytics tag, the code will run through the entire payload, looking for matches to the regular expressions you provide in the configuration array. If any matches are made, they are redacted.
Do you have other, useful regular expressions for finding and weeding out personally identifiable information?