The Data Layer

An overview of the Data Layer in Google Tag Manager, and a detailed description of its use and application in websites.

Writing this article is dangerous. Data Layer is two marketers short of becoming a buzzword. This occasion will be heralded by articles such as “Data Layer Is Dead”, “This Developer Implemented A Data Layer And You’ll Never Guess What Happened Next”, and other examples of the kind of content generation whose propagation should be prevented by military force. This is not one of those articles, I hope, but rather an honest look at what Data Layer is from a number of perspectives.

And there are many perspectives, indeed. The terminology itself is difficult to pin down. In this article, I will consider Data Layer to comprise the following definitions:

  • The description of business requirements and goals, aligned in a format that is readily transferable to technical specifications

  • The concept of a discrete layer of semantic information, stored in a digital context

I will also use the variable name dataLayer to denote the data structure used by Google Tag Manager for storing, processing, and passing data between the digital context and the tag management solution. I also prefer the term digital context to website, for example, since the Data Layer can be used in a variety of contexts, not just a public-facing web environment.

The Data Layer most explored in this article is the one that is firmly rooted in the DMZ between developers and marketers. It’s very much a technical concept, since its existence is justified by the limitations imposed by certain web technologies (JavaScript, for example) upon how browsers interact with applications (Google Tag Manager, for example). At the same time, Data Layer is, and ought to be, owned at least partly by marketers, analysts, executives, designers, and communication professionals, who draft the business requirements and goals that are satisfied by data collection methods.

In other words, it’s very common that the governance of Data Layer is debated hotly among different stakeholders of the “data organization” within a company. Since, as we will learn, it’s a generic data model that can be used by all applications that interface with your digital data, it’s very difficult to draft a governance model that would satisfy all parties. This, too, we’ll explore in this post.

In the end I’ll share some great resources for learning more about Data Layer, since this post will not be a deep-dive (even though it is wordy).

What Is The Data Layer

Put briefly, a Data Layer is a data structure which ideally holds all the data that you want to process and pass from your website (or other digital context) to the other applications you have linked to it.

We use a Data Layer because it is sometimes necessary to decouple semantic information from other information stored in the digital context. If we instead reuse information that is already available elsewhere on the page, there’s a risk that a modification to that original source will compromise the integrity of the data.

A very common example is web analytics tracking. You might have a Data Layer which feeds data into your analytics tool about the visitor. Often, this data isn’t available in the presentational layer, or in the markup at all. This data might be, for example, details about the visitor (login status, user ID, geolocation), metadata about the page (optimal resolution, image copyrights), or even information that’s already in the markup, but that you want to access in a more robust way.

This duplication is often seen in eCommerce data. Instead of “scraping” transactional details from the header or content of the page, it’s more reliable to use the Data Layer to carry this information, since only this way is the data uncoupled from the website proper, meaning it is less subject to errors when markup is modified.

If, for example, you were inclined to use data stored in an H2 heading of the HTML markup in the thank you page, a single change to the markup or the format of the information in this HTML element would compromise data collection from the site to your tracking tool. If, however, the data were stored in a Data Layer with no link to the presentational layer, there is a far smaller risk of unexpected changes occurring (though it’s definitely not impossible).
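To make the difference concrete, here is a minimal sketch of both approaches. The markup strings, the function names, and the orderTotal key are all made up for illustration, and the markup is handled as plain strings so that the sketch runs outside a browser as well.

```javascript
// Hypothetical thank you page heading, before and after a small copy change.
const markupV1 = '<h2>Order total: 449.75</h2>';
const markupV2 = '<h2>Thanks! You paid 449.75 EUR</h2>';

// Scraping: depends on the exact wording and structure of the heading.
function scrapeOrderTotal(markup) {
  const match = markup.match(/<h2>Order total: ([\d.]+)<\/h2>/);
  return match ? Number(match[1]) : undefined;
}

// Data Layer: the value lives apart from the presentational layer.
// 'orderTotal' is an illustrative key, not a reserved name.
const dataLayer = [{ orderTotal: 449.75 }];

function readOrderTotal(layer) {
  return layer[0].orderTotal;
}

console.log(scrapeOrderTotal(markupV1)); // 449.75
console.log(scrapeOrderTotal(markupV2)); // undefined (the scraper broke)
console.log(readOrderTotal(dataLayer));  // 449.75, whatever the heading says
```

The scraper fails silently after the copy change, which is exactly the kind of data loss that tends to go unnoticed until someone looks at the reports.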

So, in short, the Data Layer is a data structure for storing, processing, and passing information about the context it exists in.

The Data Layer: The Non-Technical Perspective

For the marketer, the analyst, the executive, the communications officer, or other non-developer, the Data Layer is actually a list of business requirements and goals for each subset of the digital context.

For a web store, for example, business requirements and goals might include transactional information (what was purchased), user data (who made the purchase), spatial and temporal details (where and when the purchase was made), and information about possible micro conversions (did the user subscribe to product updates).

For another part of the same website, the business requirements and goals might include simply details about which social media channel brought the user to the website, or which pages the user has viewed more than once.

These are not technical specifications, but clearly defined lists of items that need to be collected in order to satisfy the business goals set for each business area of the website or other digital context.

Ideally, the Data Layer carries information which can be used by as many different tools / users / stakeholders as possible, but it’s very common that idiosyncrasies emerge. This is why it’s extremely important to treat the Data Layer as a living, agile model, not a stagnant, monolithic, singular entity.

Like any aspect of digital analytics, a Data Layer should be treated as something that’s constantly in flux. The data it holds must be optimized, elaborated, divided, conjoined, cleaned, refactored, and questioned whenever new business requirements emerge, or when previous goals turn out not to be beneficial to the business.

Google Tag Manager’s dataLayer

Since there’s no existing standard for the data model explored in this article (the effort is under way, though), the Data Layer can have many technical guises. The technical perspective I’ve chosen is the one that has evolved through Google Tag Manager. This is because I think, and I’m only slightly biased here, that dataLayer is one of the more elegant implementations of a structured data model in the web environment.

dataLayer is a JavaScript Array whose items are typically objects, each of which holds data in key-value pairs. The key is a variable name in String format, and the value can be any allowed JavaScript type. This is an example of dataLayer with different data types:

dataLayer = [{
    'products': [{
            'name': 'Kala Ukulele',
            'tuning': 'High-G',
            'price': 449.75
        },{
            'name': 'Fender Stratocaster',
            'tuning': 'Drop-C',
            'price': 1699
        }],
    'stores': ['Los Angeles', 'New York'],
    // A Date object for Sat Sep 13 2014 17:05:32 GMT+0200 (CEST)
    'date': new Date('2014-09-13T17:05:32+02:00'),
    'employee': {'name': 'Reggie'}
}];

Here we have values such as an Array of objects (the products), numerical values (price), an Array of Strings (stores), a date object, and a nested object (the employee name).

The point here is that dataLayer is generic and tool-agnostic. As long as it behaves like your typical JavaScript Array, it won’t be restricted to just one tool. The information in the dataLayer object above can be used by any application which has access to the global namespace of this page.
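As a rough illustration of what tool-agnostic means in practice, here is a sketch of a consumer that has nothing to do with any tag manager. The totalPrice function is hypothetical. It relies only on dataLayer being an Array of plain objects, and the data is a trimmed copy of the example above.

```javascript
// Same shape as the example above, trimmed for brevity.
const dataLayer = [{
  products: [
    { name: 'Kala Ukulele', price: 449.75 },
    { name: 'Fender Stratocaster', price: 1699 }
  ],
  stores: ['Los Angeles', 'New York']
}];

// A hypothetical consumer: sums product prices across every entry in the Array.
function totalPrice(layer) {
  let total = 0;
  for (const entry of layer) {
    for (const product of entry.products || []) {
      total += product.price;
    }
  }
  return total;
}

console.log(totalPrice(dataLayer)); // 2148.75
```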

How the data within this Array is processed is thus left to the tool. In Google Tag Manager, for example, an intermediate helper object is used to process data in dataLayer, which is then stored in an internal, abstract data model within the tool itself. This ensures that dataLayer can stay generic and tool-agnostic, but the data within is processed to comply with the idiosyncratic features of Google Tag Manager.

The helper object used by Google Tag Manager has a number of interesting features, such as:

  • A listener which listens for pushes to dataLayer. If a push occurs, the variables in the push are evaluated.

  • Get and set methods which process / manipulate dataLayer as a queue (first in, first out), and ensure that the special values (objects, Arrays) within the data model can be updated and appended correctly.

  • The ability to access commands and methods of objects stored in dataLayer, and the possibility of running custom functions in the context of the data model.

These are all transparent to Google Tag Manager’s users, of course, but they explain why, for example, the Data Layer Variable Macro can access dotted variable names (gtm.element) and properties (gtm.element.id) equally, and also why you can push multiple values with the same key into dataLayer but only the most recently pushed value is available for tags which fire after the push.
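To give a feel for these ideas, here is a heavily simplified sketch of such a helper. It is my own illustration and not Google Tag Manager’s actual implementation: createHelper and everything inside it are hypothetical names, Arrays are simply overwritten rather than appended, and dotted names are only resolved when reading, not when writing.

```javascript
// A simplified helper; NOT Google Tag Manager's actual implementation.
function isPlainObject(value) {
  return Object.prototype.toString.call(value) === '[object Object]';
}

function createHelper(dataLayer, onPush) {
  const model = {}; // internal, abstract data model

  // Recursively merge plain objects; any other value overwrites the old one.
  function merge(target, source) {
    for (const key of Object.keys(source)) {
      const value = source[key];
      if (isPlainObject(value)) {
        if (!isPlainObject(target[key])) target[key] = {};
        merge(target[key], value);
      } else {
        target[key] = value;
      }
    }
  }

  function process(entry) {
    merge(model, entry);
    if (onPush) onPush(model, entry);
  }

  // Process entries already in the queue, oldest first.
  dataLayer.forEach(process);

  // Listen for later pushes by wrapping the Array's own push method.
  const originalPush = dataLayer.push;
  dataLayer.push = function (...entries) {
    const length = originalPush.apply(dataLayer, entries);
    entries.forEach(process);
    return length;
  };

  return {
    // Resolve a dotted name such as 'employee.name' against the model.
    get(name) {
      return name.split('.').reduce(
        (node, part) => (node == null ? undefined : node[part]),
        model
      );
    }
  };
}

const dataLayer = [{ weather: 'Cloudy', employee: { name: 'Reggie' } }];
const helper = createHelper(dataLayer);
dataLayer.push({ weather: 'Sunny' });

console.log(helper.get('weather'));       // 'Sunny'
console.log(helper.get('employee.name')); // 'Reggie'
```

Note that both pushes are still in the dataLayer Array itself. Only the model inside the helper has collapsed them into the most recent value for each key.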

Since the abstract data model within Google Tag Manager only respects the most recent value of any variable name, the organization must decide where and when Data Layer as a business component becomes dataLayer the Array structure. This is the topic of the next chapter.

From Business Goals To Technical Implementation

The most common approach, I believe, is that the business requirements and goals are translated into a set of key-value pairs, which must be rendered / deployed by server-side code, so that dataLayer is populated with all the necessary data before the GTM container snippet loads.

Naturally, you could do it with client-side code, and it doesn’t have to be pre-populated, but business-critical data is best secured if it’s rendered into dataLayer at the earliest possible moment in the page load, so that data loss is minimized if the user decides to leave the page before dataLayer has rendered.

Here’s an example. We have a page with the following business requirements that we want to track as business goals:

  1. User ID - because we want to track the entire user journey, not just session-by-session or device-by-device

  2. Internal user - because we want to filter out our own employees’ traffic from the data

  3. Weather at time of visit - because we want to see how weather affects visit behavior

This is a simple, albeit nonsensical, list of business requirements that have a direct impact on how we track goals for this part of the website. The list still needs more detail: example values for each variable, their scope (hit, session, user, or product, for example), whether they should persist from page to page, and so on. I won’t do this now, since it very much depends on how your organization handles projects that span different departments or business domains.
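Purely as an illustration of the shape such details could take, here is a sketch of the three requirements written down as a machine-readable specification, together with a small check against it. The field names, the scopes, and the persistence flags are my assumptions, not a recommendation.

```javascript
// Hypothetical specification entries; field names and values are illustrative only.
const variableSpec = [
  { key: 'userId',       type: 'string',  scope: 'user',    persist: true,  example: 'abf5-3245-ffd1-23ed' },
  { key: 'internalUser', type: 'boolean', scope: 'user',    persist: true,  example: true },
  { key: 'weather',      type: 'string',  scope: 'session', persist: false, example: 'Cloudy' }
];

// Report keys that are missing from an entry or have an unexpected type.
function validate(entry, spec) {
  const problems = [];
  for (const { key, type } of spec) {
    if (!(key in entry)) {
      problems.push(`${key}: missing`);
    } else if (typeof entry[key] !== type) {
      problems.push(`${key}: expected ${type}, got ${typeof entry[key]}`);
    }
  }
  return problems;
}

console.log(validate({ userId: 'abf5-3245-ffd1-23ed', internalUser: 'yes' }, variableSpec));
// [ 'internalUser: expected boolean, got string', 'weather: missing' ]
```

A check like this could run in a test suite or a debug build, so that a drifting implementation is caught before it reaches the reports.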

Anyway, an example of dataLayer, rendered before the container snippet, might look like this:

<script>
    window.dataLayer = window.dataLayer || [];
    dataLayer.push({
        'userId' : 'abf5-3245-ffd1-23ed',
        'internalUser' : true,
        'weather' : 'Cloudy'
    });
</script>
<!-- GTM Container Snippet Code Here -->

As you can see, the data is rendered before the GTM container snippet, so that all tags that fire as soon as GTM is loaded can use this data.
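How the snippet gets rendered depends entirely on your server-side stack. As one possible sketch, assuming a Node.js backend, a hypothetical helper (renderDataLayerSnippet is my own name, not part of Google Tag Manager) could serialize the values like this:

```javascript
// Hypothetical server-side helper (Node.js); not part of Google Tag Manager.
function renderDataLayerSnippet(data) {
  // Escape "<" so a value such as "</script>" cannot end the script block early.
  const json = JSON.stringify(data).replace(/</g, '\\u003c');
  return [
    '<script>',
    '    window.dataLayer = window.dataLayer || [];',
    `    dataLayer.push(${json});`,
    '</script>',
    '<!-- GTM Container Snippet Code Here -->'
  ].join('\n');
}

console.log(renderDataLayerSnippet({
  userId: 'abf5-3245-ffd1-23ed',
  internalUser: true,
  weather: 'Cloudy'
}));
```

Serializing with JSON.stringify instead of concatenating strings by hand keeps quotes and special characters in the values from breaking the script, which matters as soon as any value comes from user input.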

Do note that you can and will use dataLayer within the confines of Google Tag Manager as well, since your tags or other on-page libraries might well push data into the structure after this pre-load sequence. I don’t think these dynamic pushes or data exchanges need to be documented as carefully, since they occur solely in the domain of the tool that does the pushes. Thus, documentation and version control are left up to the sophistication of the tool itself.

The reason you need to put a lot of thought behind the pre-rendered dataLayer is that each new stakeholder makes the question of governance a bit more complex.
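For comparison, a dynamic push after the pre-load sequence might look something like the sketch below. The handler and the event name are made up, and a local Array stands in for window.dataLayer so that the sketch runs anywhere. The 'event' key is the one Google Tag Manager uses to trigger tags.

```javascript
// In a browser this would be window.dataLayer, pre-populated by the server.
const dataLayer = [{ userId: 'abf5-3245-ffd1-23ed', internalUser: true, weather: 'Cloudy' }];

// Hypothetical handler, called after the user subscribes to product updates.
function onNewsletterSubscribe(layer) {
  layer.push({
    event: 'newsletterSubscribed',   // illustrative event name
    subscriptionType: 'productNews'  // illustrative key and value
  });
}

onNewsletterSubscribe(dataLayer);
console.log(dataLayer.length); // 2
```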

Governance Of The Data Layer

Coming up with a good governance model is difficult. Coming up with one for a data structure which is at the mercy of a number of different parties, all with varying levels of expertise (and general interest), is even more difficult.

Nevertheless, a well-defined, structured, and formalized governance model is probably the one thing that will prevent your analytics organization from imploding due to missteps in operating with a Data Layer.

A governance model, in this context, is a document (or documentation) which describes as clearly as possible the Data Layer, its parts, the business domains it’s deployed in, its various owners, its version history, its variables, how risk management is handled, etc.

This is a very fluid concept, and it really depends on the organization how they want to organize themselves around this project, but ideally this is the kind of governance model I’d be happy to work with:

I Introduction

  • Purpose of the document
  • Who this document is for
  • Table of contents

II Version History

  • What was revised
  • When it was revised
  • By whom it was revised

III Ownership

  • What does ownership mean
  • Who owns the process
  • What are the rights and privileges of the owner

IIIa Stakeholders

  • Who has a stake in Data Layer (tools, platforms, departments, agencies, third parties)
  • What is their role
  • What are their rights and privileges

IIIb Technical Specifications

  • Who owns the technical Data Layer (IT, most often, or a very enlightened marketer)
  • What is their role
  • What are their rights and privileges

IIIc Management

  • Who owns the business requirements (head of marketing, or some similar role in the client organization)
  • What is their role
  • What are their rights and privileges

IV Process Distribution

  • What parties use Data Layer
  • What are their special requirements
  • How to avoid conflicts between different stakeholders

V Risk Management

  • What are the risks
  • What is their severity
  • What is their probability
  • Who owns the risks (and any actions taken to mitigate them)

VI Data Layer Management Model

  • How to plan for updates
  • How to implement updates
  • Who deploys the updates
  • Who tests the updates
  • Who needs to be notified
  • Who updates the document
  • How to avoid conflicts

VII Data Layer Technical Description

  • What is the underlying data structure
  • How is this structure translated into each tool’s own data model
  • Are there reserved variable names or other potential sources of conflict

VIII Data Layer Variables

  • Business requirements translated to data layer variables
  • Sorted by business domain
  • Example values, scope, parameters, expected types
  • Where the data comes from
  • How the data is used
  • And so on…

I know, it looks horrible. And probably unusable for many. However, having a document like this that is also constantly updated not only provides you with some contractual security, but it also keeps everyone up to date on the most recent structure and format of Data Layer.

Does this document need to be consulted / updated when you create a new JavaScript code snippet which calculates the number of images on the page?

Probably not.

Does this document need to be consulted / updated when you’re implementing a conversion pixel which also uses Transaction Value?

Most likely.

Does this document need to be consulted / updated when you’re deploying enhanced eCommerce?

Absolutely.

It doesn’t have to be larger than life or a huge complication. Just have some concrete description of Data Layer available at all times, and at the very least, agree in writing on how the Data Layer is updated and by whom. This way you’ll save a lot of trouble in the long run, when unwarranted changes are about to happen.

Conclusions And Further Reading

I think Data Layer is a very difficult concept to grasp. This isn’t just because for most it’s a technical thing, but because most don’t realize it’s also very much a list of business requirements.

Translating business goals to a well-formed, lean, and 100% utilized Data Layer is really difficult. I honestly think one of the biggest mistakes is to follow the waterfall model, where a huge list of requirements is jotted down at the beginning of the project, then translated into a Data Layer which appears on every single page on the site, and after that point the structure is never touched again.

This doesn’t work.

The waterfall model is flawed thanks to human fallibility. We simply can’t design or predict the final shape of something as vast as an entire layer of semantic data, which might cover almost every single aspect of our digital context. It has to be agile. There has to be a mutual understanding that the shape of the layer becomes clearer with time.

Start small and scale up, if you have time. If you’re in a rush, focus solely on the business-critical requirements.

Whatever you do, make sure there’s a process in place which lets you suggest modifications to Data Layer quickly. This requires a lot of lubrication, education, and knowledge transfer. That’s why I think the most important thing in any data project is to start with educating all parties about what the other parties are doing in the project. Make the marketers more development-minded and the developers more respectful of your marketing efforts.

That way everyone wins, and you’ll have a beautiful Data Layer in no time.

Further reading:

P.S. If anyone knows a really good article about governance of semantic data, I’d love to read it and link to it in this post.