For a long time, I considered standard Google Analytics reports to be the best way to get useful insights. From time to time, I struggled with sampling, limitations, and weird results, but I didn’t see a way around it—until I discovered Google Analytics 360 and raw data exports into Google BigQuery.
After a few hours playing around with SQL, I was already able to deliver insights I never could have with aggregated Google Analytics reports. Since that day, I’ve been exploring how raw data can be a web analyst’s best friend.
Now that more and more tools allow you to access raw data—and cloud storage has become more powerful—web analysts should shift focus, too.
In this article, I will:
- Introduce you to the differences between raw and aggregated data;
- Show you what to expect out of this new access;
- Demonstrate how you can get your hands on this data (cheaply);
- Give you practical examples with BigQuery to apply to your favorite dataset.
Table of contents
What’s the difference between raw and aggregated data in Google Analytics?
Google Analytics, in the free version, provides only aggregated data. That means you can’t get all information pageview by pageview, event by event. Obviously, you can get a lot of information through the User Explorer report, but this is limited, not scalable, and not downloadable.
For regular use, aggregated data is usually enough. After all, most questions we answer are pretty basic:
- Which campaigns bring more conversions?
- Do more and more people use Feature X?
- Where do my users come from?
- Which devices do they use?
Answering these questions doesn’t require raw data. Default or custom reports in Google Analytics do the job. So why dive deep into millions of rows of precise data?
The problem with aggregated data is that, well, it’s aggregated—you’re mixing together a bunch of user behavior, sometimes hiding the most interesting facts.
Let’s take a simple example with pages per session. Say you have two sources with six sessions each that have the following number of pages per session:
- Source A: 1; 1; 2; 2; 2; 10;
- Source B: 2; 2; 3; 3; 4; 4.
Looking at the raw data, you can see that if you omit the statistical outlier of 10 pages, Source A has far less engagement. However, if you check only average data, it’s the same as Source B: 3 pages. (The median would be different.)
So why isn’t this more granular approach the default for Google Analytics? Because of calculation costs. When you provide only sampled, aggregated data, you don’t need to go through millions of rows on every report. It makes sense that the free version of Google Analytics doesn’t provide free advanced calculations.
Now, they can bill you based on actual data usage in BigQuery. And, as web analysts, we need to be fully aware of exactly what you get for that investment.
What can you get out of raw data?
Now that you see the limitations of aggregated data, let’s walk through a few use cases for raw data:
1. Event timing
One thing that’s seriously missing from Google Analytics? Timing. There’s no simple way to know the actual time interval between an add-to-cart and a purchase, whether within the same session or not.
Of course, you can store some timings in a cookie and do your own calculations. But this feels like reinventing the wheel—Google Analytics is supposed to have collected this data already!
With raw data analysis, you can easily get the precise timing of an event for a given user and compare it to another event for the same user. For a larger analysis, you can aggregate data any way you decide: average, median, percentile distribution, or some advanced statistical model.
Isn’t it important to know that 20% of your users convert within 2 minutes and 10% take more than 7 days? Don’t you think you should communicate differently with these two groups of visitors?
2. User scope analysis
In Google Analytics reports, even Google Analytics 360, user segmentation is limited to 90 days. For some businesses, especially ones with a long decision process, this look-back window isn’t enough.
With raw data, you can answer questions like:
- Are users acquired during the holiday season more likely to buy in September than other types of users?
- What effect does watching a video have over the course of a year? Does it help with conversions?
If you store raw data, you can keep event logs for as long as you want. Just check with your data protection officer if the duration conforms with the purpose of treatment.
3. Correlation
A correlation coefficient shows the statistical relationship between two variables. With large data, it can be insightful to measure the relationship between two behaviors, such as:
- The impact of pageviews per topic on purchase. Is there a correlation between types of content a user reads and what they purchase?
- Related products. If I buy Product A, what category of product has a positive correlation with this product?
4. Third-party data
Last but not least, storing raw data can be a total game-changer if you join forces with other data sources. Here are a few examples:
Ecommerce data. This is especially valuable if you store a Google Analytics’ client ID along with any add-to-cart or checkout actions.
You can calculate a more accurate conversion rate as you can get transaction information even for users who didn’t trigger Google Analytics on the confirmation page (e.g., ad blockers prevent tags from firing, bank services don’t redirect, confirmation pages take too long to load, etc.).
Further, as you’re now using your own data, you can remove revenue from cancelled transactions or returns. You can also calculate more advanced and confidential metrics, such as margin instead of revenue.
CRM data. What’s more annoying than realizing that a campaign that generated a bunch of leads generated a bunch of irrelevant leads? This is a challenge for most B2B sites.
As long as you can export your CRM data with a unique lead ID (e.g., client ID, SHA-256 hashed email, generated ID, etc.), it’s fairly easy to pair that data with a Google Analytics client ID and calculate conversion rates (not just lead-generation rates) for your campaigns.
For multi-channel analysis, you may have to do a more advanced query, but the good news is that you have total control over how it’s calculated.
Offline events. Online businesses are impacted by the offline context—holidays, weather, strikes, or a deadly virus forcing half the world into a lockdown.
In Google Analytics, there’s no way to add a custom dimension to a specific date range. Annotations are for the user interface, not calculations.
However, wouldn’t it be nice to know the impact of holidays on your business? If you sell globally, wouldn’t you want to know which countries’ sales are most impacted by bank holidays?
Obviously, for such analysis you would have to gather information in a data-readable format. But once it’s done, you have a world full of relevant information to share.
Ads, crawlers, logs—you name it. Once you get used to storing all your data in the same warehouse and running join analyses with analytics data, you can let your wildest data-dreams come true:
- Does longer content perform better? A proper crawler (e.g., Screaming Frog) could allow you to compare pageviews and content length.
- Are crawl anomalies affecting SEO? By storing your log data in BigQuery, you can check for correlations between Googlebot’s visits and changes to search performance.
- What’s the real ROI of your marketing? Set up your own attribution calculations to measure the return on ad spend across all ad platforms.
Tools to get raw analytics data
Now that you’re aware of the benefits of raw analytics data, I’ll show you a few ways to get that data into a data warehouse.
Google Analytics 360
If you’re rich/lucky enough to get your hands on Google Analytics 360, you get an export of raw data to Google BigQuery right out of the box.
All information, including enhanced ecommerce, is exported. Every row represents a session, and you get to play with a lot of dimensions and metrics.
Google Analytics App+Web and Firebase
Very good news—you can now export to Google BigQuery for the web even though you’re not paying for 360. Indeed, Firebase, the core of Google Analytics App+Web, allows exports to Google BigQuery.
To do it, you just have to switch to Blaze invoicing, which has a pay-as-you-go scheme. For a large website, you may have to monitor your budget, but for most sites you pay only a few dollars per month (even zero if you have a small website).
Every row represents an event, which includes a screen/pageview. You have to get used to the very specific way that Firebase presents the data; it’s different than what you’re used to in Google Analytics.
Still, this is a solution I totally recommend to start working with raw data.
Other free tools: Yandex.Metrica and Matomo
I haven’t tried every tool. Most paid tools offer a raw data export. I would like to mention two free tools, however, that also have this option.
- Yandex.Metrica is 100% free and provides raw data through their logs API.
- Matomo is an open-source analytics tool that may need to be installed on your own server—you get raw data right in your database.
Data pipeline
Another way to get Google Analytics right into a data warehouse is to use a pipeline. OWOX BI has a pretty solid pipeline from Google Analytics to BigQuery. It works through a custom task in Google Analytics. It basically copies Google Analytics’ entire payload to their own endpoint.
If you’re savvy enough, you can also create your own endpoint using cloud functions or log analysis. I recommend two articles to inspire you:
- How to build a GTM monitor by Simo Ahava. You’ll learn about sending data to BigQuery using Cloud Functions. The limitations are 100,000 rows per second integrated in BigQuery. If you’re above that, you may have to batch hits from some logs.
- Serverless Pixel Tracking Architecture in Google Cloud Help Center. There’s a process to create your own tracking pixel with an integration into BigQuery.
Examples and case studies with BigQuery
Now that you know the benefits of working with raw data and how to access that data, let’s go through a couple example that really showcase what you can do.
Correlation between topics and transactions in Google Analytics 360
- Context: A newspaper with online subscriptions.
- Goal of the analysis: Do topics read by users correlate with transactions?
Results and conclusion
corr_culture 0.397
corr_opinion 0.305
corr_lifestyle 0.0468
corr_sport 0.009
The more articles a user reads about culture and opinion, the more likely they are to be interested in a subscription. On the other hand, lifestyle and sports sections are less likely to lead to subscriptions.
Query on BigQuery
SELECT
CORR(culture,transac) AS corr_culture,
CORR(opinion,transac) AS corr_opinion,
CORR(lifestyle,transac) AS corr_lifestyle,
CORR(sport,transac) AS corr_sport
FROM(
SELECT
SUM(IF(hit.page.pagePath LIKE'/culture%',
1,
0)) AS culture,
SUM(IF(hit.page.pagePath LIKE'/opinion%',
1,
0)) AS opinion,
SUM(IF(hit.page.pagePath LIKE'/lifestyle%',
1,
0)) AS lifestyle,-
SUM(IF(hit.page.pagePath LIKE'/sport%',
1,
0)) AS sport,
COUNT(hit.transaction.transactionId) AS transac
FROM
`mydatabase.view_id.ga_sessions_*`,
UNNEST(hits) AS hit
WHERE
_TABLE_SUFFIX BETWEEN '20191201' AND '20200301'
GROUP BY
fullVisitorId
ORDER BY
transac DESC
)
Cohort analysis with Firebase
- Context: An app with regularly updated content and high seasonality.
- Goal of the analysis: How do users behave after the first install? When is it the best time to attract users who will come back regularly?
Results and conclusion
month | september_cohort | october_cohort | november_cohort | december_cohort |
---|---|---|---|---|
201909 | 2228 | 2 | 3 | 4 |
201910 | 436 | 2233 | 3 | 7 |
201911 | 264 | 369 | 2076 | 2 |
201912 | 256 | 247 | 370 | 2057 |
202001 | 196 | 177 | 261 | 412 |
202002 | 151 | 158 | 179 | 229 |
202003 | 222 | 188 | 201 | 241 |
Users who first opened the app in September and December seem to have the highest recurring usage.
Query on BigQuery
# change my-app.analytics_123456789 to your ID
WITH cohorte_september
AS
(
WITH
user_september AS
(
SELECT DISTINCT user_pseudo_id AS user
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20190901' AND '20190930'
AND
event_name="first_open"
)
SELECT sessions, month
FROM
(
(SELECT COUNT(DISTINCT user_pseudo_id) AS sessions,"201909" AS month
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20190901' AND '20190930'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_september)
)
UNION ALL
(SELECT COUNT(DISTINCT user_pseudo_id),"201910"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191001' AND '20191031'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_september)
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"201911"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191101' AND '20191130'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_september)
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"201912"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191201' AND '20191231'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_september)
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202001"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200101' AND '20200131'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_september)
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202002"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200201' AND '20200229'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_september)
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202003"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200301' AND '20200331'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_september)
)
)
ORDER BY month ASC
),
#october
cohorte_october AS
(
WITH
user_october AS
(
SELECT DISTINCT user_pseudo_id AS user
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191001' AND '20191031'
AND
event_name="first_open"
)
SELECT sessions, month
FROM
(
(SELECT COUNT(DISTINCT user_pseudo_id) AS sessions,"201909" AS month
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20190901' AND '20190930'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_october )
)
UNION ALL
(SELECT COUNT(DISTINCT user_pseudo_id),"201910"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191001' AND '20191031'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_october )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"201911"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191101' AND '20191130'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_october )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"201912"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191201' AND '20191231'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_october )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202001"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200101' AND '20200131'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_october )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202002"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200201' AND '20200229'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_october )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202003"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200301' AND '20200331'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_october)
)
)
ORDER BY month ASC
),
#november
cohorte_november AS
(
WITH
user_november AS
(
SELECT DISTINCT user_pseudo_id AS user
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191101' AND '20191130'
AND
event_name="first_open"
)
SELECT sessions, month
FROM
(
(SELECT COUNT(DISTINCT user_pseudo_id) AS sessions,"201909" AS month
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20190901' AND '20190930'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_november )
)
UNION ALL
(SELECT COUNT(DISTINCT user_pseudo_id),"201910"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191001' AND '20191031'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_november )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"201911"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191101' AND '20191130'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_november )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"201912"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191201' AND '20191231'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_november )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202001"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200101' AND '20200131'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_november )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202002"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200201' AND '20200229'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_november )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202003"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200301' AND '20200331'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_november )
)
)
ORDER BY month ASC
),
#decembre
cohorte_decembre AS
(
WITH
user_decembre AS
(
SELECT DISTINCT user_pseudo_id AS user
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191201' AND '20191231'
AND
event_name="first_open"
)
SELECT sessions, month
FROM
(
(SELECT COUNT(DISTINCT user_pseudo_id) AS sessions,"201909" AS month
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20190901' AND '20190930'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_decembre )
)
UNION ALL
(SELECT COUNT(DISTINCT user_pseudo_id),"201910"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191001' AND '20191031'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_decembre )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"201911"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191101' AND '20191130'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_decembre )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"201912"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20191201' AND '20191231'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_decembre )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202001"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200101' AND '20200131'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_decembre )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202002"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200201' AND '20200229'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_decembre )
)
UNION ALL
(
SELECT COUNT(DISTINCT user_pseudo_id),"202003"
FROM `my-app.analytics_123456789.events_*`
WHERE
_table_suffix between '20200301' AND '20200331'
AND event_name="session_start"
AND user_pseudo_id IN (SELECT user FROM user_decembre )
)
)
ORDER BY month ASC
)
SELECT cohorte_september.sessions AS september_cohort,
cohorte_october.sessions AS october_cohort,
cohorte_november.sessions AS november_cohort,
cohorte_decembre.sessions AS december_cohort,
month
FROM cohorte_september
JOIN cohorte_october USING (month)
JOIN cohorte_november USING (month)
JOIN cohorte_decembre USING (month)
ORDER BY month ASC
Conclusion
In the next few months, Google App+Web will certainly become the new standard. This will come with more integrations between the Google Marketing Platform and Google Cloud Platform, especially BigQuery. If your SQL skills are a little rusty, I strongly advise you to get a refresh and play around with some demo data.
With easy access to raw data, quick and efficient calculations, and powerful data visualization, advanced digital analytics is becoming more and more mature. The future of all this is probably tighter integration with other business data.
Some have been telling us for ages that business intelligence and digital analytics should work together. Slowly but surely, it’s becoming a reality.