Evan Boehs website Mastodon PGP Key email A drawing of an astronaut in space United States is jumping for joy

Notes on the cloudflare web analytics api

in
garden
tags

So the particular technical challenge on hand is that cloudflare is stingy about what analytics data they will give me over the API and I want all the analytics data to stop using cloudflare and start using umami again for this. Cloudflare likes aggerate statistics. Umami does not.

I mean, actually, aggerate statistics aren’t a bad idea, just they make it difficult for this one use case.

The cloudflare dashboard makes queries like

I will save you time
query GetRumAnalyticsTopNs {
    viewer {
        accounts(filter: { accountTag: $accountTag }) {
            total: rumPageloadEventsAdaptiveGroups(filter: $filter, limit: 1) {
                count
                sum {
                    visits
                }
            }
            topReferers: rumPageloadEventsAdaptiveGroups(
                filter: $filter
                limit: 15
                orderBy: [$order]
            ) {
                count
                avg {
                    sampleInterval
                }
                sum {
                    visits
                }
                dimensions {
                    metric: refererHost
                }
            }
            topPaths: rumPageloadEventsAdaptiveGroups(
                filter: $filter
                limit: 15
                orderBy: [$order]
            ) {
                count
                avg {
                    sampleInterval
                }
                sum {
                    visits
                }
                dimensions {
                    metric: requestPath
                }
            }
            topHosts: rumPageloadEventsAdaptiveGroups(
                filter: $filter
                limit: 15
                orderBy: [$order]
            ) {
                count
                avg {
                    sampleInterval
                }
                sum {
                    visits
                }
                dimensions {
                    metric: requestHost
                }
            }
            topBrowsers: rumPageloadEventsAdaptiveGroups(
                filter: $filter
                limit: 15
                orderBy: [$order]
            ) {
                count
                avg {
                    sampleInterval
                }
                sum {
                    visits
                }
                dimensions {
                    metric: userAgentBrowser
                }
            }
            topOSs: rumPageloadEventsAdaptiveGroups(
                filter: $filter
                limit: 15
                orderBy: [$order]
            ) {
                count
                avg {
                    sampleInterval
                }
                sum {
                    visits
                }
                dimensions {
                    metric: userAgentOS
                }
            }
            topDeviceTypes: rumPageloadEventsAdaptiveGroups(
                filter: $filter
                limit: 15
                orderBy: [$order]
            ) {
                count
                avg {
                    sampleInterval
                }
                sum {
                    visits
                }
                dimensions {
                    metric: deviceType
                }
            }
            countries: rumPageloadEventsAdaptiveGroups(
                filter: $filter
                limit: 200
                orderBy: [$order]
            ) {
                count
                avg {
                    sampleInterval
                }
                sum {
                    visits
                }
                dimensions {
                    metric: countryName
                }
            }
        }
    }
}

and

{
    "accountTag": "<x>",
    "filter": {
        "AND": [
            {
                "datetime_geq": "2023-12-12T04:21:53Z",
                "datetime_leq": "2024-01-11T04:21:53Z"
            },
            {
                "OR": [
                    {
                        "siteTag": "<x>"
                    }
                ]
            }
        ]
    },
    "order": "count_DESC"
}

This is how this API is supposed to be used. Lets abuse it.

I tried something like

{
    viewer {
        accounts(filter: { accountTag: $tag }) {
            rumPageloadEventsAdaptiveGroups(
                filter: { datetime_gt: $start }
                limit: 1000
                orderBy: [datetimeMinute_ASC]
            ) {
                dimensions {
                    countryName
                    date
                    datetimeMinute
                    deviceType
                    refererHost
                    refererPath
                    requestHost
                    requestPath
                    userAgentBrowser
                    userAgentOS
                }
                avg {
                    sampleInterval
                }
                sum {
                    visits
                }
                count
            }
        }
    }
}

and

{
    "tag": "<x>",
    "start": "2024-01-01T02:07:05Z"
}

I thought my query would work because there is just no way all these people will fall into minute groups. But I was only getting 60 back. Hmm. What’s important to understand is all this is is just guesses. One in 10 requests or so are saved, and in the grand scheme of things it all works out. So we need to run with that. Count and sum.visits are the same for me.

Umami Internals

There are two tables of interest here

website_event

event_id: UUID()
website_id: const UUID
session_id: ref session.session_id
created_at: date
url_path: string
url_query: string
referrer_path: string
referrer_query: string
referrer_domain: string
page_title: string
event_type: 1
event_name: ""

session:

session_id: UUID()
website_id: const UUID
hostname: string
browser: string
os: string
device: string
screen: string
language: string
country: string
subdivision1: ""
subdivision2: ""
city: ""
created_at: date

Most of this data maps over pretty well. The question really is, how do we map sessions to sessions and pages to pages (keeping in mind, of course, that all we are getting out of this is aggregate statistics)

lets take a closer look at what we are getting back from the cloudflare API

"avg": {
    // Average sample interval
    "sampleInterval": 10
},
// The number of pages viewed by end-users
"count": 10,
"sum": {
    // The number of pages viewed by end-users that were initiated from a different website (i.e. where the Document.referrer does not match the hostname)
    "visits": 10
}

So each visit corresponds to a session and each count or avg corrosponds to the amount of pages those 10 users visited (for all we know they reloaded the page an average of 2 times in this cluster):

v.data.viewer.accounts[0].rumPageloadEventsAdaptiveGroups.reduce(
    (a, b) => a + b.avg.sampleInterval,
    0,
); // 660
v.data.viewer.accounts[0].rumPageloadEventsAdaptiveGroups.reduce(
    (a, b) => a + b.count,
    0,
); // 660
v.data.viewer.accounts[0].rumPageloadEventsAdaptiveGroups.reduce(
    (a, b) => a + b.sum.visits,
    0,
); // 460
v.data.viewer.accounts[0].rumPageloadEventsAdaptiveGroups.filter(
    (a) => a.count != a.avg.sampleInterval,
); // []

All of these are multiples of 10. What’s interesting is that visits (what denotes a session) can be 0. This would indicate they clicked around on my site!

Let’s verify if that is indeed the case:

v.data.viewer.accounts[0].rumPageloadEventsAdaptiveGroups
    .filter((a) => a.sum.visits == 0)
    .map((a) =>
        v.data.viewer.accounts[0].rumPageloadEventsAdaptiveGroups.find(
            (b) =>
                b.dimensions.userAgentBrowser ==
                    a.dimensions.userAgentBrowser &&
                b.dimensions.deviceType == a.dimensions.deviceType &&
                a.dimensions.countryName == b.dimensions.countryName &&
                b.sum.visits > 0,
        ),
    ); // [{…}, {…}, {…}, undefined, undefined, undefined, {…}, undefined, undefined, undefined, undefined, {…}, undefined, {…}]

This is to be expected, of course. You need to be lucky to get samples consistent here, what’s to say that the first page the person visited counts as a sample? Well the ones we did match together we can happily associate.

Also just keep in mind that when we dry run at the very least the amount of pages should be the same.

What to do about the rest? I suppose we just need to make extra sessions.

Let’s make a game plan:

  1. Step through the list. For each visits, make a session. To determine the amount of time between the created_ats of each session, space each one out by ((time_of_next_group - time_of_this_group) / number_of_visits), starting at time_of_this_group
  2. Step through the list again. If there are 10 counts and 10 visits, make one event for each session. if there are 20 counts and 10 visits, make 2 events for each session. If there are no visits, continue to step 3
  3. If there are no sessions for this group, try to find another group with matching sessions. If you do, associate these events with that group. If you don’t, just make count worth of sessions. Not worth the drama.
  4. Insert

a little hacking later and https://github.com/boehs/site/blob/master/utils/cloudflare-analytics-to-umami-lord-save-me.ts is born

/node/notes-on-the-cloudflare-web-analytics-api.html