Trend Metrics Over Several Crawls

Why do we need a dedicated collection?

Each crawl lives in its own BQL collection. For instance, crawl.20230131 targets the crawl of January 31, 2023, while crawl.20221231 targets the crawl of December 31, 2022.

If we want to track the number of 404s per crawl, for example, we could make the following query:

{
    "collections": [
        "crawl.20230131",
        "crawl.20221231"
    ],
    "periods": [],
    "query": {
        "dimensions": [],
        "metrics": [
            {
                "field": "crawl.20230131.count_urls_crawl",
                "filters": {
                    "field": "crawl.20230131.http_code",
                    "predicate": "eq",
                    "value": 404
                }
            },
            {
                "field": "crawl.20221231.count_urls_crawl",
                "filters": {
                    "field": "crawl.20221231.http_code",
                    "predicate": "eq",
                    "value": 404
                }
            }
        ]
    }
}

It works, but:

  • It makes use of several collections that must be queried independently.
  • We have no choice but to use metric filters so that each metric is filtered using the corresponding crawl collection.

To retrieve the number of 404s for the last 20 crawls, the BQL query will become very cumbersome to write with 20 collections involved, and it will also be very inefficient to process.

For this reason, we created a dedicated collection to easily and efficiently trend metrics over several crawls. It is called trended_crawls.

Content of the trended_crawls Collection

Dimensions

All the dimensions available in the individual crawl collections are also available in trended_crawls: depth, http_code, content_type, byte_size, indexable.is_indexable, etc.

There is one extra dimension: date. Like all timestamped collections, the trended_crawls collection has a date dimension, which corresponds to the crawl date. For example, data coming from the crawl of January 31, 2023 is registered with the date 20230131.

Metrics

Each crawl collection contains a single metric, count_urls_crawl (Number of URLs on Crawl). The corresponding metric in trended_crawls is called count_urls_trended_crawls and behaves in exactly the same way.

We also offer several metrics that can only be found in trended_crawls. Below are a few examples:

Name                                                | Field                         | Type
No. of Indexable Pages                              | count_indexable               | Integer
No. of Pages at Level N                             | count_depth_N                 | Integer
Avg. HTML Load Time                                 | average_delay_total           | Float
No. of Pages with Nofollow Meta Tag                 | count_nofollow_meta_tag       | Integer
No. of Pages with Internal Outlinks with 5xx Error | count_url_outlinks_errors_5xx | Integer

Please use the Collections Explorer to get the full list of metrics available in trended_crawls.
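
As an illustration, here is a minimal sketch of a query returning the number of indexable pages per crawl date over 2022, using the count_indexable metric from the table above (the period and field syntax are detailed in the Examples of Queries section below):

{
    "collections": [
        "trended_crawls"
    ],
    "periods": [
        [
            "20220101",
            "20221231"
        ]
    ],
    "query": {
        "dimensions": ["trended_crawls.period_0.date"],
        "metrics": ["trended_crawls.period_0.count_indexable"]
    }
}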

Segmentation

Each crawl can have its own segmentation, but for the segment fields to make sense in trended_crawls, a common segmentation is needed. For this reason, all crawls inside trended_crawls are segmented using the most up-to-date segmentation, that is, the segmentation currently defined in your Segment Editor.
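
For example, a sketch of a query breaking down the number of crawled URLs per crawl date and per segment could look like the following. The field segments.pagetype.value is only an assumption here: the actual segment field depends on the segmentation defined in your Segment Editor.

{
    "collections": [
        "trended_crawls"
    ],
    "periods": [
        [
            "20220101",
            "20221231"
        ]
    ],
    "query": {
        "dimensions": [
            "trended_crawls.period_0.date",
            "trended_crawls.period_0.segments.pagetype.value"
        ],
        "metrics": ["trended_crawls.period_0.count_urls_trended_crawls"]
    }
}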

Examples of Queries

No. of 404s Over Time

Let's rewrite the query of the first section, but this time using trended_crawls and querying all crawls of 2022:

{
    "collections": [
        "trended_crawls"
    ],
    "periods": [
        [
            "20220101",
            "20221231"
        ]
    ],
    "query": {
        "dimensions": ["trended_crawls.period_0.date"],
        "metrics": ["trended_crawls.period_0.count_urls_trended_crawls"],
        "filters": {
            "field": "trended_crawls.period_0.http_code",
            "predicate": "eq",
            "value": 404
        },
        "sort": [
            {
	              "type": "dimensions",
                "index": 0,
                "order": "asc"
            }
        ]
    }
}

This query is now simpler, with only one collection and one filter, and more powerful: thanks to the targeted period, we query all crawls of 2022 at once. The sort clause is optional; it is used here to sort the results in ascending order of date. Here is a sample of the results:

{
    "results": [
        {
            "dimensions": [
                "2022-01-01"
            ],
            "metrics": [
                1645
            ]
        },
        {
            "dimensions": [
                "2022-02-01"
            ],
            "metrics": [
                551
            ]
        },
        ...
        {
            "dimensions": [
                "2022-12-01"
            ],
            "metrics": [
	            498
            ]
        },
    ],
    ...
}

No. of Slow URLs with at Least One Organic Impression

The date dimension is common to all timestamped collections, so we can effortlessly cross the crawl data of each day with the data of search_console, visits, or any other timestamped collection for the same day.

The following query counts, for each crawl date, the crawled URLs that also appear in Google Search Console data for the corresponding day (via the url_exists_search_console filter), i.e. URLs that received at least one organic impression. A sketch adding the load-time condition for slow pages is shown after the query:

{
    "collections": [
        "trended_crawls",
        "search_console"
    ],
    "periods": [
        [
            "20220101",
            "20221231"
        ]
    ],
    "query": {
        "dimensions": [
            "trended_crawls.period_0.date"
        ],
        "metrics": [
            "trended_crawls.period_0.count_urls_trended_crawls"
        ],
        "filters": {
            "field": "search_console.period_0.url_exists_search_console",
            "predicate": "eq",
            "value": true
        },
        "sort": [
            {
                "type": "dimensions",
                "index": 0,
                "order": "asc"
            }
        ]
    }
}
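
The query above only covers the "at least one impression" part. To restrict the results to slow pages as well, a load-time filter can be combined using an and clause. The sketch below assumes that delay_total is the load-time field backing the average_delay_total metric and uses an arbitrary threshold of 1000 ms; check the Collections Explorer for the exact field available in your project.

{
    "collections": [
        "trended_crawls",
        "search_console"
    ],
    "periods": [
        [
            "20220101",
            "20221231"
        ]
    ],
    "query": {
        "dimensions": [
            "trended_crawls.period_0.date"
        ],
        "metrics": [
            "trended_crawls.period_0.count_urls_trended_crawls"
        ],
        "filters": {
            "and": [
                {
                    "field": "search_console.period_0.url_exists_search_console",
                    "predicate": "eq",
                    "value": true
                },
                {
                    "field": "trended_crawls.period_0.delay_total",
                    "predicate": "gt",
                    "value": 1000
                }
            ]
        },
        "sort": [
            {
                "type": "dimensions",
                "index": 0,
                "order": "asc"
            }
        ]
    }
}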