Trend Metrics Over Several Crawls

Why do we need a dedicated collection?

Each crawl lives in its own BQL collection. For instance, crawl.20230131 targets the crawl of January 31, 2023, while crawl.20221231 targets the crawl of December 31, 2022.

If we want to track the number of 404s per crawl, for example, we could make the following query:

{
    "collections": [
        "crawl.20230131",
        "crawl.20221231"
    ],
    "periods": [],
    "query": {
        "dimensions": [],
        "metrics": [
            {
                "field": "crawl.20230131.count_urls_crawl",
                "filters": {
                    "field": "crawl.20230131.http_code",
                    "predicate": "eq",
                    "value": 404
                }
            },
            {
                "field": "crawl.20221231.count_urls_crawl",
                "filters": {
                    "field": "crawl.20221231.http_code",
                    "predicate": "eq",
                    "value": 404
                }
            }
        ]
    }
}

It works, but:

  • It makes use of several collections that must be queried independently.
  • We have no choice but to use metric filters so that each metric is filtered using the corresponding crawl collection.

To retrieve the number of 404s for the last 20 crawls, the BQL query will become very cumbersome to write with 20 collections involved, and it will also be very inefficient to process.

For this reason, we created a dedicated collection to easily and efficiently trend metrics over several crawls. It is called trended_crawls.

Content of the trended_crawls Collection

Dimensions

All the dimensions available in the individual crawl collections are also available in trended_crawls: depth, http_code, content_type, byte_size, indexable.is_indexable, etc.

There is one extra dimension: date. Like all timestamped collections, the trended_crawls collection has a date dimension, which corresponds to the crawl date. For example, data coming from the crawl of January 31, 2023 is registered with the date 20230131.

Metrics

Each crawl collection contains a single metric, count_urls_crawl (Number of URLs on Crawl). The corresponding metric in trended_crawls is called count_urls_trended_crawls and behaves in exactly the same way.

We also offer several metrics that can only be found in trended_crawls. Below are a few examples:

Name                                                | Field                         | Type
No. of Indexable Pages                              | count_indexable               | Integer
No. of Pages at Level N                             | count_depth_N                 | Integer
Avg. HTML Load Time                                 | average_delay_total           | Float
No. of Pages with Nofollow Meta Tag                 | count_nofollow_meta_tag       | Integer
No. of Pages with Internal Outlinks with 5xx Error | count_url_outlinks_errors_5xx | Integer

Please use the Collections Explorer to get the full list of metrics available in trended_crawls.
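
As an illustration, here is a minimal sketch of a query returning the number of indexable pages per crawl date over 2022, using the count_indexable metric from the table above (the period and field syntax are detailed in the Examples of Queries section below):

{
    "collections": [
        "trended_crawls"
    ],
    "periods": [
        [
            "20220101",
            "20221231"
        ]
    ],
    "query": {
        "dimensions": ["trended_crawls.period_0.date"],
        "metrics": ["trended_crawls.period_0.count_indexable"]
    }
}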

Segmentation

Each crawl can have its own segmentation, but for the segment fields to make sense in trended_crawls, a common segmentation is needed. For this reason, all crawls inside trended_crawls are segmented using the most up-to-date segmentation, that is, the segmentation currently defined in your Segment Editor.
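
For example, a sketch of a query breaking down the number of crawled URLs per crawl date and per segment could look like the following. The field segments.pagetype.value is only an assumption here: the actual segment field depends on the segmentation defined in your Segment Editor.

{
    "collections": [
        "trended_crawls"
    ],
    "periods": [
        [
            "20220101",
            "20221231"
        ]
    ],
    "query": {
        "dimensions": [
            "trended_crawls.period_0.date",
            "trended_crawls.period_0.segments.pagetype.value"
        ],
        "metrics": ["trended_crawls.period_0.count_urls_trended_crawls"]
    }
}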

Examples of Queries

No. of 404s Over Time

Let's rewrite the query of the first section, but this time using trended_crawls and querying all crawls of 2022:

{
    "collections": [
        "trended_crawls"
    ],
    "periods": [
        [
            "20220101",
            "20221231"
        ]
    ],
    "query": {
        "dimensions": ["trended_crawls.period_0.date"],
        "metrics": ["trended_crawls.period_0.count_urls_trended_crawls"],
        "filters": {
            "field": "trended_crawls.period_0.http_code",
            "predicate": "eq",
            "value": 404
        },
        "sort": [
            {
	              "type": "dimensions",
                "index": 0,
                "order": "asc"
            }
        ]
    }
}

This query is now simpler, with only one collection and one filter, and more powerful: thanks to the targeted period, we query all crawls of 2022 at once. The sort clause is optional; it is used here to sort the results in ascending order of date. Here is a sample of the results:

{
    "results": [
        {
            "dimensions": [
                "2022-01-01"
            ],
            "metrics": [
                1645
            ]
        },
        {
            "dimensions": [
                "2022-02-01"
            ],
            "metrics": [
                551
            ]
        },
        ...
        {
            "dimensions": [
                "2022-12-01"
            ],
            "metrics": [
	            498
            ]
        },
    ],
    ...
}

No. of Slow URLs with at Least One Organic Impression

The date dimension is common to all timestamped collections, so we can effortlessly cross the crawl data of each day with the data of search_console, visits, or any other timestamped collection for the same day.

The following query counts, for each crawl date, the crawled URLs that also appear in Google Search Console data for the corresponding day (via the url_exists_search_console filter), i.e. URLs that received at least one organic impression. A sketch adding the load-time condition for slow pages is shown after the query:

{
    "collections": [
        "trended_crawls",
        "search_console"
    ],
    "periods": [
        [
            "20220101",
            "20221231"
        ]
    ],
    "query": {
        "dimensions": [
            "trended_crawls.period_0.date"
        ],
        "metrics": [
            "trended_crawls.period_0.count_urls_trended_crawls"
        ],
        "filters": {
            "field": "search_console.period_0.url_exists_search_console",
            "predicate": "eq",
            "value": true
        },
        "sort": [
            {
                "type": "dimensions",
                "index": 0,
                "order": "asc"
            }
        ]
    }
}
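
The query above only covers the "at least one impression" part. To restrict the results to slow pages as well, a load-time filter can be combined using an and clause. The sketch below assumes that delay_total is the load-time field backing the average_delay_total metric and uses an arbitrary threshold of 1000 ms; check the Collections Explorer for the exact field available in your project.

{
    "collections": [
        "trended_crawls",
        "search_console"
    ],
    "periods": [
        [
            "20220101",
            "20221231"
        ]
    ],
    "query": {
        "dimensions": [
            "trended_crawls.period_0.date"
        ],
        "metrics": [
            "trended_crawls.period_0.count_urls_trended_crawls"
        ],
        "filters": {
            "and": [
                {
                    "field": "search_console.period_0.url_exists_search_console",
                    "predicate": "eq",
                    "value": true
                },
                {
                    "field": "trended_crawls.period_0.delay_total",
                    "predicate": "gt",
                    "value": 1000
                }
            ]
        },
        "sort": [
            {
                "type": "dimensions",
                "index": 0,
                "order": "asc"
            }
        ]
    }
}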