Trend Metrics Over Several Crawls
Why do we need a dedicated collection?
Each crawl lives in its own BQL collection. For instance, crawl.20230131
targets the crawl of January 31, 2023, while crawl.20221231
targets the crawl of December 31, 2022.
If we want to track the number of 404s per crawl, for example, we could make the following query:
{
"collections": [
"crawl.20230131",
"crawl.20231231"
],
"periods": [],
"query": {
"dimensions": [],
"metrics": [
{
"field": "crawl.20230131.count_urls_crawl",
"filters": {
"field": "crawl.20230101.http_code",
"predicate": "eq",
"value": 404
}
},
{
"field": "crawl.20231231.count_urls_crawl",
"filters": {
"field": "crawl.20231231.http_code",
"predicate": "eq",
"value": 404
}
}
]
}
}
It works, but:
- It makes use of several collections that must be queried independently.
- We have no choice but to use metric filters so that each metric is filtered using the corresponding crawl collection.
To retrieve the number of 404s for the last 20 crawls, the BQL query will become very cumbersome to write with 20 collections involved, and it will also be very inefficient to process.
For this reason, we created a dedicated collection to easily and efficiently trend metrics over several crawls. It is called trended_crawls.
Content of the trended_crawls Collection
Dimensions
All the dimensions available in the individual crawl collections are also available in trended_crawls: depth, http_code, content_type, byte_size, indexable.is_indexable, etc.
There is one extra dimension: date. Like all timestamped collections, the trended_crawls collection has a date dimension, which corresponds to the crawl date. For example, data coming from the crawl of January 31, 2023 is registered with the date 20230131.
Metrics
Inside the crawl collection, there is only one metric, which is count_urls_crawl (Number of URLs on Crawl). The corresponding metric in trended_crawls is called count_urls_trended_crawls and has exactly the same behavior.
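To illustrate, the sketch below writes a BQL payload as a Python dict, in the same shape as the queries on this page: it counts crawled URLs per crawl date and per HTTP status code. All field names appear on this page; the period is only an example.

```python
import json

# BQL payload counting crawled URLs per crawl date and per HTTP status code.
# All field names come from this page; the period is only an example.
payload = {
    "collections": ["trended_crawls"],
    "periods": [["20220101", "20221231"]],
    "query": {
        "dimensions": [
            "trended_crawls.period_0.date",
            "trended_crawls.period_0.http_code",
        ],
        "metrics": ["trended_crawls.period_0.count_urls_trended_crawls"],
        "sort": [{"type": "dimensions", "index": 0, "order": "asc"}],
    },
}

print(json.dumps(payload, indent=2))
```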
We also offer several metrics that can only be found in trended_crawls. Below are a few examples:
| Name | Field | Type |
| --- | --- | --- |
| No. of Indexable Pages | count_indexable | Integer |
| No. of Pages at Level N | count_depth_N | Integer |
| Avg. HTML Load Time | average_delay_total | Float |
| No. of Pages with Nofollow Meta Tag | count_nofollow_meta_tag | Integer |
| No. of Pages with Internal Outlinks with 5xx Error | count_url_outlinks_errors_5xx | Integer |
Please use the Collections Explorer to get the full list of metrics available in trended_crawls.
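These trended-only metrics are queried per crawl date in exactly the same way as count_urls_trended_crawls. The sketch below (a BQL payload written as a Python dict) trends the number of indexable pages and the average HTML load time; the field names come from the table above, and the period is only an example.

```python
# BQL payload trending two trended_crawls-only metrics per crawl date.
# count_indexable and average_delay_total come from the table above;
# the period is only an example.
payload = {
    "collections": ["trended_crawls"],
    "periods": [["20220101", "20221231"]],
    "query": {
        "dimensions": ["trended_crawls.period_0.date"],
        "metrics": [
            "trended_crawls.period_0.count_indexable",
            "trended_crawls.period_0.average_delay_total",
        ],
        "sort": [{"type": "dimensions", "index": 0, "order": "asc"}],
    },
}
```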
Segmentation
Each crawl can have its own segmentation, but for the segment fields to make sense in trended_crawls, we must work with a common segmentation. For this reason, all crawls inside trended_crawls are segmented using the most up-to-date segmentation, that is, the segmentation currently defined in your Segment Editor.
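This means segment fields can be used as breakdown dimensions across crawls. The sketch below assumes a hypothetical segmentation named pagetype, with a segment field of the form segments.pagetype.value; the actual field path depends on your own segmentation and can be checked in the Collections Explorer.

```python
# Number of crawled URLs per crawl date and per segment value.
# "segments.pagetype.value" is a hypothetical segment field: both the
# segmentation name ("pagetype") and the exact field path depend on the
# segmentation defined in your Segment Editor.
payload = {
    "collections": ["trended_crawls"],
    "periods": [["20220101", "20221231"]],
    "query": {
        "dimensions": [
            "trended_crawls.period_0.date",
            "trended_crawls.period_0.segments.pagetype.value",  # hypothetical
        ],
        "metrics": ["trended_crawls.period_0.count_urls_trended_crawls"],
    },
}
```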
Examples of Queries
No. of 404s Over Time
Let's rewrite the query of the first section, but this time using trended_crawls and querying all crawls of 2022:
{
"collections": [
"trended_crawls"
],
"periods": [
[
"20220101",
"20221231"
]
],
"query": {
"dimensions": ["trended_crawls.period_0.date"],
"metrics": ["trended_crawls.period_0.count_urls_trended_crawls"],
"filters": {
"field": "trended_crawls.period_0.http_code",
"predicate": "eq",
"value": 404
},
"sort": [
{
"type": "dimensions",
"index": 0,
"order": "asc"
}
]
}
}
This query is simpler, with only one collection and one filter, while also being more powerful: thanks to the targeted period, we are querying all crawls of 2022 at once. The sort clause is optional; it is used here to sort the results by ascending date. Here is a sample of the request results:
{
"results": [
{
"dimensions": [
"2022-01-01"
],
"metrics": [
1645
]
},
{
"dimensions": [
"2022-02-01"
],
"metrics": [
551
]
},
...
{
"dimensions": [
"2022-12-01"
],
"metrics": [
498
]
}
],
...
}
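To run this query programmatically, the payload can be sent to the project's BQL endpoint. The sketch below is a minimal example that assumes the https://api.botify.com/v1/projects/{organization}/{project}/query endpoint and token authentication; replace the placeholders with your own values and check the API reference if your setup differs.

```python
import requests

# Placeholders: replace with your own credentials and project identifiers.
API_TOKEN = "YOUR_API_TOKEN"
ORG = "your-organization"
PROJECT = "your-project"

# The 404-trend query shown above, expressed as a Python dict.
payload = {
    "collections": ["trended_crawls"],
    "periods": [["20220101", "20221231"]],
    "query": {
        "dimensions": ["trended_crawls.period_0.date"],
        "metrics": ["trended_crawls.period_0.count_urls_trended_crawls"],
        "filters": {
            "field": "trended_crawls.period_0.http_code",
            "predicate": "eq",
            "value": 404,
        },
        "sort": [{"type": "dimensions", "index": 0, "order": "asc"}],
    },
}

# Assumed BQL endpoint and token authentication; check the API reference
# if your setup differs.
url = f"https://api.botify.com/v1/projects/{ORG}/{PROJECT}/query"
headers = {"Authorization": f"Token {API_TOKEN}"}

response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()
for row in response.json().get("results", []):
    print(row["dimensions"][0], row["metrics"][0])
```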
No. of Slow URLs with at Least One Click
The date dimension is common to all timestamped collections. We can effortlessly cross the crawl data of each day with the data of search_console, visits, or any other timestamped collection for the same day.
The following query counts, for each crawl day, the crawled URLs that have Search Console data (at least one organic impression) for that same day. The sketch after the query shows how to combine this with a load-time condition to keep only slow URLs:
{
"collections": [
"trended_crawls",
"search_console"
],
"periods": [
[
"20220101",
"20221231"
]
],
"query": {
"dimensions": [
"trended_crawls.period_0.date"
],
"metrics": [
"trended_crawls.period_0.count_urls_trended_crawls"
],
"filters": {
"field": "search_console.period_0.url_exists_search_console",
"predicate": "eq",
"value": true
},
"sort": [
{
"type": "dimensions",
"index": 0,
"order": "asc"
}
]
}
}
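As written, the filter above only keeps URLs that have Search Console data for the corresponding day. To restrict the count to slow URLs as well, the Search Console condition can be combined with a load-time condition using an and filter. The sketch below assumes a per-URL load-time field named delay_total (named by analogy with the average_delay_total metric) and an arbitrary 2000 ms threshold; check the Collections Explorer for the exact field available in your project.

```python
# Extension of the query above: keep only slow URLs with Search Console data.
# "delay_total" is an assumed per-URL load-time field and 2000 ms is an
# arbitrary threshold; verify both in the Collections Explorer.
payload = {
    "collections": ["trended_crawls", "search_console"],
    "periods": [["20220101", "20221231"]],
    "query": {
        "dimensions": ["trended_crawls.period_0.date"],
        "metrics": ["trended_crawls.period_0.count_urls_trended_crawls"],
        "filters": {
            "and": [
                {
                    "field": "search_console.period_0.url_exists_search_console",
                    "predicate": "eq",
                    "value": True,
                },
                {
                    # Assumed load-time field and threshold (in milliseconds).
                    "field": "trended_crawls.period_0.delay_total",
                    "predicate": "gte",
                    "value": 2000,
                },
            ]
        },
        "sort": [{"type": "dimensions", "index": 0, "order": "asc"}],
    },
}
```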