Export my SiteCrawler data
In this section, we will see how to export one million URLs from your crawl.
1. Get your configuration
We will need 4 pieces of information to run the export. All can be gathered by following the guide in Getting started.
- the username and project_slug, which identify the project we are targeting
- the analysis_slug, which identifies the crawl we are targeting for the export
- your API Token, which is used to identify you
In the rest of the tutorial, we will consider these values:
- username: botify-team
- project_slug: botify-blog
- analysis_slug: 20210205
- API Token: 123abc
2. The BQL query
This is the BQL query that we will run to fetch the crawl data.
For each URL, it fetches its:
- depth
- date when it was crawled
- HTTP code
- noindex status with respect to the meta robots tag
- title, description and first H1 tag
- content quality information (no. of words, similarity score)
- PageRank
- linking information (no. of unique outlinks and inlinks)
- content type
- size in bytes
- load time information (delay to the first received byte and last received byte)
- indexability/compliance information
{
  "collections": ["crawl.20210205"],
  "query": {
    "dimensions": [
      "url",
      "crawl.20210205.depth",
      "crawl.20210205.date_crawled",
      "crawl.20210205.http_code",
      "crawl.20210205.metadata.robots.noindex",
      "crawl.20210205.metadata.title.content",
      "crawl.20210205.metadata.description.content",
      "crawl.20210205.metadata.h1.first",
      "crawl.20210205.content_quality.pct_nearest_similar_page_score",
      "crawl.20210205.content_quality.nb_words_not_ignored",
      "crawl.20210205.content_quality.pct_words_ignored",
      "crawl.20210205.content_quality.nb_words_total",
      "crawl.20210205.content_quality.nb_simscore_pct_50",
      "crawl.20210205.internal_page_rank.value",
      "crawl.20210205.internal_page_rank.raw",
      "crawl.20210205.internal_page_rank.position",
      "crawl.20210205.inlinks_internal.nb.unique",
      "crawl.20210205.inlinks_internal.percentile",
      "crawl.20210205.outlinks_external.nb.unique",
      "crawl.20210205.targeted_device",
      "crawl.20210205.content_type",
      "crawl.20210205.byte_size",
      "crawl.20210205.delay_first_byte",
      "crawl.20210205.delay_last_byte",
      "crawl.20210205.compliant.is_compliant",
      "crawl.20210205.compliant.reason.http_code",
      "crawl.20210205.compliant.reason.content_type",
      "crawl.20210205.compliant.reason.canonical",
      "crawl.20210205.compliant.reason.noindex"
    ],
    "metrics": [],
    "sort": [1]
  }
}
This query should give you a good overview of the first million pages of your crawl.
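Since the analysis slug is repeated in every field name, it can be convenient to generate the query instead of editing it by hand. The following is a minimal Python sketch, not an official client: the helper name build_query and its arguments are hypothetical, and it only reproduces the structure of the query shown above for a shortened list of fields.

```python
import json


def build_query(analysis_slug, fields):
    """Build a BQL query dict shaped like the one in this tutorial.

    `fields` holds names relative to the crawl collection, e.g. "depth".
    This helper is illustrative only; it is not part of any Botify library.
    """
    collection = f"crawl.{analysis_slug}"
    # "url" is written without a collection prefix in the tutorial's query.
    dimensions = ["url"] + [f"{collection}.{name}" for name in fields]
    return {
        "collections": [collection],
        "query": {
            "dimensions": dimensions,
            "metrics": [],
            "sort": [1],
        },
    }


query = build_query("20210205", ["depth", "http_code", "byte_size"])
print(json.dumps(query, indent=2))
```

Changing the first argument is then enough to target another crawl.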
3. Execute the API call
To launch the export, you will need to send an HTTP request to our servers.
You should be able to import the cURL command below into an HTTP tool if you use one.
Use your own configuration
Don't forget to:
- Replace 123abc in --header 'Authorization: Token 123abc' by your own API token value.
- Replace botify-team in "username": "botify-team", by the project's username.
- Replace botify-blog in "project": "botify-blog", by your project slug.
- Replace all 20210205 in "collections": ["crawl.20210205"], and all fields by your analysis slug.
curl --location --request POST 'https://api.botify.com/v1/jobs' \
  --header 'Authorization: Token 123abc' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "job_type": "export",
    "payload": {
      "username": "botify-team",
      "project": "botify-blog",
      "connector": "direct_download",
      "formatter": "csv",
      "export_size": 1000000,
      "query": {
        "collections": ["crawl.20210205"],
        "query": {
          "dimensions": [
            "url",
            "crawl.20210205.depth",
            "crawl.20210205.date_crawled",
            "crawl.20210205.http_code",
            "crawl.20210205.metadata.robots.noindex",
            "crawl.20210205.metadata.title.content",
            "crawl.20210205.metadata.description.content",
            "crawl.20210205.metadata.h1.first",
            "crawl.20210205.content_quality.pct_nearest_similar_page_score",
            "crawl.20210205.content_quality.nb_words_not_ignored",
            "crawl.20210205.content_quality.pct_words_ignored",
            "crawl.20210205.content_quality.nb_words_total",
            "crawl.20210205.content_quality.nb_simscore_pct_50",
            "crawl.20210205.internal_page_rank.value",
            "crawl.20210205.internal_page_rank.raw",
            "crawl.20210205.internal_page_rank.position",
            "crawl.20210205.inlinks_internal.nb.unique",
            "crawl.20210205.inlinks_internal.percentile",
            "crawl.20210205.outlinks_external.nb.unique",
            "crawl.20210205.targeted_device",
            "crawl.20210205.content_type",
            "crawl.20210205.byte_size",
            "crawl.20210205.delay_first_byte",
            "crawl.20210205.delay_last_byte",
            "crawl.20210205.compliant.is_compliant",
            "crawl.20210205.compliant.reason.http_code",
            "crawl.20210205.compliant.reason.content_type",
            "crawl.20210205.compliant.reason.canonical",
            "crawl.20210205.compliant.reason.noindex"
          ],
          "metrics": [],
          "sort": [1]
        }
      }
    }
  }'
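If you prefer a script to an HTTP tool, the same request can be prepared with Python's standard library. This is only a sketch under assumptions: the function name build_export_request is hypothetical, the query is shortened to two dimensions for readability, and the endpoint, headers and payload keys are copied from the cURL command above.

```python
import json
import urllib.request

API_URL = "https://api.botify.com/v1/jobs"  # endpoint used in the cURL command


def build_export_request(token, username, project_slug, bql_query,
                         export_size=1000000):
    """Prepare (without sending) the POST request that creates an export job."""
    body = {
        "job_type": "export",
        "payload": {
            "username": username,
            "project": project_slug,
            "connector": "direct_download",
            "formatter": "csv",
            "export_size": export_size,
            "query": bql_query,
        },
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Shortened query for illustration; use the full list of dimensions in practice.
short_query = {
    "collections": ["crawl.20210205"],
    "query": {
        "dimensions": ["url", "crawl.20210205.depth"],
        "metrics": [],
        "sort": [1],
    },
}
request = build_export_request("123abc", "botify-team", "botify-blog", short_query)

# To actually send it (requires network access and a valid token):
# with urllib.request.urlopen(request) as response:
#     job = json.load(response)
```

Building the request separately from sending it makes it easy to inspect the body before any call is made.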
If the export was launched correctly, you should get a response like the following, where payload is shown in full instead of {...}:
{
  "job_id": 99999,
  "job_type": "export",
  "job_url": "/v1/jobs/99999",
  "job_status": "CREATED",
  "payload": {...},
  "results": null,
  "date_created": "2021-03-15T16:45:48.110189Z",
  "user": "botify-team",
  "metadata": null
}
If the job_status is CREATED, the job was created successfully 🎉
The information you will need here is the job_id: 99999.
We will use it to fetch the job's status.
4. Fetch the job status
Now that the job is in the pipeline, we will fetch its status until it is done.
For more details, see Export job reference.
We will send a GET request using the job_id from the previous response.
curl --location --request GET 'https://api.botify.com/v1/jobs/99999' \
  --header 'Authorization: Token 123abc'
This will return something like:
{
  "job_id": 99999,
  "job_type": "export",
  "job_url": "/v1/jobs/99999",
  "job_status": "DONE",
  "results": {
    "nb_lines": 956,
    "download_url": "https://d121xa69ioyktv.cloudfront.net/collection_exports/a/b/c/abcdefghik987654321/botify-2021-03-15.csv.gz"
  },
  "date_created": "2021-03-15T16:45:48.110189Z",
  "payload": {...},
  "user": "botify-team",
  "metadata": null
}
If the job_status is PROCESSING, wait a bit and run the same request until the status switches to DONE.
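The "wait a bit and retry" step can be automated with a small polling loop. The sketch below is one possible approach, not a prescribed one: fetch_job and wait_for_job are hypothetical names, the 10-second interval and the attempt limit are arbitrary, and only the CREATED, PROCESSING and DONE statuses mentioned in this tutorial are handled. Any other status is treated as unexpected.

```python
import json
import time
import urllib.request


def fetch_job(job_id, token):
    """GET the job description, as in the cURL command above."""
    request = urllib.request.Request(
        f"https://api.botify.com/v1/jobs/{job_id}",
        headers={"Authorization": f"Token {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_job(fetch, interval_seconds=10.0, max_attempts=60):
    """Call `fetch()` repeatedly until the returned job has status DONE.

    `fetch` is any zero-argument callable returning the job as a dict,
    which keeps the loop independent from the HTTP layer.
    """
    for _ in range(max_attempts):
        job = fetch()
        status = job["job_status"]
        if status == "DONE":
            return job
        if status not in ("CREATED", "PROCESSING"):
            # Statuses other than these are not described in this tutorial.
            raise RuntimeError(f"unexpected job status: {status}")
        time.sleep(interval_seconds)
    raise TimeoutError(f"job not done after {max_attempts} attempts")
```

With the values of this tutorial, a call could look like wait_for_job(lambda: fetch_job(99999, "123abc")), after which the download_url is available under the results key of the returned dict.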
5. Fetch the results
Once the job is done, the results object will have a download_url field. The URL links directly to your exported SEO data. Download it by accessing the given link.
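The download can also be scripted. Here is a minimal sketch using only Python's standard library; the function name download_export and the destination file name are hypothetical, and it assumes the download_url can be fetched with a plain GET request without extra headers.

```python
import shutil
import urllib.request


def download_export(download_url, destination):
    """Stream the file behind `download_url` to the local path `destination`."""
    with urllib.request.urlopen(download_url) as response, \
            open(destination, "wb") as output:
        # Copy in chunks so a large export is never held fully in memory.
        shutil.copyfileobj(response, output)
    return destination
```

For example, download_export(job["results"]["download_url"], "export.csv.gz") would save the file next to your script, where job is the parsed response of the status request.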
6. Extract the result
Once the file is downloaded, you may notice that its name ends with .csv.gz: the data is gzip-compressed. Software on your operating system should be able to extract the CSV file.
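The file can also be read without extracting it first. The sketch below assumes the export is a UTF-8, comma-separated CSV whose first line holds the column names; the helper name iter_export_rows is hypothetical, and the column names in the sample data are made up for the demonstration.

```python
import csv
import gzip
import io


def iter_export_rows(binary_stream):
    """Yield each CSV row of a gzip-compressed export as a dict.

    `binary_stream` is a file object opened in binary mode, for example
    open("export.csv.gz", "rb").
    """
    with gzip.open(binary_stream, mode="rt", encoding="utf-8", newline="") as text:
        yield from csv.DictReader(text)


# Small in-memory sample standing in for a downloaded file.
sample = gzip.compress(b"url,depth\nhttps://example.com/,0\n")
rows = list(iter_export_rows(io.BytesIO(sample)))
print(rows)
```

Iterating row by row avoids loading a million-line export into memory at once.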
For more data export options, see Export your data and its subsections. Existing options include connecting this kind of export directly to your storage system through a connector.