Export my SiteCrawler data

In this section, we will see how to export up to one million URLs from your crawl.

1. Get your configuration

We will need four pieces of information to run the export. All of them can be gathered by following the guide in Getting started.

  • the username and project_slug, which identify the project we are targeting
  • the analysis_slug, which identifies the crawl we are exporting
  • your API Token, which is used to identify you

In the rest of the tutorial, we will consider these values:

  • username: botify-team
  • project_slug: botify-blog
  • analysis_slug: 20210205
  • API Token: 123abc
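
If you plan to script the calls, it can help to keep these four values in one place. The following is only a sketch in Python: the ExportConfig name and the BOTIFY_API_TOKEN environment variable are assumptions made for this example and are not part of the Botify API. The Authorization header format is the one used by the cURL commands in this tutorial.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ExportConfig:
    """The four values needed to run an export (class name is illustrative)."""

    username: str
    project_slug: str
    analysis_slug: str
    api_token: str

    def auth_headers(self) -> dict:
        # Header format taken from the cURL examples in this tutorial.
        return {
            "Authorization": f"Token {self.api_token}",
            "Content-Type": "application/json",
        }


# Example values from this tutorial. The environment variable name is
# hypothetical; it only avoids hard-coding a real token in a script.
config = ExportConfig(
    username="botify-team",
    project_slug="botify-blog",
    analysis_slug="20210205",
    api_token=os.environ.get("BOTIFY_API_TOKEN", "123abc"),
)
```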

2. The BQL query

This section presents the BQL query that we will run to fetch crawl data.
For each URL, this query will fetch its:

  • depth
  • date when it was crawled
  • HTTP code
  • noindex status from the robots metadata (metadata.robots.noindex)
  • title, description and first H1 tag
  • content quality information (no. of words, similarity score)
  • PageRank
  • linking information (no. of unique internal inlinks and unique external outlinks)
  • content type
  • size in bytes
  • load time information (delay to the first received byte and last received byte)
  • indexability/compliance information

{
  "collections": ["crawl.20210205"],
  "query": {
    "dimensions": [
      "url",
      "crawl.20210205.depth",
      "crawl.20210205.date_crawled",
      "crawl.20210205.http_code",
      "crawl.20210205.metadata.robots.noindex",
      "crawl.20210205.metadata.title.content",
      "crawl.20210205.metadata.description.content",
      "crawl.20210205.metadata.h1.first",
      "crawl.20210205.content_quality.pct_nearest_similar_page_score",
      "crawl.20210205.content_quality.nb_words_not_ignored",
      "crawl.20210205.content_quality.pct_words_ignored",
      "crawl.20210205.content_quality.nb_words_total",
      "crawl.20210205.content_quality.nb_simscore_pct_50",
      "crawl.20210205.internal_page_rank.value",
      "crawl.20210205.internal_page_rank.raw",
      "crawl.20210205.internal_page_rank.position",
      "crawl.20210205.inlinks_internal.nb.unique",
      "crawl.20210205.inlinks_internal.percentile",
      "crawl.20210205.outlinks_external.nb.unique",
      "crawl.20210205.targeted_device",
      "crawl.20210205.content_type",
      "crawl.20210205.byte_size",
      "crawl.20210205.delay_first_byte",
      "crawl.20210205.delay_last_byte",
      "crawl.20210205.compliant.is_compliant",
      "crawl.20210205.compliant.reason.http_code",
      "crawl.20210205.compliant.reason.content_type",
      "crawl.20210205.compliant.reason.canonical",
      "crawl.20210205.compliant.reason.noindex"
    ],
    "metrics": [],
    "sort": [1]
  }
}

This query should give you a good overview of your crawl, covering up to its first million pages.
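
Because the analysis slug appears in the collection name and in every field name, it may be convenient to generate the query instead of editing it by hand. The sketch below rebuilds the same query for any slug. The build_query and CRAWL_FIELDS names are hypothetical helpers for this example; the field names themselves are copied from the query above.

```python
import json

# Field names copied from the query above, without the
# "crawl.<analysis_slug>." prefix.
CRAWL_FIELDS = [
    "depth", "date_crawled", "http_code",
    "metadata.robots.noindex", "metadata.title.content",
    "metadata.description.content", "metadata.h1.first",
    "content_quality.pct_nearest_similar_page_score",
    "content_quality.nb_words_not_ignored",
    "content_quality.pct_words_ignored",
    "content_quality.nb_words_total",
    "content_quality.nb_simscore_pct_50",
    "internal_page_rank.value", "internal_page_rank.raw",
    "internal_page_rank.position",
    "inlinks_internal.nb.unique", "inlinks_internal.percentile",
    "outlinks_external.nb.unique",
    "targeted_device", "content_type", "byte_size",
    "delay_first_byte", "delay_last_byte",
    "compliant.is_compliant", "compliant.reason.http_code",
    "compliant.reason.content_type", "compliant.reason.canonical",
    "compliant.reason.noindex",
]


def build_query(analysis_slug: str) -> dict:
    """Return the BQL query of this tutorial for the given analysis slug."""
    collection = f"crawl.{analysis_slug}"
    dimensions = ["url"] + [f"{collection}.{field}" for field in CRAWL_FIELDS]
    return {
        "collections": [collection],
        "query": {"dimensions": dimensions, "metrics": [], "sort": [1]},
    }


if __name__ == "__main__":
    print(json.dumps(build_query("20210205"), indent=2))
```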

3. Execute the API call

To launch the export, you will need to send the HTTP request below to our servers.
If you use an HTTP tool, you should be able to import the cURL command into it.

🚧 Use your own configuration

Don't forget to replace the example values with your own:

  • in --header 'Authorization: Token 123abc', replace 123abc with your own API token
  • in "username": "botify-team", replace botify-team with the project's username
  • in "project": "botify-blog", replace botify-blog with your project slug
  • in "collections": ["crawl.20210205"] and in every field name, replace 20210205 with your analysis slug

curl --location --request POST 'https://api.botify.com/v1/jobs' \
--header 'Authorization: Token 123abc' \
--header 'Content-Type: application/json' \
--data-raw '{
  "job_type": "export",
  "payload": {
    "username": "botify-team",
    "project": "botify-blog",
    "connector": "direct_download",
    "formatter": "csv",
    "export_size": 1000000,
    "query": {
      "collections": ["crawl.20210205"],
      "query": {
        "dimensions": [
          "url",
          "crawl.20210205.depth",
          "crawl.20210205.date_crawled",
          "crawl.20210205.http_code",
          "crawl.20210205.metadata.robots.noindex",
          "crawl.20210205.metadata.title.content",
          "crawl.20210205.metadata.description.content",
          "crawl.20210205.metadata.h1.first",
          "crawl.20210205.content_quality.pct_nearest_similar_page_score",
          "crawl.20210205.content_quality.nb_words_not_ignored",
          "crawl.20210205.content_quality.pct_words_ignored",
          "crawl.20210205.content_quality.nb_words_total",
          "crawl.20210205.content_quality.nb_simscore_pct_50",
          "crawl.20210205.internal_page_rank.value",
          "crawl.20210205.internal_page_rank.raw",
          "crawl.20210205.internal_page_rank.position",
          "crawl.20210205.inlinks_internal.nb.unique",
          "crawl.20210205.inlinks_internal.percentile",
          "crawl.20210205.outlinks_external.nb.unique",
          "crawl.20210205.targeted_device",
          "crawl.20210205.content_type",
          "crawl.20210205.byte_size",
          "crawl.20210205.delay_first_byte",
          "crawl.20210205.delay_last_byte",
          "crawl.20210205.compliant.is_compliant",
          "crawl.20210205.compliant.reason.http_code",
          "crawl.20210205.compliant.reason.content_type",
          "crawl.20210205.compliant.reason.canonical",
          "crawl.20210205.compliant.reason.noindex"
        ],
        "metrics": [],
        "sort": [1]
      }
    }
  }
}'
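
For readers who prefer a script to cURL, the same request can be sketched in Python with the standard library only. The function names build_export_request and launch_export are hypothetical; the URL, headers and payload keys are the ones from the cURL command above. The query argument is expected to be the BQL query from step 2.

```python
import json
import urllib.request

JOBS_URL = "https://api.botify.com/v1/jobs"


def build_export_request(username, project_slug, api_token, query,
                         export_size=1_000_000):
    """Build (without sending) the POST request that creates the export job."""
    body = {
        "job_type": "export",
        "payload": {
            "username": username,
            "project": project_slug,
            "connector": "direct_download",
            "formatter": "csv",
            "export_size": export_size,
            "query": query,
        },
    }
    return urllib.request.Request(
        JOBS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Token {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def launch_export(request):
    """Send the request and return the decoded JSON response."""
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping the construction of the request separate from sending it makes it possible to inspect the body (for example with json.loads(request.data)) before anything reaches the API.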

If the export was launched correctly, you should get a response like this:

{
    "job_id": 99999,
    "job_type": "export",
    "job_url": "/v1/jobs/99999",
    "job_status": "CREATED",
    "payload": {...},
    "results": null,
    "date_created": "2021-03-15T16:45:48.110189Z",
    "user": "botify-team",
    "metadata": null
}

In the actual response, the payload field contains the full payload you submitted (shown here as {...}).
If the job_status is CREATED, the job was created successfully 🎉

The information you will need here is the job_id: 99999.
We will use it to fetch the job's status.
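
As an illustration, the status URL can be derived from the creation response rather than typed by hand. This sketch assumes that job_url is always a path relative to the API host used in the cURL commands of this tutorial; job_status_url is a hypothetical helper name.

```python
from urllib.parse import urljoin

API_ROOT = "https://api.botify.com"


def job_status_url(job: dict) -> str:
    """Join the API host with the relative job_url from the response."""
    return urljoin(API_ROOT, job["job_url"])


# Trimmed copy of the example response above.
created = {
    "job_id": 99999,
    "job_type": "export",
    "job_url": "/v1/jobs/99999",
    "job_status": "CREATED",
}
print(job_status_url(created))
```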

4. Fetch the job status

Now that the job is in the pipeline, we will fetch its status until it is done.
For more details, see Export job reference.

We will send a GET request using the job_id from the previous response.

curl --location --request GET 'https://api.botify.com/v1/jobs/99999' \
--header 'Authorization: Token 123abc'

This will return something like:

{
    "job_id": 99999,
    "job_type": "export",
    "job_url": "/v1/jobs/99999",
    "job_status": "DONE",
    "results": {
        "nb_lines": 956,
        "download_url": "https://d121xa69ioyktv.cloudfront.net/collection_exports/a/b/c/abcdefghik987654321/botify-2021-03-15.csv.gz"
    },
    "date_created": "2021-03-15T16:45:48.110189Z",
    "payload": {...},
    "user": "botify-team",
    "metadata": null
}

If the job_status is PROCESSING, wait a bit and run the same request until the status switches to DONE.
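
The "wait and retry" step can be automated with a small polling loop. The sketch below is one possible approach, not an official client: fetch_job and wait_for_job are hypothetical names, and the 10-second interval and attempt limit are arbitrary choices. Only the CREATED, PROCESSING and DONE statuses appear in this tutorial, so the loop treats any other value as unexpected and stops.

```python
import json
import time
import urllib.request


def fetch_job(job_id, api_token):
    """GET the job description, as in the cURL command above."""
    request = urllib.request.Request(
        f"https://api.botify.com/v1/jobs/{job_id}",
        headers={"Authorization": f"Token {api_token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_job(fetch, interval_seconds=10.0, max_attempts=360,
                 sleep=time.sleep):
    """Call fetch() until the job is DONE and return the final response.

    fetch is any zero-argument callable returning the job as a dict.
    """
    for _ in range(max_attempts):
        job = fetch()
        status = job["job_status"]
        if status == "DONE":
            return job
        if status not in ("CREATED", "PROCESSING"):
            # Statuses other than these are not covered by this tutorial.
            raise RuntimeError(f"unexpected job_status: {status!r}")
        sleep(interval_seconds)
    raise TimeoutError("the job did not reach DONE in time")
```

With the example values, a call would look like wait_for_job(lambda: fetch_job(99999, "123abc")). Passing the fetch function as an argument keeps the loop independent of the network, which makes it easy to try out with canned responses.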

5. Fetch the results

Once the job is done, the results object will have a download_url field. The URL links directly to your exported SEO data. Download it by accessing the given link.
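
If you want to script the download as well, a minimal sketch could look like this. The helper names are hypothetical, and the local file name is simply taken from the last segment of download_url; the example response in step 4 shows no request headers for this URL, so none are sent here.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def export_filename(download_url: str) -> str:
    """Return the last path segment of the URL, e.g. 'botify-2021-03-15.csv.gz'."""
    return os.path.basename(urlparse(download_url).path)


def download_export(job: dict, dest_dir: str = ".") -> str:
    """Stream the exported file to dest_dir and return the local path."""
    url = job["results"]["download_url"]
    dest = os.path.join(dest_dir, export_filename(url))
    # copyfileobj streams in chunks, so large exports are not held in memory.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```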

6. Extract the result

Once the file is downloaded, you will notice that its name ends with .csv.gz: the data is gzip-compressed. Software on your operating system should be able to extract the CSV file.
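
The file can also be read without extracting it first. The sketch below opens the .csv.gz file directly in Python; it assumes the export is UTF-8 encoded and comma-separated, which this tutorial does not state explicitly. read_export is a hypothetical helper name.

```python
import csv
import gzip


def read_export(path, limit=None):
    """Return the rows of a gzip-compressed CSV file as lists of strings.

    limit caps the number of rows read, which is handy for a quick
    look at a large export.
    """
    rows = []
    # "rt" decompresses and decodes in one step; newline="" is what the
    # csv module expects from the file object it reads.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        for index, row in enumerate(csv.reader(handle)):
            if limit is not None and index >= limit:
                break
            rows.append(row)
    return rows
```

For example, read_export("botify-2021-03-15.csv.gz", limit=5) would return the first five lines of the downloaded file.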

For more data export options, see Export your data and its subsections. One existing option is to connect this kind of export directly to your storage system through a connector.