Downloading an AWS Glacier archive, step by step

Glacier is Amazon's low-cost cloud storage service. Files uploaded to AWS Glacier are called archives, and archives are organized in vaults. To upload a file to AWS Glacier, you first create a vault and then upload your files into it. Vault names are important, as many commands (I am talking here about the AWS CLI) take the vault name as an argument. For a vault you can see the total size (all the uploaded archives combined) and the number of archives it contains, but nothing more, no details about the individual archives. AWS Glacier maintains an index of the archives in every vault, but it can only be accessed by initiating an inventory retrieval job, which can take several hours.

In this article I will show, step by step, how you can download an archive from AWS Glacier. First, you need to find the name of the vault that holds your archive. Then you will download the index of the vault and, hopefully, identify the archive you want. Finally, you will download the archive in multiple chunks and combine them. So, let's get started.

Step 1 – Listing the vaults

There is no way to search for your archive; you need to know which vault it is in. A good idea is to maintain your own index of the files uploaded to AWS Glacier. The index could contain the name of the file, a description, the archive id (generated during upload) and whatever else you need to identify your file. To list your vaults, use the list-vaults command.

# the command
$ aws glacier list-vaults --account-id -

# the output
{
  "VaultList": [
    {
      "SizeInBytes": 9102027675,
      "VaultARN": "arn:aws:glacier:eu-west-1:502267884597:vaults/photos",
      "LastInventoryDate": "2017-01-16T08:06:17.699Z",
      "VaultName": "photos",
      "NumberOfArchives": 2,
      "CreationDate": "2017-01-15T10:59:58.303Z"
    }
  ]
}

The command will print all the vaults associated with your AWS account. Hopefully you can identify the vault in which your archive is located. Note down the name of the vault as it will be used in many of the following steps.

Step 2 – Initiating inventory download

Having the vault name, you can initiate an inventory retrieval job. You need it to identify the archive; more precisely, you need the archive id. If you already have it (in your own index), you can skip this and the next step. Run the command and wait. Alternatively, you can create an SNS topic that will be notified when the job has finished.

# the command
$ aws glacier initiate-job \
        --account-id - \
        --vault-name photos \
        --job-parameters '{"Type": "inventory-retrieval"}'

# the output
{
  "location": "/502267884597/vaults/photos/jobs/gXfGnzChOw21eNpUAVpN1Ilt",
  "jobId": "gXfGnzChOw21eNpUAVpN1Ilt"
}
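Setting up the SNS notification mentioned above only needs to be done once per vault. A minimal sketch, assuming the topic name glacier-jobs (my own choice, not something from the CLI):

```shell
# One-time setup: create an SNS topic and tell Glacier to publish to it
# whenever a retrieval job for this vault completes.
# The topic name "glacier-jobs" is an arbitrary choice.
setup_vault_notifications() {
    local vault="$1" topic_arn
    # create-topic is idempotent: it returns the existing ARN if the topic exists
    topic_arn=$(aws sns create-topic --name glacier-jobs \
        --query 'TopicArn' --output text)
    aws glacier set-vault-notifications \
        --account-id - \
        --vault-name "$vault" \
        --vault-notification-config '{
            "SNSTopic": "'"$topic_arn"'",
            "Events": ["ArchiveRetrievalCompleted", "InventoryRetrievalCompleted"]
        }'
}

# usage: setup_vault_notifications photos
```

Subscribe your email address to the topic with aws sns subscribe and you no longer have to poll the job status yourself.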

Note down the job id; you will need it for the next commands. If you want to check the status of the job, use the describe-job command. The job took 3h:40m in my case. You can get the output faster, but you have to pay more for it.

$ aws glacier describe-job \
        --account-id - \
        --vault-name photos \
        --job-id gXfGnzChOw21eNpUAVpN1Ilt

# the output when the command hasn't yet finished
{
  "InventoryRetrievalParameters": {
    "Format": "JSON"
  },
  "VaultARN": "arn:aws:glacier:eu-west-1:502267884597:vaults/photos",
  "Completed": false,
  "JobId": "gXfGnzChOw21eNpUAVpN1Ilt",
  "Action": "InventoryRetrieval",
  "CreationDate": "2017-04-29T21:54:32.943Z",
  "StatusCode": "InProgress"
}

# the output when done
{
  "CompletionDate": "2017-04-30T01:35:35.183Z",
  "VaultARN": "arn:aws:glacier:eu-west-1:502267884597:vaults/photos",
  "InventoryRetrievalParameters": {
    "Format": "JSON"
  },
  "Completed": true,
  "InventorySizeInBytes": 2273,
  "JobId": "gXfGnzChOw21eNpUAVpN1IltYzu",
  "Action": "InventoryRetrieval",
  "CreationDate": "2017-04-29T21:54:32.943Z",
  "StatusMessage": "Succeeded",
  "StatusCode": "Succeeded"
}
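If you prefer not to re-run describe-job by hand, a small polling loop does the job. A sketch, using the example vault and job id from above (the 10-minute interval is an arbitrary choice):

```shell
# Poll describe-job until the job reports Completed, checking every 10 minutes.
wait_for_job() {
    local vault="$1" job_id="$2" completed
    while true; do
        # --query/--output text prints just the Completed flag ("True"/"False")
        completed=$(aws glacier describe-job \
            --account-id - \
            --vault-name "$vault" \
            --job-id "$job_id" \
            --query 'Completed' --output text)
        if [ "$completed" = "True" ]; then
            echo "Job $job_id completed"
            return 0
        fi
        sleep 600
    done
}

# usage: wait_for_job photos gXfGnzChOw21eNpUAVpN1Ilt
```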

Step 3 – Downloading the inventory file

The command for downloading the output of the job is get-job-output. You need the name of the vault, the job id and a file name.

# the command
$ aws glacier get-job-output \
        --account-id - \
        --vault-name photos \
        --job-id gXfGnzChOw21eNpUAVpN1IltYzu inventory.json

# the output
{
    "status": 200,
    "acceptRanges": "bytes",
    "contentType": "application/json"
}

Below you can see how an inventory file looks. It contains the description of each archive, provided during upload, which can help you identify the one you need. Note down the archive id.

{
  "VaultARN": "arn:aws:glacier:eu-west-1:502267884597:vaults/photos",
  "InventoryDate": "2017-01-16T07:36:52Z",
  "ArchiveList": [{
      "ArchiveId": "3ZS7U5dd3NnJK-8h9_XoDp9",
      "ArchiveDescription": "2015 photos",
      "CreationDate": "2017-01-15T18:49:17Z",
      "Size": 4534377311,
      "SHA256TreeHash": "0e7ab9f3ad2adb..."
    }, {
      "ArchiveId": "HgvDRXERI5pk6TfQGPnlBtXAA",
      "ArchiveDescription": "2016 photos",
      "CreationDate": "2017-01-15T20:53:55Z",
      "Size": 4567584828,
      "SHA256TreeHash": "1fc71814e54d398..."
    }
  ]
}
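With a larger inventory, picking out the right archive id by eye gets tedious. A sketch that filters by description (it shells out to python3 for the JSON parsing; jq would work just as well if you have it installed):

```shell
# Print the archive id(s) whose ArchiveDescription matches exactly.
find_archive_id() {
    python3 -c '
import json, sys

with open(sys.argv[1]) as f:
    inventory = json.load(f)
for archive in inventory["ArchiveList"]:
    if archive["ArchiveDescription"] == sys.argv[2]:
        print(archive["ArchiveId"])
' "$1" "$2"
}

# usage: find_archive_id inventory.json "2016 photos"
```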

Step 4 – Initiating the download archive request

Before initiating the download request, you need to prepare a JSON file which will be passed as a parameter to the initiate-job command. It is pretty much what you did for initiating the inventory download; the difference is that now you are providing the job parameters from a file. One more thing: check your Data Retrieval Settings. There are three retrieval policies; the Free Tier Only policy did not allow me to download my 4 GB archive, so I changed it to No Retrieval Limit.

# the job parameters file content
{
  "Type": "archive-retrieval",
  "ArchiveId": "HgvDRXERI5pk6TfQGPnlBtXAA",
  "Description": "Retrieve 2016 photos archive"
}

# the command
$ aws glacier initiate-job \
    --account-id - \
    --vault-name photos \
    --job-parameters file://archive_retrieval_request.json

# the output of the command
{
  "location": "/502267884597/vaults/photos/jobs/tL7YyGYVuPLCWPz3NfFjeA3",
  "jobId": "tL7YyGYVuPLCWPz3NfFjeA3"
}

And now you have to wait; it took 3h:47m in my case. If you configured an SNS topic for the vault, you will get a notification when the job has finished. At any time, you can check the status of the job using the describe-job command. When the archive is ready for download, the describe-job output will look like below.

# the command
$ aws glacier describe-job \
	--account-id - \
	--vault-name photos \
	--job-id tL7YyGYVuPLCWPz3NfFjeA3

# the output
{
    "CompletionDate": "2017-04-30T11:42:47.735Z",
    "VaultARN": "arn:aws:glacier:eu-west-1:502267884597:vaults/photos",
    "RetrievalByteRange": "0-4567584827",
    "SHA256TreeHash": "1fc71814e54d3986bb...",
    "Action": "ArchiveRetrieval",
    "JobDescription": "Retrieve 2016 photos archive",
    "ArchiveId": "HgvDRXERI5pk6TfQGPnlBtXAA",
    "StatusMessage": "Succeeded",
    "StatusCode": "Succeeded",
    "Completed": true,
    "JobId": "tL7YyGYVuPLCWPz3NfFjeA3",
    "Tier": "Standard",
    "ArchiveSHA256TreeHash": "1fc71814e54d3986bb...",
    "CreationDate": "2017-04-30T07:55:16.628Z",
    "ArchiveSizeInBytes": 4567584828
}

Step 5 – Downloading the archive

Finally, you are ready to download the archive. If the archive is big (more than 4 GB in my case) it is wiser to get it in chunks. You can find the size of the archive either in the inventory file or in the output of the describe-job command. I split the file as follows (1 GB chunks):

chunk 1          0-1073741823
chunk 2 1073741824-2147483647
chunk 3 2147483648-3221225471
chunk 4 3221225472-4294967295
chunk 5 4294967296-4567584827

To download the archive, use the same get-job-output command that you used for downloading the inventory file. You need the vault name, the job id, the byte range to retrieve and the file name where the content will be saved.

# the command for downloading the first part
$ aws glacier get-job-output \
	--account-id - \
	--vault-name photos \
	--range bytes=0-1073741823 \
	--job-id tL7YyGYVuPLCWPz3NfFjeA3udRymZ6S \
	photos.part1

# the output
{
    "status": 206,
    "acceptRanges": "bytes",
    "contentType": "application/octet-stream",
    "checksum": "0b69951bbe9e23faf941c8a302...",
    "contentRange": "bytes 0-1073741823/4567584828",
    "archiveDescription": "description"
}
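Running get-job-output once per chunk by hand is error-prone, so the whole download can be scripted. A sketch, using the example vault, job id and archive size (part files are numbered from 1, matching the file names used in this step):

```shell
# Download the whole archive as numbered part files, one get-job-output
# call per chunk.
download_chunks() {
    local vault="$1" job_id="$2" total="$3" chunk="$4"
    local start=0 end part=1
    while [ "$start" -lt "$total" ]; do
        end=$((start + chunk - 1))
        if [ "$end" -ge "$total" ]; then
            end=$((total - 1))
        fi
        aws glacier get-job-output \
            --account-id - \
            --vault-name "$vault" \
            --job-id "$job_id" \
            --range "bytes=$start-$end" \
            "photos.part$part"
        start=$((end + 1))
        part=$((part + 1))
    done
}

# usage: download_chunks photos tL7YyGYVuPLCWPz3NfFjeA3udRymZ6S 4567584828 1073741824
```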

The chunk size I used was apparently too big, as the download command failed a couple of times with: [Errno 10054] An existing connection was forcibly closed by the remote host. If that happens, just retry the failed chunk. Once you have successfully downloaded all the pieces, all you have to do is concatenate them:

$ cat photos.part1 \
	photos.part2 \
	photos.part3 \
	photos.part4 \
	photos.part5 > photos.zip

And you are done! Well, there is one more thing you could do: verify the integrity of each chunk and, in the end, of the whole archive. But I will leave that for now, maybe for a future article.
