The importance of practice

In software development, I find it to be true that you never know how to use a tool properly until you have already used it at least once before. That’s why, when you take a wrong turn and need to start over with something, you can usually get back to where you were in half the time it took you originally.

It’s important to realize that it’s the experience gained during the first attempt that gets you this productivity boost and not having read the docs or watched some video training, as valuable as these can be for getting a high-level view of an area. Practice is important for mastery of anything and in a field like ours where we must be continuously learning, we must also be continuously practicing what we learn, turning it into real experience that we can carry forward with us.

One approach I like to use in my own practice is to learn the key entities and operations of a technology, draw out how they all relate to each other and then write some tests to exercise those ideas. Unit tests are particularly good for this because they can be written in minutes and executed in seconds. When a football player is learning to take a free kick, they do it by repeating the technique in isolation: they stand there kicking at goal over and over, gaining feedback from each attempt and making adjustments. The low development overhead and rapid feedback loop provided by unit tests mean that we can use them to practice new techniques in isolation too.

One tool I have been sharpening recently is the use of cloud storage, specifically blob storage with the Azure storage SDKs. I’m going to highlight the key features of blob storage in Azure and talk you through some unit tests that exercise those features.

The code samples in this post are going to be Java, mainly for some variety as the blog has featured a lot of C# and Go recently. Don’t worry if you’re not a Java developer: the SDK is available in most languages and uses pretty much the same terminology and patterns in each (the backing storage services are the same no matter which language you use). So you should be able to translate the code snippets to your language of choice quite easily.

The elements of Azure storage

The Azure storage services are a suite of high-level storage solutions based on common data abstractions. Your average developer will recognize the semantics of most of them from the names alone. These are:

  • Block Blobs: storage of objects that are streamed in blocks.
  • Page Blobs: optimized for random IO, often used for storing VHDs used by VMs.
  • Append Blobs: like block blobs, but optimized for append operations.
  • Tables: Key-value data store. Data is structured but schemaless (think NoSQL).
  • Queues: Asynchronous message passing between components of a system.
  • Files: Like an SMB file share.

Blob stands for binary large object, basically unstructured binary data. Conversely, the offerings without blob in the name provide for structured data. To keep this post focused we’re going to work with block blobs specifically, but all of the blob services share lots of the same functionality, so most of this post applies equally well to all of them.

Using a JUnit test suite I put together, we’re going to cover:

  • Basic uploads and downloads.
  • Copying from one blob to another.
  • Getting and setting metadata.
  • Synchronizing operations with leases.
  • Snapshotting and restoring blobs.
  • Restricting permissions with shared access signatures.

And if you stay with me long enough, I’ll show you the party-piece of block blobs: the multi-threaded upload of data in multiple blocks, with reassembly on the service end.

Let’s get going.

Exploring blob storage

  • Note that my entire test suite is available in a GitHub Gist.
  • The Azure Storage SDK for Java is available on GitHub.

Setup, storage accounts and containers

So we’re stood at the blob service’s front door, how do we get in? All Azure storage belongs in a storage account and each storage account is protected by a pair of storage account keys. Mostly, these keys are used as a shared secret between your client application and the storage service.

First, we’re going to create a storage account and associated keys for the unit test suite with PowerShell. You could also do this in the Azure portal, but we always go with the more automation-friendly option here at AnchorLoop. Open PowerShell and follow along:

Import-Module AzureRM.Storage

# Login to Azure. This will open a dialog asking for your user credentials.
# You could also use a service principal in a more automated scenario, i.e.
# no interactive user logon.
Login-AzureRmAccount

# Create a resource group for the storage account. Choose a location suitable
# for you.
$resGroup = New-AzureRmResourceGroup -Name "blockblobtests" -Location "UK South"

# Create a new storage account. Local-redundant storage is fine for testing
# purposes. Note that storage accounts need to have globally-unique names (they
# are used in URIs), so the below name might not work for you. In that case,
# just choose something else.
$account = $resGroup | New-AzureRmStorageAccount -Name "blockblobtests" -SkuName "Standard_LRS"

# Get the storage account keys
$keys = $account | Get-AzureRmStorageAccountKey

# The two strings you need to try the JUnit test suite yourself are the storage
# account name and one of the keys. These can be accessed like so.
$account.StorageAccountName
$keys[0].Value

Now that we have the storage account and a key, let’s look at the setup method of the JUnit test suite:

private static CloudBlobContainer container = null;

public static void setup() throws URISyntaxException, InvalidKeyException, StorageException, IOException {
    // Load the storage account name and key from the .config file in the test resources.
    InputStream inputStream = BlockBlobTest.class.getResourceAsStream("");
    Properties properties = new Properties();
    properties.load(inputStream);

    String storageConnectionString = "DefaultEndpointsProtocol=https;" +
        "AccountName=" + properties.getProperty("storage_account") + ";" +
        "AccountKey=" + properties.getProperty("storage_account_key");

    CloudStorageAccount storageAccount = CloudStorageAccount.parse(storageConnectionString);
    CloudBlobClient blobClient = storageAccount.createCloudBlobClient();
    // Container names must be valid DNS names, i.e. only lower case letters, numbers and dashes (-)
    container = blobClient.getContainerReference("testcontainer");
    container.createIfNotExists();

    // Delete any blobs from a previous run. We don't do this in teardown because it is sometimes
    // useful to inspect the test artifacts after a run.
    for (ListBlobItem item : container.listBlobs()) {
        if (item instanceof CloudBlockBlob) {
            CloudBlockBlob blockBlob = (CloudBlockBlob) item;
            blockBlob.delete(DeleteSnapshotsOption.INCLUDE_SNAPSHOTS, null, null, null);
        }
    }
}

This method:

  • Reads the storage account name and storage account key out of a .config file in the test resources directory. Get these values from your PowerShell session.
  • Builds a connection string out of them. Notice this is going to use HTTPS – I wouldn’t use unencrypted HTTP for interacting with my cloud storage and I hope you wouldn’t either.
  • Creates an object for the storage account and a blob client from the connection string. We use the client to create a single blob container; all the tests share this container and it can be accessed as a static property on the test class.
  • Finally, we clear out any blobs in the container left over from the previous test run. Cleanup-on-start is useful when developing test suites as it allows you to manually inspect the test artifacts after the suite completes. If I didn’t want this, I could easily move this for-loop to the tearDown method.
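The first two steps need nothing from the Azure SDK, so they are easy to practice in isolation. Here is a minimal sketch using only the standard library. The class name, the helper names and the placeholder values are my own assumptions for illustration; only the two property keys and the connection string layout come from the setup method above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class ConnectionStringSketch {
    // Parses .properties-style text. The real suite loads a resource stream instead.
    static Properties parse(String text) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    // Builds an HTTPS connection string from the two keys the setup method reads.
    static String buildConnectionString(Properties properties) {
        return "DefaultEndpointsProtocol=https;"
            + "AccountName=" + properties.getProperty("storage_account") + ";"
            + "AccountKey=" + properties.getProperty("storage_account_key");
    }

    public static void main(String[] args) {
        // Placeholder values, not real credentials.
        Properties properties = parse(
            "storage_account=exampleaccount\nstorage_account_key=ZXhhbXBsZQ==\n");
        System.out.println(buildConnectionString(properties));
    }
}
```

Note that only the first = on each line separates key from value, so the padding characters at the end of a base64-encoded account key survive the parse.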

Now we can start playing with blobs.

Basic upload and download

The following unit test performs a basic upload and download of an image file:

public void basicUploadAndDownload() throws Exception {
    URL imageUrl = BlockBlobTest.class.getResource("image.png");
    File imageFile = new File(imageUrl.getFile());
    CloudBlockBlob blob = container.getBlockBlobReference("basicUploadAndDownload.png");

    // Upload the file, calculating the MD5 checksum.
    MessageDigest uploadMD5 = MessageDigest.getInstance("MD5");
    try (DigestInputStream inputStream = new DigestInputStream(new FileInputStream(imageFile), uploadMD5)) {
        blob.upload(inputStream, imageFile.length());
    }

    // The MD5 content calculated by the blob service should match our checksum.
    BlobProperties properties = blob.getProperties();
    assertEquals(properties.getContentMD5(), Base64.getEncoder().encodeToString(uploadMD5.digest()));

    // Download the uploaded file to the temp directory, calculating the MD5 checksum again.
    File tempFile = File.createTempFile("basicUploadAndDownload", ".png");
    MessageDigest downloadMD5 = MessageDigest.getInstance("MD5");
    try (DigestOutputStream outputStream = new DigestOutputStream(new FileOutputStream(tempFile.getAbsolutePath()), downloadMD5)) {
        blob.download(outputStream);
    }

    // Downloaded file MD5 should match blob content MD5
    assertEquals(properties.getContentMD5(), Base64.getEncoder().encodeToString(downloadMD5.digest()));
}

Uploading and downloading data to the blob service in Java is as simple as passing an input or output stream to CloudBlockBlob.upload or CloudBlockBlob.download respectively.

Here, I use digest streams to calculate the MD5 digest of the file both before the upload and after the download. I compare these to the MD5 calculated by the blob service itself (accessible on the blob properties), to be sure that the stored file is not corrupted.
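If digest streams are new to you, they can be practiced without touching Azure at all. The following is a small sketch, with a class and method name of my own choosing, that drains any input stream through a DigestInputStream and returns the MD5 in the same base64 form that the test compares against the blob's content MD5.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

class Md5StreamSketch {
    // Reads the stream to the end and returns the base64-encoded MD5 of its bytes.
    static String base64Md5(InputStream source) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            try (DigestInputStream in = new DigestInputStream(source, md5)) {
                byte[] buffer = new byte[4096];
                while (in.read(buffer) != -1) {
                    // Reading is enough: the digest is updated as a side effect.
                }
            }
            return Base64.getEncoder().encodeToString(md5.digest());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(base64Md5(new ByteArrayInputStream(data)));
    }
}
```

The digest only covers bytes that actually pass through the stream, which is why the test reads its digest after the upload or download has completed.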

Copying from one blob to another

public void basicCopy() throws Exception {
    // Upload an image to our source blob.
    URL imageUrl = BlockBlobTest.class.getResource("image.png");
    File imageFile = new File(imageUrl.getFile());
    CloudBlockBlob sourceBlob = container.getBlockBlobReference("basicCopySource.png");
    try (FileInputStream inputStream = new FileInputStream(imageFile)) {
        sourceBlob.upload(inputStream, imageFile.length());
    }

    // Copy source blob to destination blob.
    CloudBlockBlob destinationBlob = container.getBlockBlobReference("BasicCopyDest.png");
    destinationBlob.startCopy(sourceBlob);

    // Refresh the destination's properties from the service, then check that source and destination match.
    destinationBlob.downloadAttributes();
    BlobProperties sourceProperties = sourceBlob.getProperties();
    BlobProperties destinationProperties = destinationBlob.getProperties();
    assertEquals(sourceProperties.getContentMD5(), destinationProperties.getContentMD5());
}

This test uploads an image to the source blob, then copies it to a destination blob with the CloudBlockBlob.startCopy method, comparing their MD5s at the end. No more difficult than doing a simple upload or download, really.

Using metadata

You can set any metadata on blobs that makes sense to your application.

public void setAndGetMetadata() throws Exception {
    CloudBlockBlob blob = container.getBlockBlobReference("metadata.txt");
    blob.uploadText("I have metadata.");

    // Create some metadata, arbitrary key-value pairs. Any string value will do here.
    HashMap<String, String> metadata = new HashMap<>();
    metadata.put("Timestamp", new Date().toString());

    // setMetadata sets the local blob property and uploadMetadata commits it to the blob service.
    blob.setMetadata(metadata);
    blob.uploadMetadata();

    // downloadAttributes refreshes the blob metadata (amongst other properties) from the blob service.
    blob.downloadAttributes();
    assertEquals(blob.getMetadata().get("Timestamp"), metadata.get("Timestamp"));
}

You set metadata by building a HashMap where both the keys and values are strings, then setting it to your blob with CloudBlockBlob.setMetadata and uploading it to the blob service with CloudBlockBlob.uploadMetadata. Note that setMetadata alone only modifies your local blob instance and does not perform a roundtrip to Azure.

Synchronizing blob access with a lease

Now that we can do all the simple operations, how do we deal with race conditions where multiple clients are accessing our blobs at the same time? If we want to put a lock on a blob and know that only we can modify it while we’re using it, we can use leases.

public void uploadWithLease() throws Exception {
    CloudBlockBlob blob = container.getBlockBlobReference("leaseTests.txt");
    blob.uploadText("I am testing leases.");

    // Acquire the lease on our blob
    String leaseId = blob.acquireLease();
    AccessCondition leaseCondition = AccessCondition.generateLeaseCondition(leaseId);

    // Attempt to upload without using the lease. This will fail.
    try {
        String testMessage = "This upload should fail.";
        blob.uploadText(testMessage);
        fail("Upload without the lease should not succeed.");
    } catch (StorageException ex) {
        // We expected a StorageException to be thrown, so just continue.
    }

    // Now upload correctly with the lease
    String testMessage = "This upload with lease should succeed.";
    blob.uploadText(testMessage, null, leaseCondition, null, null);
    assertEquals(blob.downloadText(null, leaseCondition, null, null), testMessage);

    // Release the lease and ensure that the blob is no longer locked.
    blob.releaseLease(leaseCondition);
    testMessage = "Once the lease is released, we can upload without one again.";
    blob.uploadText(testMessage);
    assertEquals(blob.downloadText(), testMessage);
}

A lease grants a mutex on a given blob. Once we have generated an access condition with a lease, all operations on the blob must be called with the access condition as an argument until the lease is released. You can see from the code that a simple upload without the access condition will fail if the blob is leased.

The longer forms of the various upload and download methods allow you to supply an AccessCondition, amongst other objects useful for working with blobs (which can be null if not required).
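To make the mutex idea concrete, here is a toy in-memory model of the lease rules that the test exercises. To be clear, this is not the Azure SDK and not how the blob service is implemented: LeasedBlobSketch and all of its methods are hypothetical names, and the model only captures the behavior seen above, where writes are rejected unless the caller presents the active lease ID.

```java
import java.util.UUID;

class LeasedBlobSketch {
    private String content = "";
    private String activeLeaseId; // null while the blob is not leased

    // Grants the lease if nobody else holds it, like a non-blocking mutex acquire.
    synchronized String acquireLease() {
        if (activeLeaseId != null) {
            throw new IllegalStateException("blob is already leased");
        }
        activeLeaseId = UUID.randomUUID().toString();
        return activeLeaseId;
    }

    // Only the holder of the active lease ID can release it.
    synchronized void releaseLease(String leaseId) {
        requireLease(leaseId);
        activeLeaseId = null;
    }

    // Succeeds if the blob is unleased, or if the caller presents the active lease ID.
    synchronized void write(String newContent, String leaseId) {
        requireLease(leaseId);
        content = newContent;
    }

    synchronized String read() {
        return content;
    }

    private void requireLease(String leaseId) {
        if (activeLeaseId != null && !activeLeaseId.equals(leaseId)) {
            throw new IllegalStateException("lease ID missing or incorrect");
        }
    }

    public static void main(String[] args) {
        LeasedBlobSketch blob = new LeasedBlobSketch();
        String leaseId = blob.acquireLease();
        blob.write("written with the lease", leaseId);
        blob.releaseLease(leaseId);
        blob.write("written after release", null);
        System.out.println(blob.read());
    }
}
```

The lease ID plays the role that the AccessCondition plays in the SDK calls: it is the proof of ownership that travels with every request while the lock is held.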

Creating and restoring to snapshots

Creating a snapshot of a blob is done using the CloudBlockBlob.createSnapshot method, which simply produces another blob with the same contents in the same container. Restoring a further modified blob to a snapshot looks just like the basic copy operation from earlier.

public void usingSnapshots() throws Exception {
    CloudBlockBlob originalBlob = container.getBlockBlobReference("usingSnapshots.txt");
    String originalText = "Original state.";
    originalBlob.uploadText(originalText);

    // Create a snapshot of the original blob, it should have been created in the same container.
    CloudBlockBlob snapshot = (CloudBlockBlob) originalBlob.createSnapshot();
    assertEquals(snapshot.getContainer(), container);

    // Modify the original blob after snapshot-ing it, ensure it is modified.
    String modifiedText = "Modified state.";
    originalBlob.uploadText(modifiedText);
    assertEquals(originalBlob.downloadText(), modifiedText);

    // Restoring to a snapshot is just a normal blob-to-blob copy.
    originalBlob.startCopy(snapshot);
    assertEquals(originalBlob.downloadText(), originalText);
}

Looks a bit of an anti-climax doesn’t it? What’s so special about a snapshot blob if we can copy to and from any other blob we want anyway? There is a little more to snapshots, which the next test will demonstrate.

public void invalidSnapshotOperations() throws Exception {
    CloudBlockBlob originalBlob = container.getBlockBlobReference("invalidSnapshotOperations.txt");
    originalBlob.uploadText("I am testing snapshots.");

    CloudBlockBlob snapshot = (CloudBlockBlob) originalBlob.createSnapshot();

    // Uploads are not allowed on snapshot blobs.
    String text = "Should not be able to upload to snapshot.";
    try {
        snapshot.uploadText(text);
        fail(text);
    } catch (IllegalArgumentException ex) {
        // Expected exception was thrown, continue
    }

    // Metadata is not allowed either.
    HashMap<String, String> metadata = new HashMap<>();
    text = "Uploading metadata to a snapshot is not allowed.";
    metadata.put("Invalid", text);
    try {
        snapshot.setMetadata(metadata);
        snapshot.uploadMetadata();
        fail(text);
    } catch (IllegalArgumentException ex) {
        // Expected exception was thrown, continue
    }

    // Finally, you cannot create a snapshot of another snapshot.
    try {
        CloudBlob snapshotOfSnapshot = snapshot.createSnapshot();
        fail("Should not be able to create snapshot of snapshot.");
    } catch (IllegalArgumentException ex) {
        // Expected exception was thrown, continue
    }
}

We can see that snapshot blobs are read-only, which means we can trust that they will preserve a previous state of another blob without any nefarious clients modifying them further.

Restricting permissions with shared access signatures

We often don’t want client applications to have the full range of operations available, it is much safer to give them only the access permissions they need to perform their specific function. Careful management of access permissions can reduce the risk of a security exploit causing the theft or destruction of your data. For the blob service, access permissions can be controlled with a SharedAccessBlobPolicy and a shared access signature.

public void restrictingPermissions() throws Exception {
    CloudBlockBlob blob = container.getBlockBlobReference("restrictingPermissions");
    String blobContent = "Protected content.";
    blob.uploadText(blobContent);

    // Create a shared access policy which allows reads only and expires after five minutes.
    SharedAccessBlobPolicy policy = new SharedAccessBlobPolicy();
    Date expiryTime = new Date(System.currentTimeMillis() + TimeUnit.MINUTES.toMillis(5));
    policy.setPermissions(EnumSet.of(SharedAccessBlobPermissions.READ));
    policy.setSharedAccessExpiryTime(expiryTime);

    // Generate a SAS token and shared access signature from the shared access policy.
    String sasToken = blob.generateSharedAccessSignature(policy, null);
    StorageCredentialsSharedAccessSignature credentials = new StorageCredentialsSharedAccessSignature(sasToken);

    // From the SAS we can create a read-only version of the original blob.
    CloudBlockBlob readOnlyBlob = new CloudBlockBlob(credentials.transformUri(blob.getUri()));

    // Blob can be read successfully.
    assertEquals(readOnlyBlob.downloadText(), blobContent);

    // Blob cannot be written to.
    try {
        String text = "Should not be able to write to a blob with a SAS for reads only.";
        readOnlyBlob.uploadText(text);
        fail(text);
    } catch (StorageException e) {
        // Expected exception was thrown, continue
    }

    // Blob cannot be deleted.
    try {
        readOnlyBlob.delete();
        fail("Should not be able to delete a blob with a SAS for reads only.");
    } catch (StorageException e) {
        // Expected exception was thrown, continue
    }
}

I don’t think I can explain shared access signatures (SAS) better than the official docs can, but I’ll attempt a layman’s description. A blob accessed with a shared access signature is accessed (by the SDK) via a signed URI. This URI has a bunch of query parameters in it that describe the permissions granted, a start and expiry time, and other properties. Most importantly, it contains a signature: a keyed hash (HMAC-SHA256) computed over those parameters with the storage account key. The service recomputes this signature to authenticate each request, so none of the parameters can be altered without invalidating it.
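The signing idea itself can be tried out with nothing but the JDK. The sketch below signs a string of parameters with HMAC-SHA256 and shows that changing a parameter invalidates the signature. The parameter string here is a simplified stand-in that I made up for illustration, not the exact string-to-sign format that Azure uses, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

class SignedQuerySketch {
    // Computes a base64-encoded HMAC-SHA256 over the parameters using the shared key.
    static String sign(String stringToSign, byte[] key) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            byte[] signature = mac.doFinal(stringToSign.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // The verifier recomputes the signature and compares in constant time.
    static boolean verify(String stringToSign, String signature, byte[] key) {
        byte[] expected = sign(stringToSign, key).getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(expected, signature.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] key = "not-a-real-account-key".getBytes(StandardCharsets.UTF_8);
        // Simplified stand-in for the signed fields: permissions, expiry, resource.
        String granted = "r\n2030-01-01T00:00:00Z\n/container/blob.txt";
        String signature = sign(granted, key);
        System.out.println(verify(granted, signature, key));
        // A client that upgrades itself from read to read-write fails verification.
        String tampered = "rw\n2030-01-01T00:00:00Z\n/container/blob.txt";
        System.out.println(verify(tampered, signature, key));
    }
}
```

Because only holders of the key can produce a valid signature, a client that has been handed a signed URI cannot widen its own permissions or extend its own expiry time.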

To create a SAS with the SDK we:

  • Construct a SharedAccessBlobPolicy containing the select permissions we wish to grant and a time at which those permissions will expire.
  • Produce a SAS token for a specific blob using this policy object which can, in turn, be used to produce the full SAS.
  • We can use the SAS to transform the URI of the source blob into the signed URI, which can be used to initialize another blob instance that will honor these permissions.

In the above test, we demonstrate shared access signatures by creating a second reference to the original blob that carries only read permissions. The read-only instance is not able to modify or delete the blob content. You can imagine how a broker service could be used to manage permissions on your data, delegating only the access needed by any other services that want to interact with it.

Parallel block upload

Now for the big one. The reason you might want to use block blobs over any alternatives. If you have followed on for this long, this is your reward. If you just scrolled straight to the bottom… well, we’ll just gloss over that!

A key feature of a block blob is that you can divide a large file into multiple byte streams, upload them all concurrently and then have the blob service reassemble them in the right order for you on the service end. This could help you process data faster by utilizing any spare CPU cores you have available. Bear in mind, though, that introducing concurrency doesn’t always make things faster. It can introduce a thread management overhead that negates any upload performance gains; it depends on how big your data is and what sort of thread contention you get.

With the usual warning for anything multi-threaded out of the way, let’s take a look at how we might do this in Java:

/*  A callable class that encapsulates a partial upload of a file. A number of instances are passed to an
    ExecutorService to upload multiple portions of a file concurrently.
*/
private class BlockUploadTask implements Callable<Void> {
    private final String id;
    private final ByteArrayInputStream byteStream;
    private final CloudBlockBlob blob;

    BlockUploadTask(String id, ByteArrayInputStream byteStream, CloudBlockBlob blob) {
        this.id = id;
        this.byteStream = byteStream;
        this.blob = blob;
    }

    public String getId() { return id; }

    public Void call() {
        try {
            blob.uploadBlock(id, byteStream, -1);
        } catch (StorageException e) {
            throw new RuntimeException(e);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }

        // Close the bytestream cleanly after upload.
        try {
            byteStream.close();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }

        return null;
    }
}

public void parallelBlockUpload() throws Exception {
    CloudBlockBlob blob = container.getBlockBlobReference("parallelBlockUpload.jpg");
    URL imageUrl = BlockBlobTest.class.getResource("largeimage.jpg");
    File largeFile = new File(imageUrl.getFile());

    /*  This code won't work with files over (numThreads * 2GB) in size because we couldn't fit a portion of
        the file into a byte array, whose maximum size would be Integer.MAX_VALUE (around 2GB worth of bytes).
        Rather than over-complicate the code to support extreme files, I'm just not going to.
    */
    int numThreads = 8;
    if ((largeFile.length() / numThreads) > Integer.MAX_VALUE)
        fail("Source image file is too large.");

    // Divide before casting, so that a long file length cannot overflow the int.
    int bufferSize = (int) (largeFile.length() / numThreads);
    List<BlockUploadTask> uploadTasks = new ArrayList<>();

    /*  Read the image file into multiple byte streams, creating a list of callable upload tasks.
        Also generate a base64-encoded ID for each block entry.
    */
    MessageDigest uploadMD5 = MessageDigest.getInstance("MD5");
    Base64.Encoder encoder = Base64.getEncoder();
    try (DigestInputStream inputStream = new DigestInputStream(new FileInputStream(largeFile), uploadMD5)) {
        int count = 1;
        while (inputStream.available() > 0) {
            String id = encoder.encodeToString(String.format("%d", count).getBytes());
            byte[] bytes = new byte[bufferSize];
            // The last read can return fewer bytes than the buffer holds, so only wrap what was read.
            int bytesRead = inputStream.read(bytes);
            uploadTasks.add(new BlockUploadTask(id, new ByteArrayInputStream(bytes, 0, bytesRead), blob));
            count++;
        }
    }

    // Perform the parallel execution of all the upload tasks.
    ExecutorService executor = Executors.newFixedThreadPool(numThreads);
    List<Future<Void>> results = executor.invokeAll(uploadTasks); // blocks until every task has finished
    executor.shutdown(); // does not block, it just lets the pool's threads exit
    for (Future<Void> result : results) {
        result.get(); // rethrows (wrapped) any exception thrown inside a task
    }

    /*  Build the block list in the order that the uploaded blocks need to be assembled by the service,
        this is basically an ordered list of the base64-encoded block ids. Then commit the list to complete
        the upload.
    */
    ArrayList<BlockEntry> blocks = new ArrayList<>();
    uploadTasks.forEach((task) -> blocks.add(new BlockEntry(task.getId())));
    blob.commitBlockList(blocks);

    /*  The blob service won't calculate the MD5 content of a parallel block upload for us, like it does
        for simple uploads. Let's download the file ourselves and make sure the MD5 matches.
    */
    File tempFile = File.createTempFile("parallelBlockUpload", ".jpg");
    MessageDigest downloadMD5 = MessageDigest.getInstance("MD5");
    try (DigestOutputStream outputStream = new DigestOutputStream(new FileOutputStream(tempFile.getAbsolutePath()), downloadMD5)) {
        blob.download(outputStream);
    }
    assertEquals(encoder.encodeToString(downloadMD5.digest()), encoder.encodeToString(uploadMD5.digest()));
}

Here’s a high-level walkthrough of the test:

  • We load a large image file from disk and divide it into eighths, as we’re using eight threads for the parallel upload.
  • We load each eighth into a separate byte array and generate a base64-encoded ID for each. We feed each id-byte stream pair into an instance of our custom, callable class: BlockUploadTask. This class encapsulates a partial file upload operation.
  • We create a thread pool to process the tasks with the ExecutorService and invoke them all. The thread pool works through the tasks until finished, then is disposed of.
  • To complete the upload, we need to call CloudBlockBlob.commitBlockList with the list of BlockEntry objects that represents the order that we want the uploaded blocks to be reassembled by the service. Each of these objects is instantiated with one of the base64-encoded IDs we generated earlier.
  • Finally, we download the file again to generate an MD5 digest that we can use to validate that the parallel upload and block reassembly worked as expected.
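The split, stage and commit flow from the walkthrough can be rehearsed without a storage account. Below is a sketch in which an in-memory map stands in for the blob service. Everything in it (BlockListSketch, stageBlock, commit, uploadInParallel) is a hypothetical name of my own rather than an SDK call. It also pads the block counter with zeros before encoding, because the blob service expects every block ID within a blob to have the same length.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BlockListSketch {
    // Stand-in for the service side: uncommitted blocks keyed by block ID.
    private final Map<String, byte[]> stagedBlocks = new ConcurrentHashMap<>();

    // Zero-padded counters give every ID the same length once base64-encoded.
    static String blockId(int index) {
        byte[] raw = String.format("%06d", index).getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(raw);
    }

    void stageBlock(String id, byte[] data) {
        stagedBlocks.put(id, data);
    }

    // Concatenates the staged blocks in the order given by the block list.
    byte[] commit(List<String> blockIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String id : blockIds) {
            byte[] block = stagedBlocks.get(id);
            out.write(block, 0, block.length);
        }
        return out.toByteArray();
    }

    // Splits data into blocks, stages them concurrently, then commits them in order.
    static byte[] uploadInParallel(byte[] data, int blockSize, int threads) {
        BlockListSketch service = new BlockListSketch();
        List<String> blockIds = new ArrayList<>();
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int offset = 0, index = 0; offset < data.length; offset += blockSize, index++) {
            String id = blockId(index);
            byte[] block = Arrays.copyOfRange(data, offset, Math.min(offset + blockSize, data.length));
            blockIds.add(id);
            tasks.add(() -> { service.stageBlock(id, block); return null; });
        }
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            for (Future<Void> result : executor.invokeAll(tasks)) {
                result.get(); // surfaces any failure from a staging task
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
        return service.commit(blockIds);
    }

    public static void main(String[] args) {
        byte[] data = "blocks can arrive in any order".getBytes(StandardCharsets.UTF_8);
        byte[] result = uploadInParallel(data, 7, 4);
        System.out.println(new String(result, StandardCharsets.UTF_8));
    }
}
```

The blocks may be staged in any order by the worker threads. Only the order of the IDs in the committed list decides how the final content is assembled.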

Why use the ExecutorService and not the parallel streams API?

Now for a slight programming language-related tangent. Java-heads might be wondering why I chose to write the concurrent portion of this test with a thread pool controlled by the ExecutorService instead of the parallel streams API introduced in Java 8.

Well, my first attempt at this test did use parallel streams to perform the concurrent upload. Here is a snippet of that old code:

// Parallel upload of the byte streams as individual blocks.
idToByteStreams.parallelStream().forEach((idToByteStream) -> {
    String id = idToByteStream.getKey();
    ByteArrayInputStream byteStream = idToByteStream.getValue();
    try {
        blob.uploadBlock(id, byteStream, -1);
    } catch (StorageException e) {
        throw new RuntimeException(e);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
});

At this point, the id-byte stream pair was literally a key-value pair and the BlockUploadTask class did not exist. The parallel for-each lambda looks quite modern and cool, so why did I not stick with it?

I discovered that it’s not best practice to perform IO-intensive operations (like uploading stuff to a web service) with parallel streams because those underlying tasks get processed by ForkJoinPool.commonPool(), which is a thread pool shared by the entire Java application. If this were a larger application (not a test suite) and I had tied up all the common worker threads with potentially long-running, IO-intensive tasks then this could have terrible consequences for the responsiveness of the rest of the application. Some useful links on this area are:

Although I got parallel streams working and the code looked pretty, I didn’t feel (in good conscience) that I could give you an example that might bite you if you were using it in a serious application. So I reimplemented it using a dedicated thread pool from the ExecutorService rather than a shared one, which seems to be best practice for these kinds of operations.
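If you want to see the difference for yourself, the following sketch records which thread runs each task under both approaches. The class name, method names and the "upload-worker-" thread prefix are my own choices for illustration. The tasks here just return the current thread's name, where the real ones would be doing slow IO.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class DedicatedPoolSketch {
    // Runs the tasks on a pool that belongs to us, with recognizable thread names.
    static List<String> threadNamesFromDedicatedPool(int taskCount, int threads) {
        AtomicInteger counter = new AtomicInteger();
        ExecutorService executor = Executors.newFixedThreadPool(threads,
            runnable -> new Thread(runnable, "upload-worker-" + counter.incrementAndGet()));
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (int i = 0; i < taskCount; i++) {
                tasks.add(() -> Thread.currentThread().getName());
            }
            List<String> names = new ArrayList<>();
            for (Future<String> result : executor.invokeAll(tasks)) {
                names.add(result.get());
            }
            return names;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    // Runs the same tasks through a parallel stream, which borrows the calling
    // thread and the workers of the application-wide common pool.
    static List<String> threadNamesFromParallelStream(int taskCount) {
        return IntStream.range(0, taskCount).parallel()
            .mapToObj(i -> Thread.currentThread().getName())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println("dedicated pool:  " + threadNamesFromDedicatedPool(8, 4));
        System.out.println("parallel stream: " + threadNamesFromParallelStream(8));
    }
}
```

On a multi-core machine the second line will typically show common pool worker threads alongside the calling thread, and those are the threads that every other parallel stream in the application has to share.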
