03.14.2017 | Francisco Rodriguez

Asynchronicity in Three Languages

Consider the following scenario:

Imagine a client who hands you a spreadsheet with 4986 URLs from a news website. You evaluate the spreadsheet: many of the links seem related, a few of them seem disparate - but all of them contain the embed codes of a video provider that went bankrupt three weeks ago, and the client wants to see them removed.

An example of what a failed embed code looks like

Fixing this is simple enough - you find out where the pages store their article markup within the database, you check each one of those records for the embed codes using a script, and you edit them accordingly. You set up a test environment, run your script, evaluate the results for a few of the URLs, and notice that they've had their embed codes removed successfully. You pass the URL verification task to QA, and after an hour or two, they report that the 100 or so random URLs they've selected have no embed codes either. Satisfied, you report your findings to your client, and ask them for a good time to fire off the script in production.

This seems like standard fare for a web development firm, right? Now imagine this: what happens when the client wants to have all 4986 links checked?

Well, you use another script.

The Naive Attempt

In the following script, we use built-in PHP functions (along with some "persuasion") to load and verify the contents of a list of URLs using a hypothetical verify_embed_code() function. (I've left out the embed code verification code for the sake of brevity.)

At first glance, a script like this works fine - on a fast connection, it is able to verify a set of 50 or so pages in about a minute. Give it enough URLs with 2-3 seconds of page load time, though, and running it will start to feel like going to a busy Walmart with a single cashier.

$urls = array("http://example.com/content/1", "http://example.com/blog/352", ... ); 

$invalid_urls = array();
$error_urls = array();

// Here lies the main bottleneck of this example - the following code
// performs the requests one by one. Since file_get_contents has to wait
// until its HTTP request is completed, this can put a huge damper on
// testing speed if the requests take a long time.
foreach ($urls as $url) {
    $content = @file_get_contents($url);

    if ($content === false) {
        $error_urls[] = $url;
    } else if( verify_embed_code($content) === FALSE ) {
        $invalid_urls[] = $url;
    }
}

print_r("FINAL RESULTS\n");
print_r("Invalid URLs:\n");
print_r($invalid_urls);
print_r("URLs with errors:\n");
print_r($error_urls);

This is an example of how scalability concerns can hinder efficient, repeatable testing. The need to overcome this hindrance invites the following question: what can be done to accelerate these tests?
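To put a rough number on the problem, here is a back-of-the-envelope estimate of how long the one-at-a-time approach would take for the full spreadsheet. It is only a sketch: it assumes every page takes the same time to load and ignores everything except page load time, and the function name estimateSequentialSeconds is made up for this illustration.

```javascript
// Rough estimate only: assumes a constant load time per page and
// strictly one request at a time. The function name is hypothetical.
function estimateSequentialSeconds(urlCount, secondsPerPage) {
  return urlCount * secondsPerPage;
}

// 4986 URLs at the 2-3 seconds per page mentioned above.
const bestCase = estimateSequentialSeconds(4986, 2);
const worstCase = estimateSequentialSeconds(4986, 3);

console.log((bestCase / 3600).toFixed(1) + " hours");  // 2.8 hours
console.log((worstCase / 3600).toFixed(1) + " hours"); // 4.2 hours
```

Under those assumptions, a single full verification run ties up somewhere between roughly three and four hours, which is a long time to wait for a test you may need to repeat.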

Get More Cashiers

Fortunately, if there is one thing computers are good at nowadays, it's multitasking. Even something as basic as a cheap smartphone runs a myriad of simultaneous processes for performing tasks - monitoring phone calls, sending out notifications, rendering graphics, etc.

Therefore, it stands to reason that we can take advantage of that - for instance, we can increase the speed of our embed code test dramatically by running multiple requests at the same time:

// The following example uses the Guzzle HTTP client and its implementation of Promises
// to perform its requests. While it is possible to use PHP built-ins like curl_multi_exec
// to perform this task, we feel that the use of a proper abstraction is more important than
// the desire to steer clear of third-party dependencies.
use GuzzleHttp\Pool;
use GuzzleHttp\Client;
use GuzzleHttp\Psr7\Request;

$urls = array("http://example.com/content/1", "http://example.com/blog/352", ... ); 
$invalid_urls = array();
$error_urls = array();

$client = new Client();
$tasks = array();
foreach ($urls as $url) {
    $tasks[$url] = new Request('GET', $url);
}

// In the spirit of maintaining the cashier metaphor,
// we are currently declaring our need for a "Pool" of cashiers
// with the following statement. We want them to take care of this
// giant line of customers (our tasks), and we want them to
// do something once each cashier is finished with a customer - like,
// say, handing over a receipt if the transaction succeeds.
// Or calling the rent-a-cops if it doesn't.
// Note that PHP closures do not see outer variables unless they are imported,
// so the result arrays are captured by reference with use (&...).
$pool = new Pool($client, $tasks, [
    'concurrency' => 5,
    'fulfilled' => function ($response, $url) use (&$invalid_urls) {
        $content = $response->getBody()->getContents();
        if (verify_embed_code($content) === false) {
            $invalid_urls[] = $url;
        }
    },
    'rejected' => function ($reason, $url) use (&$error_urls) {
        $error_urls[] = $url;
    },
]);

// We fire off the pool of workers with the following statement.
// The method name of promise() seems like a curious choice compared to something like
// start(), but there is a very good reason for it: the script can now do
// whatever it pleases while the requests are happening (e.g. an animated progress bar on the console).
$promise = $pool->promise();

// We'll refrain from implementing that progress bar in this example, though.
$promise->wait();

print_r("FINAL RESULTS\n");
print_r("Invalid URLs:\n");
print_r($invalid_urls);
print_r("URLs with errors:\n");
print_r($error_urls);

While this concept is quite applicable in a variety of scenarios - similar patterns can be used to write multiple files at once, or to perform multiple database queries without having to wait for the slow ones to finish - its use can be difficult due to the different ways programming languages attempt to implement it.
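Stripped of any particular library, the pattern is small. The following is a minimal sketch of it in plain JavaScript, assuming nothing beyond native promises. The names mapWithConcurrency, cashier and slowDouble are invented for this illustration, and a timer stands in for the slow operation (an HTTP request, a file write, a database query).

```javascript
// Runs worker(item) for every item, with at most `limit` calls in flight.
// mapWithConcurrency is an illustrative name, not a library API.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item nobody has claimed yet

  // Each "cashier" keeps claiming the next customer until the line is empty.
  async function cashier() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }

  const cashiers = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    cashiers.push(cashier());
  }
  await Promise.all(cashiers);
  return results;
}

// Usage with a stand-in for a slow operation (a timer instead of a real request).
const slowDouble = (n) =>
  new Promise((resolve) => setTimeout(() => resolve(n * 2), 10));

mapWithConcurrency([1, 2, 3, 4, 5, 6], 2, slowDouble).then((doubled) => {
  console.log(doubled); // [ 2, 4, 6, 8, 10, 12 ]
});
```

Results are stored by index, so they come back in the same order as the input even when the slow operations finish out of order.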

Something Involving Synonyms

For instance, let's take the following snippet of C# code: it performs the exact same task as our second PHP example, but it does so in a manner that's almost unrecognizable.

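using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// This class utilizes the async/await pattern in C#'s built-in Standard Library.
// It was adapted from one of Theo Yaung's async examples from Stack Overflow.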
public class URLChecker
{
    private static List<String> invalidUrls;
    private static List<String> errorUrls;

    static void Main()
    {
        invalidUrls = new List<string>();
        errorUrls = new List<string>();

        // Like in the previous example, we wait for our lookup task to finish before showing the results.
        // And once again, this is entirely optional: replacing this statement with Task.Run(AsyncLookup)
        // would allow it to continue execution - for this example, though, there's not much of a point in
        // not waiting for the results before displaying them.
        AsyncLookup().Wait();

        Console.WriteLine("FINAL RESULTS");
        Console.WriteLine("Invalid URLs:");
        invalidUrls.ForEach(i => Console.WriteLine("{0}", i));
        Console.WriteLine("URLs with errors:");
        errorUrls.ForEach(i => Console.WriteLine("{0}", i));
    }

    // C# uses the "async" keyword to mark methods that can pause at an "await" without blocking
    // their caller, which lets many of them be in progress at once.
    // While it is true that this particular functionality doesn't see much use in the AsyncLookup
    // method itself, it does allow us the use of another powerful component: the "await" keyword.
    static async Task AsyncLookup()
    {
        var urls = new List<String>{ "http://example.com/content/1", "http://example.com/blog/352", ... };

        var tasks = new List<Task>();


        // Before explaining the use of the await keyword, it would be prudent to explain
        // what a semaphore is: it essentially is a counter variable that stops execution
        // once its count reaches 0 - and we use it to prevent our script from swamping
        // our test site with HTTP requests.
        var semaphore = new SemaphoreSlim(initialCount: 5);

        foreach (var url in urls)
        {
            
            // Our first use of the "await" keyword anticipates that eventuality.
            // Calling WaitAsync decreases the semaphore's counter by one. However,
            // WaitAsync by itself does not stop script execution. The compiler needs
            // a way to recognize that a function call is going to take a while -
            // and the "await" keyword provides just the means to do so.
            await semaphore.WaitAsync();

            // Remember when we mentioned that many "async" methods can be in progress at once?
            // This also applies to anonymous functions: we will be using our urls loop
            // to fill our tasks list with a LOT of embed code verification checks -
            // and while it's true that these anonymous methods still make use of the "await" keyword to
            // pause execution, those "await" pauses are limited to the execution of the anonymous function itself.
            // The AsyncLookup method WILL keep running.
            tasks.Add(
                Task.Run(async () =>
                {
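                    try
                    {
                        var client = new HttpClient();
                        var resp = await client.GetAsync(url);

                        if (resp.IsSuccessStatusCode)
                        {
                            var content = await resp.Content.ReadAsStringAsync();
                            if (VerifyEmbedCode(content) == false)
                            {
                                // List<T> is not thread-safe, and these tasks may run on
                                // different threads, so we guard each Add with a lock.
                                lock (invalidUrls) { invalidUrls.Add(url); }
                            }
                        }
                        else
                        {
                            lock (errorUrls) { errorUrls.Add(url); }
                        }
                    }
                    catch (HttpRequestException)
                    {
                        // Network-level failures (DNS errors, refused connections) also count as errors.
                        lock (errorUrls) { errorUrls.Add(url); }
                    }
                    finally
                    {
                        semaphore.Release();
                    }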
                }));
        }

        // Since the execution of the anonymous functions is not necessarily finished when the loop ends,
        // we do want to wait till all of them are finished before allowing the Main method to continue.
        await Task.WhenAll(tasks);
    }
}
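The semaphore is the piece of the C# example that tends to look the most foreign, so it may help to see the idea outside of C#. What follows is a minimal sketch of a counting semaphore built on promises, written in JavaScript purely for illustration. The Semaphore class and the runAll helper are hypothetical names, not part of any standard library, and the sketch assumes every job is a function that returns a promise.

```javascript
// A minimal counting semaphore. Semaphore, acquire and release are
// illustrative names, not a built-in JavaScript API.
class Semaphore {
  constructor(count) {
    this.count = count; // free slots remaining
    this.waiting = [];  // resolvers for callers waiting on a slot
  }

  // Resolves right away if a slot is free, otherwise once one is released.
  acquire() {
    if (this.count > 0) {
      this.count--;
      return Promise.resolve();
    }
    return new Promise((resolve) => this.waiting.push(resolve));
  }

  // Hands the slot straight to the next waiter, or returns it to the pool.
  release() {
    const next = this.waiting.shift();
    if (next) {
      next();
    } else {
      this.count++;
    }
  }
}

// Mirrors the shape of AsyncLookup: take a slot, start the job,
// and give the slot back when the job settles.
async function runAll(jobs, limit) {
  const semaphore = new Semaphore(limit);
  const running = [];
  for (const job of jobs) {
    await semaphore.acquire(); // the loop pauses here when all slots are taken
    running.push(job().finally(() => semaphore.release()));
  }
  return Promise.all(running); // wait for the stragglers
}
```

As in the C# version, only the loop pauses while waiting for a free slot. The jobs that have already started keep running.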

JavaScript, on the other hand, suffers from too many choices - a casual check through npm's repository will yield an absurd number of modules that can implement this paradigm.

// The following example uses the popular "async" and "superagent" modules for node.js.
// This code should be browser-portable with most modern browsers through the use of a CDNJS import.
var async = require('async');
var request = require('superagent');

var urls = ["http://example.com/content/1", "http://example.com/blog/352", ... ];
var invalid_urls = [];
var error_urls = [];

// For reference purposes: a queue is a data structure in which entities are added at the
// "rear terminal position" (aka the back of the cashier's line) and removed at
// the "front terminal position" (aka the front of the cashier's line).
var q = async.queue(function (task, callback) {
    request.get(task.url).end(function(err, res){
        if (err || !res.ok) {
            error_urls.push(task.url);
            callback();
        } else {
            // superagent exposes the raw HTML as res.text; res.body only holds parsed formats such as JSON.
            if (verify_embed_code(res.text) === false) {
                invalid_urls.push(task.url);
            }
            callback();
        }
    });
}, 5);

// One particular detail about this implementation is the fact that
// it does not require us to check the status of its tasks.
// Instead, results processing is delegated to a "drain" event.
// When the queue fires off this event (aka: the cashier's line goes empty),
// we ask our code to display the final results.
// (Assigning q.drain works with async 2.x; 3.x expects q.drain(function () { ... }) instead.)
q.drain = function() {
    console.log("FINAL RESULTS");
    console.log("Invalid URLs:");
    console.log(invalid_urls);
    console.log("URLs with errors:");
    console.log(error_urls);
}

var tasks = urls.map(function(x) {
   return {url: x};
});

q.push(tasks);

Adaptability is Key

In the presence of so many alternatives, declaring the best means to implement asynchronicity in a post that only shows three of them would be... shortsighted, to put it lightly. We can, however, speak to the use case this post was based on. Since we could not guarantee that the users of the embed code verifier would have a development environment, we favored the browser availability that JavaScript provided. That advantage would become null and void if we were instead enhancing a Laravel site that needs to hit 5 web services at once from the server side, or developing utility tools like a console-based Site Health checker that QA can use without setting up a development environment.

It is because of this that choosing the correct approach requires three things: careful examination of a project's needs, a general understanding of the concept of asynchronicity and the ability to adapt that concept accordingly.

Francisco Rodriguez

Developer
