Making a Ridiculously Fast™ API Client

Design choices for a highly performant R package

Author: Josiah Parry

Published: June 6, 2024

I recently had the pleasure of publishing the R package {arcgisgeocode}. It is an R interface to the ArcGIS World Geocoder. You could say it is the “official” Esri geocoding R package.

To my knowledge, it is the fastest geocoding library available in the R ecosystem. The ArcGIS World Geocoder is also made available through {tidygeocoder} and {arcgeocoder}.

{arcgisgeocode} provides the full functionality of the World Geocoder, including bulk geocoding, which the other two packages do not offer. Both of them provide an interface to the /findAddressCandidates and /reverseGeocode API endpoints: the former performs single-address forward geocoding and the latter performs reverse geocoding.
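For orientation, here is a minimal sketch of a single forward-geocoding request with {httr2}, assuming the public World Geocoder URL as I recall it from the ArcGIS REST documentation:

library(httr2)

resp <- request("https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer") |>
  req_url_path_append("findAddressCandidates") |>
  req_url_query(SingleLine = "380 New York St, Redlands, CA", f = "json") |>
  req_perform()

candidates <- resp_body_json(resp)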

{arcgisgeocode} is ~17x faster when performing single-address geocoding and ~40x faster when performing reverse geocoding compared to the community counterparts. There are two primary reasons why.

The prolific Kyle Barron responded to one of my tweets a few months ago, pointing out that for a workload like this, the geocoding server itself is the bottleneck.

This statement is true in an absolute sense. But if the server is the only bottleneck, why does {arcgisgeocode} outperform two other packages calling the exact same API endpoints?

The reasons are two-fold.

JSON parsing is slow

The first is that both tidygeocoder and arcgeocoder rely on {jsonlite} to encode and parse JSON. I have said it many times before and I'll say it again: jsonlite was a revolutionary R package, but it has proven to be slow.

The way these API requests work is that we need to craft JSON from R objects, inject it into our API request, and then process the JSON that we get back from the server.

Encoding R objects as JSON text is slow. Reading that text and converting it back into R objects is also slow.
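To make that concrete, here is the round-trip pattern, sketched against a hypothetical endpoint. Every request pays the encoding cost on the way out and the parsing cost on the way back, on top of the network time.

library(httr2)

addresses <- data.frame(SingleLine = "380 New York St, Redlands, CA")

# R objects -> JSON text: pay the encoding cost
json_body <- jsonlite::toJSON(addresses)

resp <- request("https://example.com/geocode") |>
  req_body_raw(json_body, type = "application/json") |>
  req_perform()

# JSON text -> R objects: pay the parsing cost
result <- jsonlite::fromJSON(resp_body_string(resp))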

This is tangentially why Apache Arrow is so amazing. It uses the same memory layout regardless of where you are. If we were using Arrow arrays and the API received and sent Arrow IPC, we would be able to serialize and deserialize much faster!
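As a sketch of that hypothetical, here is what the round trip looks like with the {arrow} package when both sides speak Arrow IPC:

library(arrow)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# serialize to the Arrow IPC stream format as a raw vector
payload <- write_to_raw(df, format = "stream")

# deserialize without re-parsing any text
df2 <- read_ipc_stream(payload)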

Handling JSON with serde

serde_json is a Rust crate that handles serialization and deserialization of Rust structs. It takes the guesswork out of encoding and decoding JSON responses because it requires that we specify, up front, what the JSON will look like. {arcgisgeocode} uses serde_json to perform JSON serialization and deserialization.

For example, I have the following struct definition:

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
pub struct Address {
    objectid: i32,
    #[serde(rename = "singleLine")]
    single_line: Option<String>,
    address: Option<String>,
    address2: Option<String>,
    address3: Option<String>,
    neighborhood: Option<String>,
    city: Option<String>,
    subregion: Option<String>,
    region: Option<String>,
    postal: Option<String>,
    #[serde(rename = "postalExt")]
    postal_ext: Option<String>,
    #[serde(rename = "countryCode")]
    country_code: Option<String>,
    // EsriPoint is another serde-derived struct, defined elsewhere,
    // which models the point geometry returned by the service.
    location: Option<EsriPoint>,
}

These struct definitions plus serde_json, all coupled with the extendr library, mean that I can parse and create JSON extremely fast!

Using a request pool

{tidygeocoder} and {arcgeocoder} both use {httr}, whereas {arcgisgeocode} uses {httr2}. There may be speed-ups inherent in the switch itself.

But the primary difference is that in {arcgisgeocode} we use req_perform_parallel() with a small connection pool. This allows multiple workers to handle requests concurrently, which means less time is spent waiting for each request to be handled and then processed by our R code.

Note that with great power comes great responsibility. Using req_perform_parallel() without care may lead to accidentally launching a denial-of-service attack on the server. For that reason we use a conservative number of workers.
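Here is a sketch of that pattern against a hypothetical endpoint. The pool size is the knob to keep conservative; note that newer httr2 releases control concurrency through a max_active argument rather than a curl pool.

library(httr2)

# hypothetical: 50 single-line addresses, sent in chunks of 10 per request
addresses <- data.frame(SingleLine = paste(state.name, "USA"))
chunks <- split(addresses, (seq_len(nrow(addresses)) - 1) %/% 10)

reqs <- lapply(chunks, function(chunk) {
  request("https://example.com/geocode") |>
    req_body_json(chunk)
})

# a small, conservative pool so we don't hammer the server
resps <- req_perform_parallel(reqs, pool = curl::new_pool(host_con = 6))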

Closing notes

While Kyle is correct in the absolute sense, in that the performance bottleneck does come down to the geocoding service, it is also true that the clients we write to call these services can add overhead of their own.

To improve performance, I would recommend identifying the slowest part and making it faster. In general, when it comes to API clients, this is almost always the (de)serialization and the request handling.
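If you are not sure which part is slowest, profile a representative call before optimizing anything. A minimal sketch with {profvis}, where my_geocode() is a hypothetical stand-in for your own client code:

library(profvis)

profvis({
  # one full round trip through your client
  res <- my_geocode("380 New York St, Redlands, CA")
})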

I don’t expect everyone to learn how to write Rust. But you can make informed decisions about what libraries you are using.

Learn how to parse JSON with Rust.

If you are using jsonlite and you care about performance, stop. I strongly recommend RcppSimdJson (for parsing only), {yyjsonr} (for both parsing and encoding), and jsonify, in that order. You will find your code to be much faster.
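The swap is usually a one-liner. A sketch with a toy JSON string, using each package's exported functions to the best of my knowledge:

json_string <- '{"address": "380 New York St", "city": "Redlands"}'

# before: jsonlite
parsed <- jsonlite::fromJSON(json_string)

# after: RcppSimdJson, parsing only
parsed <- RcppSimdJson::fparse(json_string)

# or yyjsonr, which handles both directions
parsed  <- yyjsonr::read_json_str(json_string)
encoded <- yyjsonr::write_json_str(parsed)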

Next, if you are making multiple requests to the same endpoint, consider using a small worker pool via req_perform_parallel() and watch the speed improve.