cio: cached HTTP requests for a smooth Jupyter experience!
August 21, 2018
This library provides a thin wrapper around the wreq library (a simple HTTP client library). It is meant to be used with Jupyter: all requests will be stored on disk and served from the cache subsequently, even if your kernel gets restarted. The cache lookups are near-instantaneous thanks to the amazing LevelDB library.

You can use cio just like you would wreq; instead of importing Network.Wreq, import CIO (which stands for Cached IO):

Then use the functions you are used to, like get:
The simplest way to build this library is to use Nix. To get started clone the cio repository (nmattia/cio), then run the following:
    $ nix-shell
    helpers:
    > cio_build
    > cio_ghci
    > cio_notebook
    > cio_readme_gen
The helper functions will respectively build cio, start a ghci session for cio, start a Jupyter notebook with cio loaded, and regenerate the README (this file is a Jupyter notebook!).
Three functions are provided on top of wreq:
* get :: String -> CIO Response performs a (cached) request to the given URL.
* getWith :: Options -> String -> CIO Response performs a (cached) request to the given URL using the provided Options.
* getAllWith :: Options -> String -> Producer CIO Response performs several (cached) requests by lazily following the Link headers (see for instance GitHub’s pagination mechanism).
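To make the Link-following idea a bit more concrete, here is a minimal sketch of pulling the "next" URL out of a Link header value. This is not cio's actual implementation; nextLink and splitOn are hypothetical helpers written for illustration, the example.com URLs are made up, and the comma split is deliberately naive (it would break on URLs that contain commas).

```haskell
import Data.List (isInfixOf)
import Data.Maybe (listToMaybe)

-- | Split a string on a separator character (hypothetical helper).
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn sep rest

-- | Extract the URL tagged rel="next" from a Link header value, if any.
-- Naive: assumes the URLs themselves contain no commas.
nextLink :: String -> Maybe String
nextLink header = listToMaybe
  [ takeWhile (/= '>') (drop 1 (dropWhile (/= '<') entry))
  | entry <- splitOn ',' header
  , "rel=\"next\"" `isInfixOf` entry
  ]

main :: IO ()
main = do
  let header = "<https://example.com/items?page=2>; rel=\"next\", "
            ++ "<https://example.com/items?page=5>; rel=\"last\""
  print (nextLink header)
  -- On the last page there is no "next" entry, so pagination stops.
  print (nextLink "<https://example.com/items?page=1>; rel=\"first\"")
```

A paginating client would keep requesting the URL returned by nextLink until it gets Nothing, which is what makes it possible to stop early once enough items have been seen.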
Let’s see what happens when a request is performed twice. First let’s write a function for timing the requests:
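The original timing code is not reproduced here, so the following is only a sketch of what such a helper could look like. The name timed is hypothetical, and using it inside CIO would assume that CIO has a MonadIO instance, which this text does not confirm. It relies only on base (GHC.Clock needs base 4.11 or later).

```haskell
import Control.Exception (evaluate)
import Control.Monad.IO.Class (MonadIO, liftIO)
import GHC.Clock (getMonotonicTime)

-- | Run an action and return its result together with the elapsed
-- wall-clock time in seconds.
timed :: MonadIO m => m a -> m (a, Double)
timed action = do
  start  <- liftIO getMonotonicTime
  result <- action
  end    <- liftIO getMonotonicTime
  pure (result, end - start)

main :: IO ()
main = do
  -- Stand-in for a slow request: force a large sum inside the timed action.
  (total, seconds) <- timed (evaluate (sum [1 .. 1000000 :: Int]))
  putStrLn ("result: " ++ show total ++ ", took " ++ show seconds ++ "s")
```

Timing the same action twice in a row is then just a matter of calling timed on it twice and comparing the two durations.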
Then we’ll generate a unique string which we’ll use as a dummy parameter in order to force cio to perform the request the first time, so that we can time it:

Finally we use getWith and set the dummy query parameter to the UUID we just generated and time the request:
That’s a pretty long time! When playing around with data in a Jupyter notebook, waiting around for requests to complete is a real productivity and creativity killer. Let’s see what cio can do for us:
Pretty nice! You might have noticed that the CIO results were printed out, as Show a => IO a would be in GHCi. As mentioned before, cio is optimized for Jupyter workflows, and as such all Show-able results will be printed directly to the notebook’s output. Lists of Show-ables will be pretty printed, which we’ll demonstrate by playing with cio’s other cool feature: lazily following page links.
In order to lazily fetch data, cio uses the conduit library. The getAllWith function is a Producer of Responses (sorry, a ConduitT i Response CIO ()) which are served from the cache when possible. Here we ask GitHub to give us only two results per page, and cio will iterate the pages until the five expected items have been fetched (five items at two per page means 3 pages):

    "jgm/pandoc"
    "koalaman/shellcheck"
    "PostgREST/postgrest"
    "purescript/purescript"
    "elm/compiler"
What if something goes wrong?
What’s the second hardest thing in computer science, besides naming and off-by-one errors? Cache invalidation, of course. For the cache’s sake, all your requests should be idempotent, but unfortunately that’s not always possible. cio doesn’t assume anything here; instead it lets you dirty cache entries yourself with either of these two functions:
* dirtyReq :: String -> CIO (), like get but instead of fetching the response dirties the entry in the cache.
* dirtyReqWith :: Options -> String -> CIO (), like getWith but instead of fetching the response dirties the entry in the cache.
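As a rough mental model of how a cached request and a dirtied entry interact, here is a small self-contained sketch. It is not how cio is implemented: the cache lives in memory in an IORef rather than on disk in LevelDB, "dirtying" is modelled as simply deleting the entry, and cachedGet, dirty and the fake fetch function are all hypothetical names. It uses Data.Map from the containers package, which ships with GHC.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import qualified Data.Map.Strict as Map

-- | Hypothetical in-memory cache keyed by URL (cio itself persists to disk).
type Cache = IORef (Map.Map String String)

-- | Return the cached response for a URL, calling the fetcher only on a miss.
cachedGet :: Cache -> (String -> IO String) -> String -> IO String
cachedGet cache fetch url = do
  entries <- readIORef cache
  case Map.lookup url entries of
    Just response -> pure response
    Nothing -> do
      response <- fetch url
      modifyIORef' cache (Map.insert url response)
      pure response

-- | Drop the cached entry so that the next lookup fetches again.
dirty :: Cache -> String -> IO ()
dirty cache url = modifyIORef' cache (Map.delete url)

main :: IO ()
main = do
  cache <- newIORef Map.empty
  calls <- newIORef (0 :: Int)
  -- Fake fetcher that counts how often it is really invoked.
  let fetch url = do
        modifyIORef' calls (+ 1)
        pure ("response for " ++ url)
  _ <- cachedGet cache fetch "https://example.com"  -- miss: fetches
  _ <- cachedGet cache fetch "https://example.com"  -- hit: served from cache
  dirty cache "https://example.com"
  _ <- cachedGet cache fetch "https://example.com"  -- miss again: refetches
  readIORef calls >>= print                         -- prints 2
```

The point of the counter is that only the first request and the one after dirtying reach the fetcher; everything else is answered by the cache.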
If things went really wrong, you can always wipe the cache entirely…
… but where’s the cache?
The cache is set globally (reminder: this is a Jupyter-optimized workflow):
If you need a different cache file you can either change the global cache file:
setCacheFile :: FilePath -> IO ()
or run your CIO code manually:
runCIOWith :: forall a. FilePath -> CIO a -> IO a
one more thing…
.. nope, that’s all! Enjoy!
Thank you ❤️