Thursday, September 24, 2015

IU storms ICFP 2015 -- check out our talks

I feel that all our students and postdocs did a great job at ICFP this year.  Now, several weeks have passed since Vancouver, and all the videos are available online for your viewing enjoyment.

My group works on parallel functional programming, and here's what we presented at the conference this year:

And outside of the parallelism topic area IU had some other great talks:
Finally, one nice surprise at the conference was running into a student we'd never met before who did a poster on comparing Accelerate and Obsidian -- -- two GPU DSLs developed partially and wholly by our postdocs Joel and Trevor.  Congrats, guys, on getting users and attention!  Even better, the student was one of the winners of the poster competition and thus got to present a talk at the main ICFP conference.

Tuesday, September 16, 2014

Apple Time machine: accomplishment unlocked -- Decreased trust!


The saying is true: you only know backup works when you've restored.

I got a new laptop and restored from time machine backup on an external drive attached to my airport extreme at home. My time machine backups had been going fine for years. It took a long time, and the network got disconnected so I had to restart it once, but eventually claimed to finish.

And there's the rub it *claimed* to restore, but it only partially restored. It's one thing to error, but why is it Apple UI's lie to me? (For example, my iPhone's iMessage has shown my message as delivered, whereas my wife's showed no sign of the message.)

For anyone else who runs into a failed Time Machine restore I'd like to write down what I've found as I manually attempt to synchronize critical folders with the backup image.
  1. First, my home directory appears complete. The incomplete restorations all have to do with top-level directories like "/foo" and "/bar" that are either missing entirely or have the following problems.
  2. Some subdirectories are just missing. My Aperture photo library had most years but was totally missing 2012 and 2013, as well as other misc files.
  3. Some files were present but zero length.
  4. Finally, the weird thing is that some files were actually corrupted. I found photos that were smaller size in the restore than in the latest backup image on the time machine disk. Indeed, these jpegs, could still be viewed but would be missing part of the image. 
The last makes me especially nervous because it's NOT consistent with a (sequential) transfer of files and directories that got interrupted at a single point in time. Of course, I was guilty of restarting the restore process as I'd mentioned at the top. So maybe the bottom line here is that restarting it without manually wiping the HD in between is not safe. (It's not idempotent.)

Right now I'm using Unison to do the repairs, but for greatest peace of mind I should probably wipe it and restore again... The problem is that I used this machine for a while before noticing that the time machine restore was a partial failure!!

Monday, January 6, 2014

Five flavors of the parallel AND operation

The issue of parallel, logical and operations was recently discussed at length on my student, Lindsey Kuper’s, blog.
This is a follow-up to that post, with the purpose of laying out a taxonomy of conjunction operations that might occur in a parallel language. (The same discussion applies equally to disjunction.) Below I classify these operations into five levels, from basic to full featured. A good context to imagine as we go through this is a computationally intensive search problem with lots of nested parallel conjunctions and disjunctions that expose opportunities for parallelism.
Before we get started, some ground rules. We need the operation to take some kind of deferred computations or thunks. Further, we assume these can be turned into futures for exposing parallelism. In our case, we use Haskell and the Par monad so this means that we expect something like this:
result <- span=""> asyncAnd comp1 comp2 
where comp1 and comp2 are monadic computations that, once executed, return a Bool value. But the distinctions below could just as well apply to a plain, non-monadic and operation in a lazy, pure language — for example, Haskell’s (&&) operator.

Level One

  1. Strict: Evaluate both comp1 and comp2 in parallel, and then return the logical and of the resulting booleans.
This "level one" and is parallel, but is also less work-efficient than the sequential left-to-right short-circuiting && operation found in most programming languages. It always evaluates both computations fully. Thus, our next upgrade is short-circuiting.

Level Two

Here it helps to generalize a bit and consider an and that takes a list of an arbitrary number of input computations:
  1. Short-circuiting: for a list of input computations, evaluate them using a machine-specific degree of parallelism, but also stop launching new computations whenever a False is encountered in an already-completed computation.
This is an intermediate category which captures operations which have some degree of parallelism but are asymmetric, much like left-to-right sequential (&&). That is, some input computations have the ability to short-circuit (deschedule) others, but the relationship is not fair; it’s not all-to-all.
You can build an N-way and meeting this limited criteria using (1) a level-one parallel and together with (2) a short-circuiting sequential and. The two operators could be combined by a heuristic strategy for determining how many parallel tasks are appropriate, and then "bottoming out" to the sequential version. Still, as long as the sequential and is used, inputs are handled asymmetrically; and as long as a level-one parallel and is used we will run unnecessary computations that can’t be descheduled or canceled. Indeed, this is a real tension: work-efficiency vs. parallelism.

Level Three

  1. Short-circuiting, symmetric: all input computations are potentially evaluated in parallel, and any one of them returning False quickly can potentially cause the result of the entire and operation to become available immediately.
Ok, so this helps us get early answers. But it doesn’t impose any requirements as to the work-efficiency of the entire operation. That is, it doesn’t mandate that once an N-way and operation returns False that remaining computations will be cancelled in bounded time. Thus our fourth level concerns cancellation.

Level Four

  1. Symmetric with cancellation: When an early answer is given, all other threads/tasks are told to stop computing because their answers are no longer needed. There exists a bound on how much computation lingering child-computations of the and may do after this point in time.
Anecdotally, I think level-four ands are pretty rare in practical parallel languages and libraries, simply because [transitive] task cancellation is not a common feature of modern parallel runtime systems. I would be happy to be wrong about this! So please share any links to standard library documentation that mention level-four parallel ands out there in the wild.

Level Five

Our final stop is a feature that is not strictly a property of the conjunction operator itself, but rather a complementary feature we find useful in some of our applications. It concerns the ability of canceled computations to do useful work, by storing partial results that are computed before cancellation occurs.
In an unrestricted parallel language, canceled computations may have arbitrary side effects, which, in general, race with the cancellation message itself. In a deterministic language such side effects cannot be allowed; rather, it is usually safe to cancel only "pure" computations. There is one exception, however, a memoized pure function provides a "side effect free" way of storing partial work.
  1. Cancellation + memoization: Canceled threads never provide a final result, but they can contribute to other computations by adding to their memo-tables.
One example of an algorithm which benefits from this combination is the subtype-checking algorithm for equi-recursive types in Chapter 21 of Types and Programming Languages. Verifying a subtype relation between two types requires [parallelizable] conjunctions and disjunctions (e.g. when checking product and sum types), but at the same time memoization can help performance because when typechecking real code, the same subtype checks are made repeatedly.
Our recent draft paper on "taming the parallel effect zoo", discusses the combination of cancellation and memoization further.
Is there a level six? Is there anything else we might ask of our parallel logical operators?

Final thoughts: Parallel conjunction/disjunction in GHC

GHC with "sparks" (futures) can do level-two ands, for example by sparking the arguments to (&&). But short-circuiting must follow a predetermined order. Further, cancellation is not possible with sparks, because sparked computations cannot be removed from the spark pool.
If one could use internal GHC functions to determine whether an object is a THUNK or CONSTR at runtime, it would then be possible to do more flexible short-circuiting, for example, sparking some elements of a [Bool] while executing others on the current thread, and inbetween executions, polling for completion of the already-sparked futures. Yet this road seems unappealing for a number of reasons.
Alternatively, rather than using the spark mechanism, it is certainly possible to use IO and MVars to implement parallel-and in Haskell, as Conal Elliot does in this post, using an unamb combinator. But forkIO is vastly higher overhead than sparks, with our own Par monad’s futures falling somewhere inbetween. Moreover, our desire to include transitive cancellation and memoization features motivates us to build these operators on a custom Par monad rather than with IO or sparks directly.

Wednesday, August 21, 2013

Zero to GHC development in an Ubuntu VM in 19 lines of Bash

If you're like me, you want to contribute to GHC and even have a few small tasks in mind, but since you're an occasional contributor you don't pull, build, and validate it every day. Thus when you come back to it periodically, getting it validating again can be a chore. For example, most recently it failed on my RHEL 6 work machine and Mac OS 10.8 laptop, but worked on my Ubuntu machine.

Given the complexity and fragility of the process, it would be good for us to have more standardized (hermetic) environments so that people can (1) get started easily and (2) reproduce eachother's build outcomes.

The idea of using a VM image for this purpose was discussed back in 2009. But it seems not to have happened. To get things moving in this direction, I came up with a script that bootstraps a working GHC development repository inside a VirtualBox VM running Ubuntu 12.04 LTS (either server or desktop worked for me).

The following Bash script is all there is to it:

#!/bin/bash
set -e
set -x
sudo apt-get install -y git gcc make autoconf libtool zlib1g-dev libgl1-mesa-dev libglu1-mesa libglu1-mesa-dev freeglut3-dev libncurses-dev libgmp-dev
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu/
sudo ln -s /usr/lib/x86_64-linux-gnu/libgmp.so.10 /usr/lib/libgmp.so.3
sudo ln -s /usr/lib/x86_64-linux-gnu/libgmp.so.10 /usr/lib/libgmp.so
wget http://www.haskell.org/ghc/dist/7.6.3/ghc-7.6.3-x86_64-unknown-linux.tar.bz2
tar xjvf ghc-7.6.3-x86_64-unknown-linux.tar.bz2
cd ghc-7.6.3
./configure && sudo make install
cd ..
wget http://lambda.haskell.org/platform/download/2013.2.0.0/haskell-platform-2013.2.0.0.tar.gz
tar xzvf haskell-platform-2013.2.0.0.tar.gz
cd haskell-platform-2013.2.0.0
./configure && make && sudo make install
sudo chmod ugo+rX -R /usr/local
cd $HOME
git clone git://github.com/ghc/ghc ghc-working
cd ghc-working
./sync-all --testsuite -r git://github.com/ghc get
time sh validate

Once I figure out how to shrink down an Ubuntu VM to a reasonable size I'll post a VM for download, or publish an Amazon Machine Image (AMI). Or, if someone knows Chef and would like to help me convert the above to a Chef recipe, that would be better than a bash script.

Sunday, August 18, 2013

Revisiting Google-Drive a year later: Still not ready for intensive use. Benchmark against AeroFS and BitTorrent Sync.



Last year, I eagerly anticipated the release Google Drive.  I had complained a lot about my experiences with other synchronization software, and fully expected Google to knock this one out of the park.  It's an application that should really emphasize Google's strengths: systems software, storage, scaling distributed systems.

In Spring 2012, I made the leap and moved all my personal cloud storage over to Google, but I ran into too many technical problems and gave up.  (Some of these problems I'll detail below.)  Now, a year later, I wanted to check in again and see how things are improving.

I'm afraid I have to report that there are still these major deal breakers for me, and perhaps for other "power users":
  • Scalability isn't there.  For example, if you try the simple benchmark of adding a folder with thousands of very small files, you'll see that maximum throughput is a few files per second.
  • Getting stuck ("Unable to sync") seems common.
  • Symlinks are still ignored.
It's surprising to me when solutions for syncing do not aggregate meta-data for changed files before communicating over the wire (e.g. like rsync).  The Google drive API seems to encourage per-file remote operations.  I heard there is some support for batching, but I'm not sure if that is specific to some Google APIs or generic across them.  It would sure help here.

Of course, these services all work great for storing small numbers of medium sized files.  Maybe there's no desire or need to support scaling and more intensive use?  Yet, I think even non-techie users may end up with large numbers of small files even if they don't created them directly (e.g. in my Aperture library).  For myself, I ultimately want something closer to a distributed file system.   For example, I like to edit files within a git checkout locally on my laptop and have them synced to a server where I run the code.  This requires three things:
  • Cross platform -- Linux/Mac in my case.
  • Low latency -- file edits should appear quickly on the other side.
  • Equally good treatment of large numbers of small files and small numbers of large files.
Alas, in spite of the massive increase in the number of cloud-based directory synchronization options, none seem to meet all three of these criteria.  Still.  I'll go through a list of disqualifying points at the end of this post.

The end result is that I still use the same solution I did ten years ago.  I run "unison -repeat 2" for linking working copies on different machines.   The only thing missing is convenient file-system watching via inotify (i.e. OS-driven notification of changes rather than scanning).  This is the killer feature that many of the newer cloud offerings have compared to unison, and it is the key to low-latency, as well as the always-on usage model Dropbox-style systems employ.  Unison has rudimentary support for integrating with a file-system watcher and I've sporadically had that functionality working, but it was fragile and hard to set up last time I tried it.


Wednesday, May 29, 2013

Towards Cloud-based, Crowd-sourced benchmarking? (HSBencher)

Producing benchmarking plots using text files and gnuplot has always left me disappointed. We never get things as “push button” as we would like. Further, there’s always too much spit and chewing gum involved, and the marginal cost of asking a new question of the data is too high.

Thus I’m pleased to announce “HSBencher”, a package I just released on Hackage, here (and github here), which tackles a small part of this problem. Namely, it handles running benchmarks while varying parameters and then uploading the resulting benchmark data to Google Fusion Tables.

The theory is that it should then be easier to: (1) access the data from any program, anywhere; (2) pass around live graphs that are just a URL; and (3) write scripts to analyze the data for trends. In this post I’ll describe how to use the library, and then I’ll speculate a bit about how it could be used to harvest benchmarking data from more machines via “crowd sourcing”.

But first, a few notes about where HSBencher fits in vis-a-vis other benchmarking frameworks.
  • HSBencher runs coarse granularity benchmarks, where each benchmark run involves launching a new process. This is appropriate for parallelism benchmarks which need to run for a while, and also for inter-language comparisons.
  • By contrast, the popular Criterion package is great for fine-grained Haskell benchmarks.
  • HSBencher is language-agnostic. What it knows about GHC, cabal, and make is encapsulated in plugins called BuildMethods, with the built-in ones here.
  • In contrast, there are plenty of language-specific benchmarking tools. For example, if you are writing JVM based code, you might useJMeter, and then you could even get some support from Jenkins CI. Likewise, for Haskell, Fibon offers some advantages in that it understands GHC and GHC configuration options.
  • Many large benchmark suites just come with simple build systems, such as a series of Makefiles. These can be wrapped to work with HSBencher.


HSBencher Installation

First, getting the library should be a one-liner, but requires GHC >= 7.4:
  cabal install hsbencher -ffusion
If you have problems with the fusion table flag turned on, try using the latest version of handa-gdata from here or here.  Also wipe out your ~/.cabal/googleAuthTokens directory.

(UPDATE: at the moment HSBencher needs handa-gdata-0.6.2, which is pre-release so use the HEAD version.)



HSBencher Usage Model


HSBencher is configured the same way as xmonad, by writing an executable Haskell file and importing the library.

  import HSBencher
  main = defaultMainWithBechmarks
        [ Benchmark "bench1/bench1.cabal" ["1000"] $
          Or [ Set NoMeaning (RuntimeParam "+RTS -qa -RTS")
             , Set NoMeaning (RuntimeEnv "HELLO" "yes") ] ]
The resulting executable or script will also take a number of command line arguments, but the main thing is defining the list of benchmarks, and the configuration space associated with each benchmark.
The configuration spaces are created with a nested series of conjunctions and disjunctions (And and Or constructors). And is used for setting multiple options simultaneously, and Or creates alternatives which will be exhaustively explored. Creating a series of conjoined Ors of course creates a combinatorial explosion of possible configurations to explore.

The Set form takes two things. The latter argument contains strings that are opaque to the benchmarking system. These are runtime/compile-time command-line arguments as well as environment variable settings. Together these strings implement the desired behavior (see the ParamSetting type). But it’s often useful for the benchmarking framework to know what a setting is actually doing in terms of commonly defined concepts like “setting the number of threads”. Hence the first parameter to Set is a machine-readable encoding of this meaning. For example, a meaning of Threads 32 attached to a setting will let the benchmark framework know how to file away the benchamrk result in the Google Fusion Table.

You can run the above script (without Fusion Table upload) using the following:
  runghc minimal.hs
That will leave the output data set in a text file (results_UNAME.dat) with the full log in bench_UNAME.log. For example:
 # TestName Variant NumThreads   MinTime MedianTime MaxTime  Productivity1 Productivity2 Productivity3
 #
 # Tue May 28 23:41:55 EDT 2013
 # Darwin RN-rMBP.local 12.3.0 Darwin Kernel Version 12.3.0: Sun Jan  6 22:37:10 PST 2013; root:xnu-2050.22.13~1/RELEASE_X86_64 x86_64
 # Ran by: rrnewton
 # Determined machine to have 8 hardware threads.
 #
 # Running each test for 1 trial(s).
 # Git_Branch: master
 # Git_Hash: f6e0677353680f4c6670316c9c91ce729f19ed0e
 # Git_Depth: 244
 #  Path registry: fromList []
 bench1          1000                 none     0       0.3 0.3 0.3
 bench1          1000                 none     0       0.3 0.3 0.3
We’ll return to this below and see the fuller schema used by the Fusion table output.

 Setting up Fusion Tables

Because you installed with -ffusion above, you already have the Fusion Tables library.  Unfortunately, the Haskell bindings for Google fusion tables are quite limited at the moment, and not that robust. I’m afraid right now [2013.05.29] you need to create at least one table manually before following the instructions below.

Once you’re done with that, go ahead and go to the Google API Console and click Create project.


Next you’ll want to flip the switch to turn on “Fusion Tables API”. It will look like this:



Then “Create an OAuth 2.0 client ID…”:



And select “Installed Application”:


Finally you’ll see something like this on the “API Access” tab, which lists the Client ID and “secret”:


You can provide those to your benchmarker on the command line, but it’s often easier to stick them in environment variables like this:
export HSBENCHER_GOOGLE_CLIENTID=916481248329.apps.googleusercontent.com
export HSBENCHER_GOOGLE_CLIENTSECRET=e2DVfwqa8lYuTu-1oKg42iKN
After that, you can run the benchmark script again:
runghc minimal.hs --fusion-upload --name=minimal
When that finishes, if you go to drive.google.com (and filter for Tables) you should see a new table called “minimal”. And will see a similar record as above:



The rest of the columns currently included are:
MINTIME_PRODUCTIVITY
MEDIANTIME_PRODUCTIVITY
MAXTIME_PRODUCTIVITY
ALLTIMES
TRIALS
COMPILER
COMPILE_FLAGS
RUNTIME_FLAGS
ENV_VARS
BENCH_VERSION
BENCH_FILE
UNAME
PROCESSOR
TOPOLOGY
GIT_BRANCH
GIT_HASH
GIT_DEPTH
WHO
ETC_ISSUE
LSPCI
FULL_LOG
If you play around with Fusion Tables, you’ll find that you can do basic plotting straight off this data, for example, threads vs. time:

(Beyond 12 is using hyperthreading on this machine.) Filtering support is great, group-by or other aggregation seems absent currently. Thus you’ll ultimately want to have other scripts that ingest the data, analyze it, and republish it in a summarized form.

 Benchmark format and BuildMethods

The Benchmark data structure in the example above specifies a .cabal file as the target, and 1000 as the command line arguments to the resulting executable. (Note that the RunTimeParams also ultimately get passed as command line arguments to the executable.)

The convention in HSBencher is that a path uniquely identifies an individual benchmark; properties of that file (or the surrounding files) determine which build methods applies. For .cabal or .hs files, the extensions activates the appropriate build method.

The interface for build methods is defined here. and the full code of the ghc build method is show below:
ghcMethod = BuildMethod
  { methodName = "ghc"
  , canBuild = WithExtension ".hs"
  , concurrentBuild = True 
  , setThreads = Just $ \ n -> [ CompileParam "-threaded -rtsopts"
                               , RuntimeParam ("+RTS -N"++ show n++" -RTS")]
  , clean = \ pathMap bldid target -> do
     let buildD = "buildoutput_" ++ bldid
     liftIO$ do b <- span=""> doesDirectoryExist buildD
                when b$ removeDirectoryRecursive buildD
     return ()
  , compile = \ pathMap bldid flags target -> do
     let dir  = takeDirectory target
         file = takeBaseName target
         suffix = "_"++bldid
         ghcPath = M.findWithDefault "ghc" "ghc" pathMap
     log$ tag++" Building target with GHC method: "++show target  
     inDirectory dir $ do
       let buildD = "buildoutput_" ++ bldid
       liftIO$ createDirectoryIfMissing True buildD
       let dest = buildD </> file ++ suffix
       runSuccessful " [ghc] " $
         printf "%s %s -outputdir ./%s -o %s %s"
           ghcPath file buildD dest (unwords flags)
       return (StandAloneBinary$ dir </> dest)
  }
 where
  tag = " [ghcMethod] "
That’s it!

In this case we use a custom build directory triggered off the build ID so that concurrent builds of different configurations are possible. More details on the conventions that benchmarks should follow can be found in the README.

 Crowd sourcing?

Performance regression testing, say, for a compiler like GHC, is no easy thing. The surface area is large (libraries, language features), as is the variety of hardware architectures it will run on in the wild. One idea is that if we could make it absolutely trivial for users to run the benchmark suite, then we could aggregate that data in the cloud, and this could provide a variety of combinations of OS, hardware, and libraries.

Unfortunately, the Google Drive permissions model doesn’t support this well right now. Each Fusion Table is either readable or editable by a particular usable. There’s no way to switch one to “append-only” or even make it world-writable, even if one wanted to live dangerously.

Nevertheless, it would be possible to follow a “pull request” kind of model where a particular HSBencher-based client uploads to a personal Fusion table, and then sends a request to a daemon that ingests the data into the global, Aggregated fusion table for that project. We’ll see.

Future work and feedback

For our own use, I’ve converted the monad-par benchmarks, turning them into this script. I have my eye on the GHC nofib tests as another target. If you get a chance to try HSBencher, let me know how it works for you. It will need many more features, I’m sure, and I’d like to hear which your benchmark suite needs.

Friday, April 5, 2013

Don't mux parallel shell output! Introducing Hydra-print.

In this age of multicore-everything (your laptop, your phone, your toaster) there’s no excuse for anything not to run in parallel, right? So what about text-based shell programs? On the one hand, it has always been easy to launch things in parallel, say with bash’s support for backgrounding jobs, or using make -j.


But, on the other hand, I’ve never been happy with the output of make -j. Interleaving lines of output from parallel subprocesses isn’t very user friendly. (Indeed, cabal -j used to do the same thing, and recently it has switched to simply hiding the output instead.) Thus I’ve written a little utility than dynamically splits the terminal window when the compute job enters a parallel region (video below).

It’s called “hydra-print” and is both a Haskell library and a pair of command-line programs (hydra-view and hydra-head). You can find it at its Github page here and the released package is on Hackage here.

After running cabal install hydra-print, the simplest way to use it is to bring up a server like this:
hydra-view
And then arrange for a script to pipe the output of multiple commands to it in parallel:
(command1 | hydra-head) &
(command2 | hydra-head) &
It’s possible to compose command’s output sequentially as well, by asking hydra-head to return a named pipe, and deleting it to signal that the stream is finished.
p=`hydra-head -p`
command1 >> $p
command2 >> $p
rm -f $p
Further, its often convenient to use hydra-view to launch another command, using it to seed the view with output:
hydra-view -- make -j
The repository contains an example Makefile here, which shows how to adapt a Makefile to use hydra-print without any direct support builtin to Make. Of course, built-in support is better, and I hope to come up with a patch for cabal-install to use hydra-print soon.

Using the Haskell Library

The library is based on the recent io-streams package, and makes it very easy to send several kinds of Haskell data streams into hydra-print.

import qualified System.IO.Streams as S
import UI.HydraPrint
The two main ways to use the library are to (1) bring up a static interface with a given number of windows, or (2) bring up a dynamic view monitoring a changing collectionn of streams.

The former can be called with hydraPrintStatic defaultHydraConf (zip names strms). It’s type signature is:
hydraPrintStatic :: HydraConf -> [(String, InputStream ByteString)] -> IO ()

As you can see, it takes a fixed number of streams (in a list). A dynamic set of streams is represented as – what else – a stream of streams!
hydraPrint :: HydraConf -> InputStream (String, InputStream ByteString) -> IO ()

In a parallel program one might use a Control.Concurrent.Chan to aggregate streams and send them to hydraPrint:

strmSrc <- span=""> chanToInput strmChan
hydraPrint defaultHydraConf strmSrc

Finally, I think the most obvious improvement to make to this little utility would be to provide the option of a tmux interface instead. This would provide much more functionality and should be pretty straightforward, because tmux itself is quite scriptable. Emacs [client] would be another option.

Enjoy!