The Network is not Free


When attempt­ing to scale soft­ware sys­tems, that is, make it pos­si­ble for the sys­tem to take on a higher load or more users, peo­ple usu­ally take on of two strate­gies. ver­ti­cal scal­ing and hor­i­zon­tal scal­ing. Ver­ti­cal scal­ing is the prac­tice of buy­ing newer and faster com­put­ers on which to run the soft­ware while hor­i­zon­tal scal­ing is the prac­tice of sim­ply buy­ing more com­put­ers on which to spread the soft­ware across. Between the two, hor­i­zon­tal scal­ing is often inter­preted as hav­ing the bet­ter scal­ing poten­tial. Unfor­tu­nate­ly, this is not always the case and can lead to some naive attempts to solve very intractable problems.

The rea­son­ing peo­ple tend to have goes like this: When scal­ing up (ver­ti­cal­ly) the sys­tem is gen­er­ally bound by the max­i­mum power of cur­rent machin­ery. More­over, com­puter sys­tems become much more expen­sive and much more spe­cial­ized near the top of the com­put­ing power scale. The more pow­er­ful your sys­tem, the more it will cost to upgrade it and the closer you will get to a max­i­mum the­o­ret­i­cal capac­i­ty. Scal­ing out, on the other hand, the­o­ret­i­cally has no bounds. One can just keep adding more and more com­put­ers to a clus­ter and so long as the soft­ware is suf­fi­ciently par­al­lel in its design its capac­ity should increase smooth­ly. On top of that, low end servers are cheaper and cheaper to main­tain than high end spe­cial­ized equip­ment, so it fol­lows that one should get more bang for one’s buck when scal­ing hor­i­zon­tally ver­sus ver­ti­cal­ly.

Now, often­times this way of think­ing about the prob­lem is valid and hor­i­zon­tal scal­ing can be more ben­e­fi­cial than ver­ti­cal scal­ing. How­ev­er, the prob­lem isn’t always so sim­ple. Imag­ine you have a clus­ter of hun­dreds of low end servers, together process­ing mul­ti­ple ter­aflops of data. Now imag­ine that clus­ter is glued together on a sin­gle giga­bit eth­er­net LAN. What’s going to hap­pen? Not every pro­gram­mer real­izes this, but the speed of a typ­i­cal switched LAN is bound by the speed of the root switch. No mat­ter how you wire it, in order to pre­vent loops and mod­ern LAN will actively cut off redun­dant path­ways cre­at­ing bot­tle­necks. An appli­ca­tion run­ning on a giga­bit LAN, will in a worse case sce­nar­io, be bound to a giga­bit capac­i­ty, glob­ally across the net­work. For any suf­fi­ciently large, hor­i­zon­tally scaled sys­tem, a stan­dard LAN will be a major bot­tle­neck which you can’t over­come with par­al­lelism.

Now there are ways to get around this. There are spe­cial­ized switches you can buy to add capac­i­ty. You can add routers to your net­work to shape traf­fic on the clus­ters. You can even switch to things like Infini­Band which offers point-­to-­point con­nec­tion for your clus­ter nodes. Any of these and many other solu­tions exist to over­come net­work bot­tle­necks on clus­tered appli­ca­tions, how­ev­er, as soon as you adopt one you are scal­ing your net­work up and are los­ing some of the per­ceived advan­tages of a hor­i­zon­tally scaled sys­tem. In many cases it might just be cheaper and sim­pler to buy that more expen­sive com­put­er.

    Last update: 05/06/2014

    blog comments powered by Disqus