Moving, faster lines and page flipping

  Journal:    Dr. Dobb's Journal  Dec 1992 v17 n12 p143(5)
  -------------------------------------------------------------------------
  Title:     Moving, faster lines, and page flipping. (Graphics
             Programming) (Column)
  Author:    Abrash, Michael

  Abstract:  Graphics programming can be optimized, however if the program
             is running well it may be better not to try and change it.
             Serge Mathieu of Concepteva Inc has a useful twist on
             animation.  Serge's system is a combination of page flipping
             and dirty-rectangle animation: set the start address to
             display page (), draw to page 1, set the start address to
             display page 1, wait for leading edge of vertical sync, copy
             from page 1 to page 0, set the address to display page 0, now
             identical to page 1, wait for leading edge of vertical sync.
             This method makes it necessary to maintain only page and
             eliminates the complications involved with maintaining two
             separate pages.
  -------------------------------------------------------------------------
  Full Text:

  As I write this, the wife, the kid, and I are in the throes of yet
  another lightning-quick transcontinental move, this time to Redmond to
  work for You Know Who.  Moving is never fun, but what makes it worse for
  us is the pets.  Getting them into kennels and to the airport is hard;
  there's always the possibility that they might not be allowed to fly
  because of the weather; and, worst of all, they might not make it.
  Animals don't usually end up injured or dead, but it does happen.

  In a (not notably successful) effort to cheer me up about the prospect of
  shipping my animals, a friend told me the following story, which he
  swears actually happened to a friend of his.  I don't know--to me, it has
  the sound of an urban legend, which is to say it makes a good story, but
  you can never track down the person it really happened to; it's always a
  friend of a friend.  But maybe it is true, and anyway, it's a good story.

  This friend of a friend (henceforth referred to as FOF), worked in an
  air-freight terminal.  Consequently, he handled a lot of animals, which
  was fine by him, because he liked animals; in fact, he had quite a few
  cats at home.

  You can imagine his dismay when, one day, he took a kennel off the plane
  to find that the cat it carried was quite thoroughly dead.  (No, it
  wasn't resting; this cat was bloody deceased.)

  FOF knew how upset the owner would be, and came up with a plan to make
  everything better.  At home, he had a cat of the same size, shape, and
  markings.  He would substitute that cat, and since all cats treat all
  humans with equal disdain, the owner would never know the difference, and
  would never suffer the trauma of the loss of her cat.  So FOF drove home,
  got his cat, put it in the kennel, and waited for the owner to show
  up--at which point, she took one look at the kennel and said, "This isn't
  my cat.  My cat is dead."

  As it turned out, she had shipped her recently deceased feline home to be
  buried.  History does not record how FOF dug himself out of this one.

  Okay, but what's the point?  The point is, if it isn't broken, don't fix
  it.  And if it is broken, maybe that's all right, too.  Which brings us,
  neat as a pin, to the topic of drawing lines in a serious hurry.

  Fast Run-length Slice Line Drawing

  Last month, we examined the principles of run-length slice line drawing,
  which draws lines a run rather than a pixel at a time, a run being a
  series of pixels along the major (longer) axis.  I concluded by promising
  a fast assembler version for this month.  Listing One (page 159) is the
  promised code, in a form that's plug-compatible with the C code from last
  month.

  Your first question is likely to be the following:  Just how fast is
  Listing One? Is it optimized to the hilt, or just pretty fast?  The quick
  answer is:  It's fast.  Listing One draws lines at a rate of nearly 1
  million pixels per second on my 486/33, and is capable of still faster
  drawing, as I'll discuss shortly.  (The heavily optimized AutoCAD
  line-drawing code that I mentioned last month drew 150,000 pixels per
  second on an EGA in a 386/16, and I thought I had died and gone to
  Heaven.  Such is progress.)  The full answer is a more complicated one,
  and ties in to the principle that if it is broken, maybe that's okay--and
  to the principle of looking before you leap, also known as profiling
  before you optimize.

  When I went to speed up run-length slice lines, I initially manually
  converted the C code from last month into assembler.  Then I streamlined
  the register usage and used REP STOS wherever possible.  Listing One is
  that code.  At that point, line drawing was surely faster, although I
  didn't know exactly how much faster.  Equally surely, there were
  significant optimizations yet to be made, and I was itching to get on to
  them, for they were a lot more interesting than a basic C-to-assembler
  port.

  Ego intervened at this point, however.  I wanted to know how much of a
  speed-up I had already gotten, so I timed the performance of the C code
  vs.  the assembler code.  To my horror, I found that I had not gotten
  even a two-times improvement!  I couldn't understand how that could
  be--the C code was decidedly unoptimized--until I hit on the idea of
  measuring the maximum memory speed of the VGA to which I was drawing.

  Bingo.  The Paradise VGA in my 486/33 is fast for a single display-memory
  write, because it buffers the data, lets the CPU go on its merry way, and
  finishes the write when display memory is ready.  However, the maximum
  rate at which data can be written to the adapter turns out to be no more
  than one byte every microsecond.  Put another way, you can only write one
  byte to this adapter every 33 clock cycles on a 486/33.  Therefore, no
  matter how fast I made the line-drawing code, it could never draw more
  than 1,000,000 pixels per second in 256-color mode in my system.  The C
  code was already drawing at about half that rate, so the potential
  speed-up for the assembler code was limited to a maximum of two times,
  which is pretty close to what Listing One did, in fact, achieve.  When I
  compared the C and assembler implementations drawing to normal system
  (nondisplay) memory, I found that the assembler code was actually four
  times as fast as the C code.

  In fact, Listing One draws lines at about 92 percent of the maximum
  possible rate in my system--that is, it draws very nearly as fast as the
  VGA hardware will allow.  All the optimization in the world would get me
  less than 10 percent faster line drawing--and that only if I eliminated
  all overhead, an unlikely proposition at best.  The code isn't fully
  optimized, but so what?

  Now it's true that faster line-drawing code would likely be more
  beneficial on faster VGAs, especially local-bus VGAs, and in slower
  systems.  For that reason, I'll list a variety of potential optimizations
  to Listing One.  On the other hand, it's also true that Listing One is
  capable of drawing lines at a rate of 2.2 million pixels per second on a
  486/33, given fast enough VGA memory, so it should be able to drive
  almost any non-local-bus VGA at nearly full speed.  In short, Listing One
  is very fast, and, in many systems, further optimization is basically a
  waste of time.

  Profile before you optimize.

  Further Optimizations

  Following is a quick tour of some of the many possible further
  optimizations to Listing One.

  The run-handling loops could be unrolled more than the current two times.
   However, bear in mind that a two-times unrolling gets more than half the
  maximum unrolling benefit with less overhead than a more heavily unrolled
  loop.

  BX could be freed up in the Y-major code by breaking out separate loops
  for X advances of 1 and -1.  DX could be freed up by using AH as the
  counter for the run loops, although this would limit the maximum line
  length that could be handled.  The freed registers could be used to keep
  more of the whole-step and error variables in registers.  Alternatively,
  the freed registers could be used to implement more esoteric approaches
  like unrolling the Y-major inner loop; such unrolling could take
  advantage of the knowledge that only two run lengths are possible for any
  given line.  Strangely enough, on the 486 it might also be worth
  unrolling the X-major inner loop, which consists of REP STOSB, because of
  the slow start-up time of REP relative to the speed of branching on that
  processor.

  Special code could be implemented for lines with integral slopes, because
  all runs are exactly the same length in such lines.  Also, the X-major
  code could try to write an aligned word at a time to display memory
  whenever possible; this would improve the maximum possible performance on
  some 16-bit VGAs.

  One weakness of Listing One is that for lines with slopes between 0.5 and
  2, the average run length is less than two, rendering run-length slicing
  ineffective.  This can be remedied by viewing lines in that range as
  being composed of diagonal, rather than horizontal or vertical runs.  I
  haven't space to discuss this, but it's not very complicated, and it
  guarantees a minimum run length of 2.  That renders run drawing
  considerably more efficient, and makes techniques such as unrolling the
  inner run-drawing loops more attractive.

  Finally, be aware that run-length slice drawing is best for long lines,
  because it has more and slower setup than standard Bresenham's, including
  a divide.  Run-length slice is great for 100-pixel lines, but not
  necessarily for 20-pixel lines, and it's a sure thing that it's not
  terrific for 3-pixel lines.  Both approaches will work, but if
  line-drawing performance is critical, whether you'll want to use
  run-length slice or standard Bresenham's depends on the typical lengths
  of the lines you'll be drawing.  For lines of widely varying lengths, you
  might want to implement both approaches, and choose the best one for each
  line, depending on the line length--assuming, of course, that your
  display memory is fast enough and your application demanding enough to
  make that level of optimization worthwhile.

  An Interesting Twist on Page Flipping

  I've spent a fair amount of time exploring various ways to do animation.
  (See, for example, my July, August, and September 1991 DDJ columns, as
  well as those in the January through April 1992 issues.)  I thought I had
  pegged all the possible ways to do animation:  exclusive-ORing; simply
  drawing and erasing objects; drawing objects with a blank fringe to erase
  them at their old locations as they're drawn; page flipping; and,
  finally, drawing to local memory and copying the dirty (modified)
  rectangles to the screen.

  To my surprise, someone threw me an interesting and useful twist on
  animation the other day, a cross between page flipping and
  dirty-rectangle animation.  That someone was Serge Mathieu of Concepteva
  Inc., in Rosemere, Quebec, who informed me that he designs everything
  "from a game 'point de vue'."

  In normal page flipping, you display one page while you update the other
  page.  Then you display the new page while you update the other.  This
  works fine, but the need to keep two pages current can make for a lot of
  bookkeeping and possibly extra drawing, especially in applications where
  only some of the objects are redrawn each time.

  Serge didn't care to do all that bookkeeping in his animation
  applications, so he came up with the following approach (which I've
  reworded, amplified, and slightly modified):

  1.  Set the start address to display page 0.

  2.  Draw to page 1.

  3.  Set the start address to display page 1 (the newly drawn page), then
  wait for the leading edge of vertical sync, at which point the page has
  flipped and it's safe to modify page 0.

  4.  Copy, via the latches, from page 1 to page 0 the areas that changed
  from the last screen to the current one.

  5.  Set the start address to display page 0, which is now identical to
  page 1, then wait for the leading edge of vertical sync, at which point
  the page has flipped and it's safe to modify page 1.

  6.  Go to step 2.

  The great benefit of Serge's approach is that the only page that is ever
  actually drawn to (as opposed to block-copied to) is page 1.  Only one
  page needs to be maintained, and the complications of maintaining two
  separate pages vanish entirely.  The performance of Serge's approach may
  be better or worse than standard page flipping, depending on whether a
  lot of extra work is required to maintain two pages or not.  My guess is
  that Serge's approach will usually be slower, owing to the considerable
  amount of display-memory copying involved, and also to the double
  page-flip per frame.  There's no doubt, however, that Serge's approach is
  simpler, and the resultant display quality is every bit as good as
  standard page flipping.  Given page flipping's fair degree of
  complication, this approach is a valuable tool, especially for
  less-experienced animation programmers.

  An interesting variation on Serge's approach doesn't page flip or wait
  for vertical sync:

  1.  Set the start address to display page 0.

  2.  Draw to page 1.

  3.  Copy, via the latches, the areas that changed from the last screen to
  the current one from page 1 to page 0.

  4.  Go to step 2.

  This approach totally eliminates page flipping, which can consume a great
  deal of time.  The downside is that images may shear for one frame if
  they're only partially copied when the raster beam reaches them.  This
  approach is basically a standard dirty-rectangle approach, except that
  the drawing buffer is stored in display memory, rather than in system
  memory.  Whether this technique is faster than drawing to system memory
  depends on whether the benefit you get from the VGA's hardware, such as
  the Bit Mask, the ALUs, and especially the latches (for copying the dirty
  rectangles) is sufficient to outweigh the extra display-memory accesses
  involved in drawing, since display memory is notoriously slow.

  Finally, I'd like to point out that in any scheme that involves changing
  the display-memory start address, a clever trick can potentially reduce
  the time spent waiting for pages to flip.  Normally, it's necessary to
  wait for display enable to be active, then set the two start address
  registers, and finally wait for vertical sync to be active, so you know
  the new start address has taken effect.  The start-address registers must
  never be set around the time vertical sync is active (the new start
  address is accepted at either the start or end of vertical sync on the
  EGAs and VGAs I'm familiar with), because it would then be possible to
  load a half-changed start address (one register loaded, the other not yet
  loaded), and the screen would jump for a frame.  Avoiding this condition
  is the motivation for waiting for display enable, because display enable
  is active only when vertical sync is not active and will not become
  active for a long while.

  Suppose, however, that you arrange your page start addresses so that they
  both have a low-byte value of 0 (page 0 starts at 0000h, and page 1
  starts at 8000h, for example).  Page flipping can then be done simply by
  setting the new high byte of the start address, then waiting for the
  leading edge of vertical sync.  This eliminates the need to wait for
  display enable (the two bytes of the start address can never be
  mismatched); page flipping will often involve less waiting, because
  display enable becomes inactive long before vertical sync becomes active.
   Using the above approach reclaims all the time between the end of
  display enable and the start of vertical sync for doing useful work.
  (The steps I've given for Serge's animation approach assume that the
  single-byte approach is in use; that's why display enable is never waited
  for.)

  Thanks

  I took another bundle of reader contributions from the X-Sharp careware
  over to the Vermont Association for the Blind the other day.  They were
  very grateful.  Thanks to all of you who have helped so far.

  [BACK] Back
Discuss this article in the forums
Date this article was posted to GameDev.net: 7/16/1999
(Note that this date does not necessarily correspond to the date the article was written)
See Also:
Michael Abrash's Articles