March 2009 – Marlena's Blog

What is Data Visualization: Part 1 of 2 Characteristics of Excellent Visualizations

In this post, I will be answering the question, “what is data visualization” and writing about some of Edward Tufte’s principles for “excellent” data visualizations. These principles can be an aid in creating better graphs or in reading graphs. In a subsequent post, I will relate these fundamental principles to visualizations for use in software testing.

In the first of his books on visualization, The Visual Display of Quantitative Information, Tufte outlines several principles for use in the creation and interpretation of quantitative graphics. If you get the chance, I highly recommend flipping through it. If you have questions about the statistics concepts, you might want to look at Head First Statistics by Dawn Griffiths. I’ve been hitting this book up regularly, especially for the metrics class I’m currently taking.

In the comments of my post “Exploring Data Visualization,” Eric asked me, “what is data visualization?” When I say data visualization, I’m talking about a graphical depiction of statistical information that tells a story. These depictions can be simple or more complex, and they all have a point they are trying to make. According to Edward Tufte, an excellent visualization expresses “complex ideas communicated with clarity, precision and efficiency” (13).

To illustrate this, have a look at one of my favorite interactive web graphics, “A Year of Heavy Losses,” from The New York Times. It illustrates the change in market capitalization of banks from 2007 to 2008. Be sure you click on the square at the top left to see the change. You can see not only the number of banks dwindling, but also their shrinking market capitalization. You can also mouse over each bank to see more granular data.

According to Tufte, these are some characteristics of excellent visualizations:
1. Lots of numbers packed into a tiny space
2. Data represented is not distorted
3. Extremely large data sets have coherency
4. Comparison between different pieces of data is easy
5. Data is revealed at a micro level and at a macro level
6. The data’s purpose is clear
7. Integration between the statistical and verbal descriptions of the data is tight

Here is an illustration Tufte uses as an example:

It is a French train schedule from the 1880s. Take some time to look at it and understand it, then look back at the characteristics I have just listed. Did you notice how the cities on the left are not listed at regular intervals? This is because Marey spaced them apart in proportion to their actual distance from each other. Since he did that, when you look at the slope of the lines, you are not only seeing arrival and departure times, but also the relative speed with which the train will get you from one place to the next. If you depend on trains to get you from one place to the next, this can be very important information.
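To make the point about slope concrete, here is a minimal sketch in Python. The station names, distances and times are invented for illustration and are not read from Marey’s chart. It only shows that once stations are spaced by real distance, the slope of each line segment (distance divided by elapsed time) is the train’s speed on that leg.

```python
from datetime import datetime

# Hypothetical stops for one train: (station, km from origin, clock time).
# These values are invented for illustration, not taken from the real schedule.
STOPS = [
    ("Station A", 0.0, "06:00"),
    ("Station B", 120.0, "08:00"),
    ("Station C", 150.0, "09:00"),
]


def segment_speeds(stops):
    """Return (from, to, km/h) per leg, i.e. the slope of each line segment."""
    speeds = []
    for (name_a, km_a, t_a), (name_b, km_b, t_b) in zip(stops, stops[1:]):
        elapsed = datetime.strptime(t_b, "%H:%M") - datetime.strptime(t_a, "%H:%M")
        hours = elapsed.total_seconds() / 3600
        speeds.append((name_a, name_b, (km_b - km_a) / hours))
    return speeds


for leg in segment_speeds(STOPS):
    print("%s -> %s: %.1f km/h" % leg)
```

With these made-up numbers the first leg is steeper than the second, which on the chart would read as a faster train.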

This graphic also illustrates the concept of multivariate data which, according to Tufte, is also a quality of excellent visualizations. I’m going to break out what’s in the train illustration into univariate, bivariate and multivariate data. If I miss something, just add a comment.

Let’s start with the concept behind this illustration. It’s depicting arrival times and departure times of trains in France. It shows the route the trains take, and the relative speed with which they travel from one station to the next.

Univariate data shows the frequency/probability of one variable.
Some univariate data from this graphic: the number of trains arriving or departing a station. The number of trains arriving at stations at any one time. The number of arrivals at a station each day. The number of departures from Chagny station each day. Each of the variables I have described is a frequency (Head First Statistics 609).

Bivariate data shows 2 values for an observation.
Bivariate data from this graphic:
(x) Time of day
(y) Number of trains arriving/departing at Chagny station

For this observation you need two variables (Head First Statistics 610).

Multivariate data shows multiple values for an observation.
If we take the observation from the bivariate data example and add stations, the observation becomes multivariate and is what you see in Marey’s illustration.
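The three kinds of data above can be sketched in a few lines of Python. The arrival records here are invented (only the Chagny station name comes from this post), so treat it as an illustration of the idea and not as data from the schedule.

```python
from collections import Counter

# Hypothetical arrival records: (station, hour of day). Invented for illustration.
ARRIVALS = [
    ("Chagny", 6), ("Chagny", 6), ("Chagny", 9),
    ("Dijon", 6), ("Dijon", 12),
]

# Univariate: one variable, the number of arrivals at each station.
per_station = Counter(station for station, _ in ARRIVALS)

# Bivariate: hour of day (x) against number of arrivals at Chagny (y).
chagny_by_hour = Counter(hour for station, hour in ARRIVALS if station == "Chagny")

# Multivariate: bring the station back in, so each count has a station and an hour.
by_station_and_hour = Counter(ARRIVALS)

print(per_station)
print(chagny_by_hour)
print(by_station_and_hour)
```

Each step adds one more variable to the same observations, which is what happens as you move from a single count to the full chart.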

I’ve just covered a lot of material and I hope it gives you a good idea of what data visualization and the field of information visualization are all about. In my next post, I’ll be covering the ways in which graphs can lie. I’ve seen this happen at work and just completed a reading assignment for school where it was also an issue. These are complicated topics that software engineers should understand if they are to use visualization in ways such as a tester’s heads up display.

Questions and comments are always welcome.

Exploring Data Visualization

For the past few months I’ve been obsessively learning about data visualization, so I’m posting about my exploration with links to everything (books, blogs, graphics, people, etc.). This topic fascinates me because it brings together all of my studies including art, art history, theatrical design, computer science and software engineering.

Last fall, I found the book Visualizing Data: Exploring and Explaining Data with the Processing Environment by Ben Fry. I can’t remember how I found it. Maybe it was in the O’Reilly email of new titles. Since I work at a credit reporting agency, there is no end to the data. It seemed like the perfect opportunity to learn about graphics, so I started typing out Fry’s examples and then applying them to my data. Fry is one of the creators of a graphics library called Processing, which uses Java. This made the examples pretty easy to understand. I’m not finished with this book yet. The examples get more and more challenging the further you go, but the author seems to enjoy interacting with his readers and wants people to have a positive experience with his code.

So last fall, I was having fun with these examples, and then I went to GTAC. I know I’ve already written about James Whittaker’s keynote, but just bear with me. Seeing how transfixed the crowd was by the few data visualizations he uses for testing, I felt something click in my head. There aren’t many moments in life when we get total clarity, but I finally had a huge one and decided not to let it go.

Before, I had just been playing with data visualization and happy that it fed my artistic side, but now, I was in it for keeps. When I came home, I looked at the other books that Ben Fry was referencing and found the ultimate classic of data visualization. If you only ever read one book about this topic, that book should be Edward Tufte’s The Visual Display of Quantitative Information, 2nd edition. After I read this book, I had a talk with my thesis advisor and decided to do a thesis on data visualization and software testing.

Please note that Tufte will not tell you what type of graph to use in any particular situation. For that, I turned to Head First Statistics. It goes over this in Chapter 1 and is the most accessible statistics book I have ever read.

Since blogging is such a great font of information, I went out looking for blogs and found several that I really enjoy. There are definitely other worthwhile blogs on data visualization. These are the ones I’m reading regularly:

Jorge Camoes’ Charts
Visual Business Intelligence (Stephen Few’s blog)
Excel Charts and Tutorials by Peltier Technical Services
Information Ocean

A few weeks ago, Edward Tufte offered a seminar in Atlanta, and I was fortunate enough to go. It’s pricey, but all four of his lovely hardback books are included, which somewhat offsets the cost of admission. I found some excellent notes on Justin Wehr’s blog that were taken a few days later in Raleigh. I could tell that Dr. Tufte had given his prezo a few (hundred) times, but seeing him present his material provoked some really deep thinking. When the presentation was over, I walked to a bench in the hotel lobby and put together the bones of my thesis. Visionaries such as Dr. Tufte always inspire my best thinking.

Currently, I’m reading through lots of research papers about the visualization of source code. I’ll make a separate blog post for that. Well, there might be several separate blog posts for that. For the first time in my life, I feel completely engaged in what I’m doing at work and at school.

A few more Vampire Testing Lessons

I just love it when bloggers mix pop culture with testing. Recently, the Testy Redhead posted a few lessons about testing she adapted from her reading of the Twilight books. I love these books in all their somewhat-poorly-written-but-ultimately-addicting glory so I couldn’t help but put together a few lessons of my own. The last lesson is a spoiler, but I’m guessing that there are not hordes of Twi-hards reading this blog.

You can build a working car from a bunch of disparate parts.
Jacob totally rebuilds a Volkswagen Rabbit using parts that he collects over the length of a couple of the books. This reminds me of some of the great open source tools that are now available for testing such as Bugzilla and Selenium. It also reminds me of the Automated System Test Framework I’m building from scratch at my job. I started out with a bunch of short scripts that I wrote, but with some perseverance, I’m close to having a system in place that will greatly assist me in testing.

Sometimes the yellow Porsche really is what you need.
I’ve noticed a real disdain for expensive tools among testers, but sometimes they are the right answer the same way the yellow Porsche was the right car for Bella and Alice in New Moon. When I started my testing job, I was not a tester and I did not know what I was doing. My tester friends in another group had shown me HP Quality Center, and I realized that I desperately needed this assistance with test case management. It helped me transition off of spreadsheets and gave me a structure for repeatable testing.

Don’t read the last one if you don’t want to read the spoiler.

Testers ARE the Shield
In the last book, Bella protects everyone using her special super power. Testers also have a super power, and that power is the right to say, “This product really stinks and is not ready to be released with our team’s name on it.” This is not the most obvious power, but it can protect a team or even a company from releasing a product and regretting it. I’ve had to say this before to a most “busy and important” developer who let me know how busy and important he was, but I knew that I was protecting consumers by saying it.

Is complexity really all about the source code?

At this point, the work I’m doing on my master’s thesis is starting to congeal. I’ve been reading about data visualization and how it can assist quality. There seem to be several levels at work. There is the source code level, where developers and, to some extent, QA examine individual pieces of code. This level is addressed by unit tests generally written by a developer or specialized white box tester. Higher up is the system test level, which may or may not be automated. Recently, frameworks and tools such as STAX/STAF and Selenium have helped to automate and provide more consistency for some level of system test. At the highest level, quality is less about tests and more about metrics, in particular, lines of code (LOC).

In my research, I have found many papers about providing analysis for source code. There are also plenty of papers which address the production of metrics. Software and user interfaces have developed to the point where the idea of zooming is going to bring these layers together. If you think about Google Maps and the zooming capability that it has, think about how that type of zooming could be applied to the software development process. James Whittaker, the architect of Microsoft Team Test, certainly has. His vision is that this type of interaction will result in a Heads Up Display for quality.

Assuming that this is where we are going, I can see some challenges in how projects are organized. The main challenge will be mapping from one level to the next. Moving from a system test level to a unit test level will highlight how system tests are mapped to unit tests. This goes back to requirements and design, which I know is still an issue at not all but plenty of software shops. Apart from unit to system test mapping, there is also the issue of how LOC is examined and how that maps to source code and maybe even system tests. What if two developers have worked on the same code? How would management assess the LOC count?
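As a rough sketch of what that mapping between levels could look like, here is some Python that links system tests to unit tests through shared requirement IDs. Every ID and test name is hypothetical, and tying both levels to requirements is my assumption about how the link would be made, not a description of any existing tool.

```python
# Hypothetical traceability data; every ID and test name below is invented.
SYSTEM_TO_REQ = {
    "ST-1": {"REQ-1", "REQ-2"},
    "ST-2": {"REQ-3"},
}
UNIT_TO_REQ = {
    "test_parse": {"REQ-1"},
    "test_score": {"REQ-2"},
}


def unit_tests_for(system_test):
    """Unit tests that share at least one requirement with the system test."""
    reqs = SYSTEM_TO_REQ[system_test]
    return sorted(name for name, covered in UNIT_TO_REQ.items() if covered & reqs)


def unmapped_system_tests():
    """System tests where zooming in would find no unit tests at all."""
    return sorted(st for st in SYSTEM_TO_REQ if not unit_tests_for(st))


print(unit_tests_for("ST-1"))
print(unmapped_system_tests())
```

A system test that shows up as unmapped is exactly the kind of gap that zooming from one level to the next would expose, and it is only detectable when the requirements are good enough to link against.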

I know that, technically, system tests, unit tests and source code are all supposed to be organized based on requirements, but some of us live in the real world!!!! I know that there are places where requirements are maybe not great, but usable. That said, plenty of us testers don’t work in shops where we have the benefit of usable requirements, or the requirements may be usable only in a very loose sense of the word. If a tester is working in a group with bad requirements, they are still responsible for the bugs they didn’t catch.

In this case, complexity is not just about source code anymore, but about how well tests cover the system.