Thursday, March 24, 2011

Structure Big Data Conference


Wednesday (23 March) was a good day to be at a big data conference.  Weather.  Climate.  Fire.  Land Use.  Remote Sensing.  Building Models.  All of these topics were woven into the conversations throughout the day, some in great detail.  They were certainly not what I was anticipating when heading to the GigaOm Structure Big Data Conference at historic Chelsea Piers, but since I have a data-centric interest in all of them, I was more than pleased.  Despite a dreary late-March day replete with cold rain and snow, and an unpleasant encounter with a miserable commuter on the morning bus into New York, things brightened once the conference started, and the day proved well worth its suboptimal start.

Many other posts will describe all of the sessions of the event in detail.  From start to finish, the entire day was filled with good talks, but a couple certainly stood out: the opening panel, featuring Terry Jones (Fluidinfo), Hilary Mason (bit.ly), Bill McColl (Cloudscale), and Bassel Ojjeh (nPario), and the keynote that followed from Jeff Jonas (IBM).

Put the Fluid into Data

We have all heard the quote from Hal Varian, Google’s Chief Economist, that statisticians will be in top demand through the next decade.  That is fine, but data manipulation and statistical skills are only part of the equation.  The other part is the fluid part: the creative aspect that Jones stressed more than once during the first panel.  He noted that it is easy to get hung up on hierarchy, structure, and compatibility.  These are important concepts indeed, but the true data scientist (or, to use the now-preferred term, data ninja) needs to use both sides of the brain.  Jones’ comments reminded me of a course that I used to teach on the art and science of forecasting (later renamed the more formal “decision making under uncertainty”).  While the course was grounded in demonstrating the tools and techniques of how to use (and not use) statistical analysis, its true benefit emerged when students added the element of subjectivity and were forced to communicate the results of their analysis.

The New Physics

This panel was followed by a wonderful overview of big data by Jeff Jonas of IBM, with a memorable introduction.  Jonas talked fast and covered a lot, but what I took from his talk is the underlying importance of context.  With Moore’s Law at work on the amount of data being generated, data scientists have no choice but to let algorithms do some of the work.  With the amount of data becoming available from all disciplines, we don’t even know which questions to ask.  Machine-driven data mining will allow researchers and analysts to formulate better, more relevant questions, which can lead to tangible products and a more direct route to market.  Jonas described this as letting the data ‘find’ other relevant data, which makes perfect sense.  This has obvious relevance to two fields within the earth and environmental sciences: environmental early warning systems and environmental metagenomics, where sensors, stations, and satellites have created, and continue to create, a real-time planetary central nervous system tracking past and current global biogeophysical conditions.  Putting the pieces together is the next grand challenge in this space (see the work of Jose Achache).

Finally, the talk that I most wanted to hear did not disappoint.  Alfred Spector of Google delivered a timely overview of ‘The Prodigiousness of Data’, again with many references to applications in the earth and environmental sciences, where freely available research-grade data can be imported from multiple sources and developed into a scientifically defensible thesis.  Advances in the availability of both data and tools that let users create meaningful visualizations via maps and supporting graphics can condense reams of data into a simple and effective message.  His talk also touched on some of the same themes highlighted by the Guardian’s Simon Rogers at the O'Reilly Media Strata 2011 Conference last month.
Given one of Spector’s examples, which highlighted the retrieval, processing, and presentation of multispectral satellite land cover imagery as data-rich maps, I expect to see more of this hybrid approach spurring science over the next couple of years.  It is worth noting that even at mainstream scientific conferences (particularly the American Geophysical Union events), attendees are now more likely to see data in presentations and talks visualized with free tools such as Google Earth and R.  I’ll be spending some time with Google’s Fusion Tables this weekend as well.
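The workflow Spector described, pulling freely available data from multiple sources and condensing it into a simple, defensible summary, can be sketched in a few lines.  This is a hypothetical illustration only (the station names and values are invented, and real sources would be downloads from public archives), not anything shown in the talk:

```python
import csv
import io
from collections import defaultdict

# Two hypothetical "freely available" sources reporting on the same
# stations; in practice these would come from public data archives.
source_a = """station,temp_c
NYC,4.2
BOS,1.8
"""
source_b = """station,temp_c
NYC,4.6
BOS,2.2
"""

def load(text):
    """Parse one CSV source into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

# Merge the sources, then reduce to one number per station --
# condensing "reams of data into a simple and effective message".
readings = defaultdict(list)
for row in load(source_a) + load(source_b):
    readings[row["station"]].append(float(row["temp_c"]))

summary = {s: sum(v) / len(v) for s, v in readings.items()}
print(summary)
```

The same merge-then-summarize pattern scales from two toy strings to many real feeds; only the loading step changes.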

Looking Forward to Next Year

Overall, it was a long day, but one that provided a great deal of fodder for future applications.  Some of the more memorable quotes of the day that carried personal significance:

  • Weather forecasting is the original big data problem
  • Big Data...New Physics
  • Most data analysis software is based on algos that are 30/40/50 years old
  • R is the statistical language of the future
  • We need an iTunes for data

Thanks to Om Malik and the entire GigaOm team for putting on a wonderful event; I will be looking forward to future events with the Big Data theme.
