Data Management Infrastructure

I hate coining buzzwords.  And maybe I didn’t even coin this one.  But we need some phrase to describe the following problem:

Data now comes in two processable flavors: structured and unstructured.  And stacks now exist for processing either flavor.  But each world is undergoing transformation.  And how the two worlds will be combined is up in the air.

Unstructured data: whether or not the Hadoop ecosystem is The Answer, there is vigorous experimentation with how to work with massive amounts and velocities of unstructured data, and there are some emerging norms.  Other NoSQL approaches remain as alternatives, and there will probably be use cases for almost all of them.  We are not even near the end of the beginning when it comes to defining how unstructured data systems will interface with applications (where is the NoSQL SQL?), and we are IMHO still at the very beginning of understanding which storage systems are optimized for these workloads.

Structured data: with NewSQL databases, it is clear how to interface them with applications but far from clear how they work with storage systems, particularly SSD-based storage systems.  The jury is out as well on how to multiplex the different databases in a use case.

I call all of this “data management infrastructure”, and it seems to me like an emerging big design problem.

Thoughts?  Who’s working on this?  Where should we invest?