Big Data Notes 009: In-memory

Among the ever-expanding echelons of the world’s collective big data taskforce, there are innumerable diverging schools of thought on the best way to operate, in no small part because the field covers such a broad range of applications. Here Big Data Notes explores in-memory.

In-memory? Big data isn’t dead already, is it?

No, far from it. In-memory is no eulogy. It’s an option for the storage, access and analysis of data which is primed for ‘big data’ since it offers faster and more predictable response times.

Instead of storing data on a disk that is separate from the computer’s CPU, as is the norm, an in-memory database system keeps it in random access memory (RAM), which is the CPU’s first port of call.

I think I may have heard of this elsewhere. Does it have any aliases?

Imaginatively, some know it as a main memory database system, or MMDB.

Very good. So what’s the lowdown?

Essentially, performing analysis on structured data is a lot quicker when the data is in-memory. This is because it is not bound by the restrictions that come with secondary storage, which seriously impair performance.

In the secondary, disk-based storage where databases are traditionally held, the data is stored in tables and cubes which also include lots of meta-information, known as ‘materialised aggregates’, that is ostensibly supposed to make calculations easier when the database is loaded and a user makes a query. However, this only works while the database remains relatively small. With a huge database, the approach stops a query in its tracks: it simply takes too long to read all of the different tables individually and piece the results together. Either the results arrive too late to be relevant or further analysis is equally arduous.

When in-memory is used, by contrast, the data is stored vertically, in columns. This effectively segregates the data, meaning much less metadata is needed, and, furthermore, the data is easier to compress and therefore quicker to scan.
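To make the row-versus-column point concrete, here is a minimal sketch in Python. The sales table, the column names and the run-length encoding scheme are all illustrative assumptions, not a description of how any particular product stores or compresses data.

```python
from itertools import groupby

# Hypothetical sales table, stored row by row (the traditional layout).
rows = [
    ("UK", "widget", 10),
    ("UK", "widget", 12),
    ("UK", "gadget", 7),
    ("FR", "gadget", 5),
    ("FR", "gadget", 9),
]

# The same data stored vertically: one list per column.
columns = {
    "country": [r[0] for r in rows],
    "product": [r[1] for r in rows],
    "units": [r[2] for r in rows],
}


def run_length_encode(values):
    """Collapse runs of repeated values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def total_units_row_store(table):
    """Walk whole rows, even though only the units field is needed."""
    return sum(units for _country, _product, units in table)


def total_units_column_store(cols):
    """Read only the one column the query asks about."""
    return sum(cols["units"])


# Similar values sit next to each other in a column, so they compress well.
encoded_country = run_length_encode(columns["country"])
```

Both functions return the same total, but the column version never looks at the country or product values, and the country column shrinks from five entries to two (value, count) pairs.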

Finally, as the data is stored in columns it is supposedly easier to distribute for parallel computing, meaning more processing cores can be easily applied to the task.
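As a rough illustration of that idea, the sketch below cuts a single column into slices, sums each slice on a separate worker and then combines the partial results. The function names and the column contents are hypothetical, and Python threads are used only to keep the example short: in standard CPython they share one interpreter lock, so this shows the partitioning pattern rather than a real speed-up across cores.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(column, n_chunks):
    """Cut one column into at most n_chunks contiguous slices."""
    size = max(1, -(-len(column) // n_chunks))  # ceiling division
    return [column[i:i + size] for i in range(0, len(column), size)]


def parallel_sum(column, n_workers=4):
    """Sum each slice on its own worker, then combine the partial sums."""
    chunks = split_into_chunks(column, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


units = list(range(1, 1001))  # hypothetical column holding 1..1000
total = parallel_sum(units)
```

Because each slice holds values from one column only, the workers need no knowledge of the rest of the table, which is the property that makes columnar data comparatively easy to spread across cores.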

All of this means the results of an analysis query can be served up a cat’s whisker away from real time, allowing the recently crowned coolest cats of the enterprise, the data investigators, to fill their winklepickers.

The technical definition is ‘real-time online analytical processing (OLAP) analysis using an online transaction processing (OLTP) data structure.’

So why has the Einstein moment only come now?

As with many facets of big data, and indeed with any new technology, a number of factors have come together over the last few years to enable data to be analysed in random access memory. One is that the cost of RAM has come down significantly. The limits on the power of the technology have also been lifted, or at least raised significantly, with radically increased numbers of processing cores and larger cache sizes available on common-or-garden computing systems.

So this is finally the one-stop shop answer to all of our big data problems then? Hurrah.

Nah, t’aint, sorry. As we will see from some of the leading suppliers in a bit, you can only deal with a few terabytes of data at a time in-memory. As far as the volume bit goes, that’s pretty big, but I’ve seen bigger. In fact, some organisations have databases that run into petabytes – that’s thousands of terabytes. This won’t help in that scenario.

There’s also the variety issue. In-memory alone doesn’t allow you to work with unstructured data any more than disk-based storage does. You’ll still need specialist tools like Hadoop and non-relational databases for that.

So who is pushing in-memory as a solution and what do the offerings involve?

Those old sparring partners Oracle and SAP are the most prominent. Oracle has released its Exalytics In-Memory Machine while SAP pushes its Hana platform. Both are essentially designed to replace existing data warehouses and come packaged with a mixture of hardware, software and links to the companies’ existing range of BI applications.

The Oracle solution offers a complete end-to-end system built around a single-server ‘In-Memory Machine’ with 3.6 terabytes of storage capacity, which the company claims is the ‘industry’s first engineered in-memory analytics machine that delivers no-limit, extreme performance for Business Intelligence and Enterprise Performance Management applications’. As well as this piece of hardware, the package comes with software which includes Oracle’s ‘TimesTen In-Memory Database’, an SQL-based relational database which is specifically geared up to work in-memory and apparently offers ‘real-time data management that delivers blazing-fast response times, and very high throughput for a variety of workloads.’

The proposition is that the data which a company is performing analysis on at any given time can be moved over to in-memory in order to provide faster analytics (up to 20 per cent quicker, according to Oracle).

Being a software company first and foremost, SAP’s package doesn’t come with its own server, but the company has worked closely with a number of hardware producers to build ‘partner certified’ hardware. Public cloud providers also offer Hana-optimised environments.

As with the Oracle system, SAP provides a proprietary in-memory database which links to the company’s existing portfolio of BI apps, here optimised to run in-memory.

The SAP system differs from Oracle’s in that entire databases can be stored in-memory, instead of just the ‘hot’ data. Its storage capabilities are therefore larger, supporting up to eight terabytes of data, or 40 terabytes in compressed format.

Eventually, all existing SAP database implementations are expected to move to the Hana platform, although this migration is likely to take some time.

Where can I find out more?

Get in touch – mark.young@bigdatainsightgroup.com – we’ll link you up.