Solving the increasingly complex big data analytics challenge

By Chirag Dekate | 22 July 2015

Storage plays a key part in data capture, algorithmic trading, risk management and fraud detection, and as data volumes continue to grow, the approach to storage infrastructure needs to change. DDN’s Chirag Dekate discusses the challenge financial services organisations face in capitalising on big data and analytics, and offers his thoughts on how to maximise investments in technology to get results faster.

We’ve worked very closely with a number of financial services firms and trading houses - over 40 percent of the world’s largest trading companies rely on our technology - and the need to stay one step ahead is the key driver for investment in new technology. Big data and analytics in particular are driving this investment, so financial services firms are always on the lookout for innovations that can help them maximise return on investment and gain a time-to-market advantage over their competitors.

The challenges they face are becoming increasingly complex: data capture, algorithmic trading, back testing, risk management, pricing and fraud detection. Competitiveness depends on both the speed of analysis and the volume of data analysed, since larger data sets increase the likelihood of more accurate analysis.

The volume of data being collected and queried continues to grow, now into the terabytes and petabytes. This prodigious growth doesn’t just affect the financial services industry; almost every vertical faces similar challenges. Financial services firms, however, have always been at the forefront of big data analytics. It is, and has long been, the cornerstone of a successful FS organisation.

The challenge isn’t just about storing the masses of data and the associated costs; that problem was solved long ago. The challenge is how to take advantage of big data effectively, using analytics to derive bottom-line results.

Data sources such as additional trading venues for best execution, news feeds for sentiment analysis, cross-product customer databases and more are being added all the time. When the best result depends on having the optimal algorithm consuming the most relevant data, the ideal approach is to be able to analyse more data, faster. Faster time to results with richer data inputs yields answers that reflect market dynamics far more accurately than the smaller sample-data models of the past.

The storage bottleneck

The old adage that “you’re only as fast as your slowest component” holds true in big data analytics. From the late 1980s to the present there has been a massive leap forward in processing and compute environments. As these environments grew, organisations began to suffer from a bottleneck in their storage infrastructure.

Some organisations were using legacy network-attached storage (NAS) architectures to manage their terabytes and petabytes of market data. This is a fundamentally mismatched technology that often results in input/output (I/O) bottlenecks, which hurt application performance because applications cannot access the relevant data fast enough. Critical financial applications running on these NAS architectures often experience 3x to 5x longer time to solution. This widening gap between storage performance and compute performance was also limiting financial services firms’ ability to fully benefit from their investments in the compute power used for big data analytics.

Financial services organisations achieve competitive advantage by bringing successful new strategies to bear while phasing out less successful ones in real time. That means your infrastructure needs to support projects measured in petabytes of capacity and gigabytes per second (GB/s) of sustained I/O – i.e. the speed at which data is stored and recalled. Imagine your desktop computer: when you open a document, you are asking the machine to locate the data for that document on the hard disk drive, read it and present it back to you. While this certainly over-simplifies I/O, the action is the same, except that you’re asking your machines to store, recall and analyse data at terabyte and petabyte scale, and to do it in fractions of a second.
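To make “sustained I/O” concrete, here is a minimal sketch, assuming a hypothetical market-data file at /data/market_ticks.bin and an arbitrary 64 MiB read size. It simply times sequential reads and reports throughput in GB/s; on a real system you would also need to account for the operating system’s page cache.

```python
import time

BLOCK_SIZE = 64 * 1024 * 1024    # read in 64 MiB chunks (illustrative choice)
PATH = "/data/market_ticks.bin"  # hypothetical large market-data file

def sustained_read_gbps(path: str, block_size: int = BLOCK_SIZE) -> float:
    """Time sequential reads of a file and return sustained throughput in GB/s."""
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    return (total_bytes / 1e9) / elapsed

if __name__ == "__main__":
    print(f"Sustained read throughput: {sustained_read_gbps(PATH):.2f} GB/s")
```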

Beyond the storage hardware lies the next possible bottleneck: the file system. Band-aiding traditional file system approaches onto storage infrastructures that are already at the limits of the data volumes being analysed isn’t going to help. Using fast, scalable, external disk systems with massively parallel access to data allows FS firms to take advantage of much larger data sets.

Parallel file systems provide the ability to scale up to support ever-increasing workloads. By leveraging the latest advances in parallel file system architecture, organisations can eliminate the single-file-server bottleneck of NAS systems and automatically spread the workload of moving files from disk to memory evenly across all system resources.

With these bottlenecks removed, applications can access datasets faster by fully exploiting the inherent concurrency, enabling higher-throughput I/O and, as a result, much shorter time to solution. These performance and throughput advantages scale linearly with data set size, enabling financial services organisations to analyse larger data sets faster and thereby improve trading algorithm accuracy.
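As a rough illustration of the concurrency principle (not any particular parallel file system’s API), the sketch below assumes a logical dataset split into hypothetical stripe files on separate storage targets and reads them in parallel. In a real parallel file system the striping is transparent to the application; the point is simply that no single file server sits in the data path for every byte.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stripes of one logical dataset, spread across eight storage targets.
STRIPE_PATHS = [f"/pfs/target{i}/ticks.part{i}" for i in range(8)]

def read_stripe(path: str) -> bytes:
    """Read one stripe; each stripe is assumed to live on a different storage server."""
    with open(path, "rb") as f:
        return f.read()

def parallel_read(paths: list[str]) -> list[bytes]:
    """Issue all stripe reads concurrently so no single server becomes the bottleneck."""
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        return list(pool.map(read_stripe, paths))

if __name__ == "__main__":
    stripes = parallel_read(STRIPE_PATHS)
    print(f"Read {sum(len(s) for s in stripes) / 1e9:.2f} GB across {len(stripes)} stripes")
```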

This isn’t an entirely new approach. For the past several years, visionary hedge funds, proprietary trading firms and other financial institutions have been changing their infrastructure to take advantage of parallelism, analysing more positions faster and developing more effective trading and risk management strategies, which they are deploying in much less time.

Storage defined by software

If the storage bottleneck has been solved, and as more firms take advantage of parallelism in storage, the real question is: what’s next? If you can scale parallel file systems up and out efficiently and effectively, what other improvements can be made to help capitalise on big data growth?

Traditionally, storage systems are sized for anticipated peak performance requirements rather than sustained requirements, which results in a lot of unused or underutilised bandwidth and capacity. So what if you could take the heavy I/O lifting and virtualise it? That is, use a software-defined burst buffer to manage and control the flow of data into and out of storage.

If a software-defined burst buffer could perform all the heavy I/O lifting while leveraging a parallel file system behind it as the persistent storage layer, you could potentially open up as much as a 1,000x performance increase with your current infrastructure. It would also mean you wouldn’t have to purchase more hardware to meet growing data and analytics demands.
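Conceptually, the burst buffer pattern looks something like the sketch below, where the fast-tier and persistent-tier paths are illustrative assumptions rather than a description of any specific product: application writes land on a fast flash tier and return immediately, while a background drainer moves the data on to the parallel file system.

```python
import queue
import shutil
import threading
from pathlib import Path

FAST_TIER = Path("/nvme/burst_buffer")   # hypothetical fast flash tier
PERSISTENT = Path("/pfs/archive")        # hypothetical parallel file system mount

def drain(work: "queue.Queue[Path | None]") -> None:
    """Background drainer: move absorbed files from the fast tier to persistent storage."""
    while True:
        item = work.get()
        if item is None:          # sentinel: no more writes to drain
            break
        shutil.move(str(item), str(PERSISTENT / item.name))
        work.task_done()

def absorb_write(name: str, data: bytes, work: "queue.Queue[Path | None]") -> None:
    """Application-facing write: lands on the fast tier and returns immediately."""
    target = FAST_TIER / name
    target.write_bytes(data)
    work.put(target)              # hand off to the drainer; compute is not blocked

if __name__ == "__main__":
    FAST_TIER.mkdir(parents=True, exist_ok=True)
    PERSISTENT.mkdir(parents=True, exist_ok=True)
    work: "queue.Queue[Path | None]" = queue.Queue()
    drainer = threading.Thread(target=drain, args=(work,), daemon=True)
    drainer.start()
    absorb_write("results_0001.bin", b"\x00" * 1024, work)
    work.put(None)                # signal shutdown once all writes are absorbed
    drainer.join()
```

Much of the practical gain typically comes from absorbing bursty, fragmented writes in flash and handing them to the persistent layer as large, well-ordered streams, so the parallel file system can be sized for sustained rather than peak demand.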

This decoupling of performance and capacity enables you to eliminate overprovisioning of compute and storage resources just for peak bandwidth needs. Your compute is a major investment, so maximising the amount of time spent processing and minimising latency and idle times are the key to optimising the return on that investment and delivering faster results.

An intelligence layer in front of persistent parallel storage can, in many instances, deliver a third more data-processing availability from your current infrastructure, and it can mean 70% less hardware is needed to reach the performance you’re looking for.

There’s no denying that capitalising on big data requires investment in storage and compute, but what I advise our customers, prospects and you is to look at the new innovations in storage. Throwing more storage hardware at the problem works on paper, but this overprovisioning approach eventually runs into diminishing returns and excessive, hard-to-manage overhead at scale, not to mention significant cost. By using parallelism, intelligent software-defined burst buffers and the latest in persistent storage technology, you can process more data in less time, giving you the competitive advantage.
 

By Chirag Dekate, Sr. Manager Vertical Markets at DDN
