Musings from the World of Consulting

[Embedded video: Livestream channel gigaombigdata]

This was a great discussion on how established market leaders (e.g., Allstate in the insurance pricing arena) are using competitive platforms (in this case Kaggle) to bring new algorithmic concepts and analytical approaches to bear on classic problems.

Allstate’s competition was discussed and some interesting perspectives were shared.

Eric (from Allstate) mentioned that some of the winning algorithms would have been too complex to explain to Allstate’s customers and thus may not have been suitable for practical implementation. That statement is partially true, as most insurance carriers are using GLM and derivative models whereas the public’s general comprehension ends with simple regression. As such, while the causes of higher premiums may be explained in generic terms, the actual contributors are usually deemed trade secrets. The ‘disconnect from the model’ or ‘black box model’ perception is going to increase over time as complex algorithms (sometimes referred to as ‘machine learning’) begin to gain prominence in more consumer-facing interactions.

Another point that was tangentially discussed was the inadvertent over-fitting of the submitted models to the sample data set. This is not surprising given the competitive personalities engaged on Kaggle, as Jeremy (Kaggle’s Chief Scientist) noted. The takeaway is that data preparation is paramount and the criteria used to benchmark submissions need further scrutiny. After all, the participating teams are astute, and their deep technical expertise will be focused on winning as the competition defines it. In some cases, that definition may unintentionally deviate from the intended objectives.
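To make the over-fitting point concrete, here is a minimal sketch in Python. It is my own illustration, not something shown in the discussion. It assumes 500 hypothetical ‘models’ that are nothing more than coin flips, scores each on a small public sample and on a larger private one, and then picks the winner by public score alone. The function name, the sample sizes and the seed are all assumptions.

```python
import random


def leaderboard_overfit(n_models=500, n_public=100, n_private=2000, seed=0):
    """Select the best of many no-skill models by public score.

    Every hypothetical model guesses correctly with probability 0.5, so any
    gap between the public and private scores is selection noise only.
    """
    rng = random.Random(seed)
    best_public, best_private = -1.0, 0.0
    for _ in range(n_models):
        # Accuracy on the small public sample used for ranking.
        public = sum(rng.random() < 0.5 for _ in range(n_public)) / n_public
        # Accuracy of the same model on a larger, unseen private sample.
        private = sum(rng.random() < 0.5 for _ in range(n_private)) / n_private
        if public > best_public:
            best_public, best_private = public, private
    return best_public, best_private


public_score, private_score = leaderboard_overfit()
print(f"winner on public sample: {public_score:.2f}, on private sample: {private_score:.2f}")
```

The winner should look comfortably better than chance on the public sample while staying close to 50% on the private one, which is the kind of gap a benchmark built on a small sample can reward.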

Finally, Jeremy raised an interesting point: the best models are those that come from outside the subject domain. Existing benchmarks are set by those familiar with the subject domain and thus well versed in the conventional ways. A ‘game changing shift’ requires a radical approach, which generally comes from applying patterns from other domains and fields.

It is going to be interesting to watch as new algorithms and tools are leveraged in established industries. Though I am not a big fan of ‘payday loan companies’, the founder of ZestCash was interviewed at the same conference about how his company is using thousands of attributes coalesced into 10 models to better rate and underwrite loans to the underbanked. This is quite revolutionary in an industry where the ‘benchmark’ is a regression model developed in the 70s using 12-15 criteria and commonly referred to as FICO.

    • #bigdata
    • #analytics
    • #GLM
    • #insurance
    • #Allstate
    • #machine learning
  • 1 year ago

Interesting tidbits of information: though cloud computing (assumed to be a catch-all for IaaS, PaaS and SaaS) features prominently on CIO agendas, analytics (including Big Data and visualization) is not that prominent.

Perhaps application modernization means supporting application access from mobile / tablet devices through a service oriented architecture, where existing applications can be invoked by disparate presentation layers. If that is the case, it is really the rise of SOA, though with a refreshed business case.

    • #SaaS
    • #bigdata
    • #CIO
    • #infographic
  • 1 year ago

Big Data Bug Bites GE

GE is taking a leaf from IBM’s playbook in establishing an internal Center of Excellence (COE) to act as a catalyst / accelerator within its vast organization for a cloud centric approach. As one of its rivals, Rolls-Royce, has demonstrated, the margins on pre-emptive maintenance exceed those on the initial sale of the aircraft engines.

It will probably take GE 2-3 years to demonstrate a visible impact, though it is an affirmation that the age of network connected industrial equipment is here.

    • #bigdata
    • #cloud computing
    • #gpu
  • 1 year ago

Every minute, 925 iPhones are sold and 1,820TB of data is generated!

    • #iPhone
    • #bigdata
  • 1 year ago

Is R going to displace Excel as the de facto modeling tool in modern business?

Five or six years seems like an eternity. Though many may not realize it, huge strides have been made in the technological horsepower available at our fingertips.

For instance, back in 2005 this would have been deemed a high-end workstation:

Pentium 4 @ 3.2GHz w/ 2GB of DDR RAM (and, if necessary, a 10k rpm HDD)

In fact, the predominant usage of such machines would have been in the graphics department, where Adobe’s suite would make use of all the floating point performance and cores it could get.

Oddly enough, in those days all analytics, leveraging the likes of SAS or SPSS, would be run on UNIX servers or, in the case of large clients, on old iron (aka mainframes). Sure, the vendors had PC versions of their flagship programs, but those would be used for algorithm development (akin to expensive IDEs), while serious work and production usage happened on traditional servers.

So what has changed? 

A couple of things:

  1. Computing power has come a long way. A presentation made by EMC at one of its keynote speeches captured how far things have progressed: since 2005, there has been a 20x (that is, twentyfold!) increase in computing power. Admittedly, one has to be using software optimized to make use of the cores through parallelism and other tricks.
  2. The availability of open source solutions like R, and their use in universities, has fueled the usage of advanced statistics and mathematical constructs like never before. In most corporate environments, the analytics tool tends to be Microsoft Excel and the odd plug-in. Now, for experimentation and investigation, all one needs to do is download and use R. Though there are gestapo-like IT policies in force in most corporations, R is fairly self-contained and does not require the user to have administration privileges. In certain analytics-heavy professions such as actuarial science, there are textbooks on the usage of R for common problems.
  3. The advent of cheap consumer-grade SSDs has changed the equation more than people realize. Popularized by Apple through the MacBook Air, the SSD puts IO throughput in one’s lap that most enterprise SANs struggle to achieve (aside from critical Tier 1 applications, most NAS shares for office use are slow…). This helps with integrating and loading disparate data sets, and speeds up prototyping by reducing the time for analysis runs. In the initial development phase, most analytics jobs are bound by IO time rather than CPU time.
  4. Through the efforts of Amazon and others, IaaS (Infrastructure as a Service) has matured to a point where users, when the time is right, can tap into HPC power and run applications like R or Datameer on a scale that most corporations would struggle to match given their internal resource constraints. And given the sporadic need for such resources, the business rationale for building them in-house would be questionable, especially if one can lease them from a third party like Rackspace or Amazon on an as-needed basis.

Nowadays, most desktops and laptops easily have 4GB of RAM, if not 8GB. Coupled with SSDs, one can easily build fairly comprehensive prototypes using tools like R. Once ready to deploy, it is only a matter of leveraging Amazon EC2-like infrastructure to gain scale in terms of computational power.

All this allows one to use their desktop or high end laptop to build and validate complex models using data sets that a few years ago would have needed formal IT support and infrastructure. 

It will not be surprising to see, in the next few years, increasing sophistication in how we use data to support business decisions, with models serving as lenses that shape our perspectives. This would also imply higher expectations of those providing organizations with advice and expert opinions. Clients will begin to expect consulting firms to have an analytics infrastructure and comprehensive data sets that can enrich the client’s internal information and help develop insights and models for ongoing inference. Tools like R will become the norm in businesses, just as Excel is today the de facto lingua franca for business models.

Are you ready?

    • #bigdata
    • #analytics
    • #R
    • #IaaS
    • #Actuary
  • 1 year ago

Advanced Analytics and Transaction Monitoring in Money Laundering Enforcement

I am currently attending the annual ABA / ABA Money Laundering Enforcement Conference in Washington DC.

An interesting theme this year is the validation of the monitoring systems and ‘models’. Regulators are building the talent and wherewithal to validate models developed by financial institutions (mostly banks) to detect anomalous behaviour warranting further investigation.

Though most such models are really fancy means of pattern classification and peer benchmarking, it will not be surprising to see some institutions having to address weaknesses. Ironically, the weakness is not going to be some complex Bayesian inference logic, but a more mundane (though much more difficult to resolve) issue such as the data coming from source systems.
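For readers unfamiliar with peer benchmarking, a minimal sketch in Python follows. It is an illustration under my own assumptions, not any vendor’s or regulator’s logic: each account’s transaction total is compared with the mean of its peer group, and an account is flagged when it sits more than a chosen number of standard deviations away. The function name, the threshold and the account figures are all hypothetical.

```python
from statistics import mean, stdev


def flag_outliers(peer_totals, threshold=2.0):
    """Return the accounts whose total is more than `threshold` sample
    standard deviations away from the peer group mean."""
    values = list(peer_totals.values())
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # every peer behaves identically, nothing stands out
    return [account for account, total in peer_totals.items()
            if abs(total - mu) / sigma > threshold]


# Made-up monthly cash deposit totals for ten accounts in one peer group.
peers = {f"acct{i}": 1000 + 10 * i for i in range(9)}
peers["acct9"] = 50000

flagged = flag_outliers(peers)
print(flagged)
```

Note how much the result depends on the inputs: a mislabelled peer group or a missing feed from a source system shifts the mean and the spread, which is exactly why the data issues tend to matter more than the math.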

In addition, the regulatory bar is rising, with regulators expecting testing to consider the black box nature of some of the vendor offerings and seeking details on the logic behind the rules (for the ‘white box’ versions).

It will not be surprising to see that the supporting IT departments will be expected to build processes and enhance their methodology to test, validate and enhance models.

In addition, once the foundational issues of data sourcing and quality are surmounted, it would not be surprising to see the sophistication of models increase. The evolution would be similar to that witnessed in the insurance marketplace with pricing and rating models.

Interesting times, indeed.

    • #money laundering
    • #validation
    • #BigData
    • #analytics
  • 1 year ago

Impact of Big Data on Infrastructure

In an earlier post, Big Data was discussed and defined. In short, Big Data has two primary drivers: “computer generated data” and the incorporation of “unstructured” data.

Some of the emerging trends in networking and storage are indicative of the impact of Big Data, along with emerging computing trends such as Unified Communications and other collaboration-enhancing technologies.

For now, let’s consider the network perspective:

Traditionally, networks had been designed to accommodate traffic between the desktops / workstations in the enterprise and the core transactional systems residing in the data center(s). Though the computing paradigm has evolved from mainframes and terminals through client-server and n-tier web to what may now be best described as a hybrid cloud strategy, the flow of data has been fairly consistent.

Networking experts would call that ‘north-south’, where north is synonymous with end clients and south is typically where the core platforms reside. As computing density has increased and collaboration-enhancing technologies have become more common, the data flow between the end clients (‘east-west’) is growing in volume and, if trends continue, will exceed ‘north-south’.
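One way to check which direction dominates on a given network is to tally flow records by direction. The sketch below, in Python, uses the definitions from this post (client to core is ‘north-south’, client to client is ‘east-west’). The two subnets and the flow records are hypothetical; a real network would need its own address plan and a real source of flow data.

```python
from ipaddress import ip_address, ip_network

CLIENT_NET = ip_network("10.1.0.0/16")   # hypothetical end-client subnet
CORE_NET = ip_network("10.200.0.0/16")   # hypothetical data center subnet


def classify(src, dst):
    """Label a flow using this post's definitions of the two directions."""
    s, d = ip_address(src), ip_address(dst)
    if s in CLIENT_NET and d in CLIENT_NET:
        return "east-west"
    if (s in CLIENT_NET and d in CORE_NET) or (s in CORE_NET and d in CLIENT_NET):
        return "north-south"
    return "other"


def volume_by_direction(flows):
    """Sum bytes per direction for (source, destination, bytes) records."""
    totals = {"north-south": 0, "east-west": 0, "other": 0}
    for src, dst, nbytes in flows:
        totals[classify(src, dst)] += nbytes
    return totals


# Made-up flow records: (source, destination, bytes transferred).
sample_flows = [
    ("10.1.0.5", "10.200.0.9", 500),
    ("10.1.0.5", "10.1.3.7", 1200),
    ("10.200.0.9", "10.1.0.5", 300),
]
totals = volume_by_direction(sample_flows)
print(totals)
```

Tracking the ratio of the two totals over time would show whether client-to-client traffic is actually catching up on a particular network.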

This has several implications. Most of the routing investment and core networking infrastructure needs to be upgraded and enhanced to accommodate traffic that probably was not anticipated when the core network was designed. Specifically, all the old 10 Mbps and 100 Mbps switches which are still common at the end points need to be upgraded to 1 Gbps. Alternatively, enterprises may choose to roll out sufficiently redundant 802.11n networks for the new generation of end clients such as tablets. Either way, the core switches / routers need to be upgraded to handle the additional load and traffic.

In the data centers themselves, virtualization has led to increasing computing densities and ever more unified storage / network topologies. This implies even higher switch densities and perhaps a federated storage infrastructure (more on that later).

From a storage perspective: 

Traditional storage consolidation through the use of SANs is for the most part complete in most enterprises. Some of the challenges include the constraints of performance and mixed workloads on the underlying disks. SAN vendors have attempted to accommodate disparate workloads through the use of RAM and now SSD to serve as a buffer or Tier 0 storage for frequently accessed data. Given the data explosion and the drop in the cost of SSDs, though, all-flash / SSD SAN implementations are becoming more common. For heavy IO workloads, it is perhaps slightly more expensive to use a SAN built from the ground up for SSD storage. The expense side has to factor in the physical space in the datacenter, the cooling and the overall power consumption for the IOPS that SSD can provide in comparison to traditional disks (aka ‘spinning disk’). There are also intangibles such as the optimization of a multi-tier infrastructure and the fact that in traditional SANs the controllers are bottlenecks to realizing the full potential of SSD.

Expect to see startups like Violin gain market share in enterprise shops, forcing incumbents such as EMC, IBM and Hitachi to start offering similar products (i.e., new product lines, not bolt-ons to their current SAN architectures).

In addition, analogous to client-server architectures transitioning to web-based applications, it would not be surprising if overall storage management decentralized slightly toward a more federated solution architecture. There may be a compelling case for what are to become legacy applications to be based on a single SAN platform due to the need for synchronized fail-over, transaction criticality, etc. However, for large volumes of data needing to be processed and synthesized in real-time, dedicated SSD arrays are a good fit and, in some ways, a more explicit recognition of what data warehousing appliances (e.g., Teradata) have implied for a while.

Expect the next wave of infrastructure investment to be heavily influenced by such trends. It will be the first time in the last 5 or so years that the majority of network investment may be outside of the data center. Similarly, the use of SSD-based SANs will pose a challenge to incumbent vendors and their platforms unlike any they have experienced in the prior decade.

    • #BigData
    • #network fabric
    • #network
    • #storage
    • #Violin
    • #Flash
    • #SSD array
    • #SSD
  • 1 year ago

Impact of Hadoop on Data Movement Space

Recently, there have been several announcements (about five new offerings in May 2011) from established vendors such as IBM, EMC, NetApp and others around their commitment to and support for Hadoop. Of course, apart from the software and hardware bundles, one can go to Cloudera to license a supported version and build one’s own infrastructure around it.

So what is the impact on the current vendor ecosystem? In some ways it is analogous to when Oracle introduced a supported version of Linux (then known as ‘Unbreakable Linux’). In a nutshell, the aim was to strengthen the relationship with the end customer while maximizing the land grab in terms of IT real estate. It was not important whose Linux distribution was being used by a client, as long as it was not Windows. That meant more money for Oracle licenses while mitigating the likelihood of SQL Server, Exchange and .NET development tools gaining traction.

Similarly, this seems to be a play by the DW vendors to gain capabilities around the transformation and load dimensions of ETL. The extract portion is not as relevant in the BigData paradigm, as most of the data sources (aka generators) use proprietary data stores or simply stream raw data into flat files.

This has some rather interesting ramifications for ETL vendors. Independent ones such as Informatica are launching products that address real-time user requirements (e.g., Ultra Messaging), while those that are part of a larger product suite (e.g., IBM’s InfoSphere Service Director) are attempting to add value by exposing existing enterprise data stores to more event-driven data consumers.

Overall, the traditional premium for ETL products is fading and the marketplace is ripe for consolidation. For current enterprises, given Moore’s Law, most data movement that is facilitated via ETL can be handled through more real-time integration suites. For the boundary conditions and data sets in what is termed ‘Big Data’ (e.g., web clickstreams), Hadoop-centric tools will be more cost effective.

As data volumes continue to grow, and decision cycle times shrink, batch approaches such as Hadoop will be replaced by frameworks and architectures that are more ‘real time’. Google already has a ‘real time’ search. How ready is your enterprise for that?
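The difference between the two styles can be sketched in a few lines of Python. This is a toy illustration only and uses no Hadoop or vendor API: the batch function rescans the whole click log to produce an answer, while the streaming class updates its answer as each event arrives. The class, the function and the clickstream records are all hypothetical.

```python
from collections import Counter


def batch_page_counts(events):
    """Batch style: scan the complete click log and count hits per page."""
    return Counter(page for _user, page in events)


class StreamingPageCounts:
    """Incremental style: update the counts as each click arrives."""

    def __init__(self):
        self.counts = Counter()

    def update(self, event):
        _user, page = event
        self.counts[page] += 1
        return self.counts[page]  # the current answer is available immediately


# Made-up clickstream of (user, page) events.
clicks = [("u1", "/home"), ("u2", "/pricing"), ("u1", "/pricing")]

stream = StreamingPageCounts()
for click in clicks:
    stream.update(click)

print(stream.counts == batch_page_counts(clicks))
```

Both approaches arrive at the same counts. What changes is when the answer becomes available: after the full scan in the batch case, or after every single event in the incremental case.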

    • #hadoop
    • #etl
    • #datawarehouse
    • #database
    • #bigdata
  • 1 year ago

[Embedded video: http://www.youtube.com/embed/HwVPxYWDO4w]

This is an interesting perspective, though the context is traditional financial reporting. Much is changing, specifically in the area of customer intimacy as well as in the financial services industry, where the volume of transaction data is far greater than, say, that behind your P&L statement.

Having said that, the other perspectives on evolution of data use and sense making are well represented.

    • #information
    • #reporting
    • #BigData
    • #sensemaking
  • 2 years ago

What is “Big Data” and Should You Care…

If there is a phrase that is on the hype trajectory these days (a bit like HTML5 is to software developers), it is Big Data. So what is Big Data, really? It is the notion of having to analyze and sift through petabytes of data to gain some competitive edge.

The reality is that in most enterprises, the core financial systems are still dealing with data volumes in the gigabytes range. So where is all this additional information coming from? How can a company have petabytes of data to sift through? Are we all going the route of Google, which apparently processes 20 petabytes daily?

In short, Big Data has two primary drivers: “computer generated data” and the incorporation of “unstructured” data.

Computer (aka machine) generated data has traditionally included system logs, readings from measurement devices, etc. Lately, though, innovative technologies such as RFID and location information from mobile devices have created a preponderance of data that organizations wish to keep and mine for trends. For instance, it is estimated that ~30 million intelligent meters (part of another much-touted concept, the ‘smart grid’) can generate up to 40TB of data in a 90-day period. As volumes and retention periods grow, it is easy to see the amount of data rising very rapidly. In fact, some utility companies are leveraging data warehouse technologies to store the raw smart meter data prior to detailed analysis.
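A quick back-of-the-envelope calculation in Python puts the smart meter estimate in perspective. It assumes decimal terabytes (10^12 bytes) and a constant generation rate over the year; the estimate itself states neither, and the helper names are my own.

```python
TB = 10**12  # decimal terabyte in bytes (an assumption; the estimate does not say)


def bytes_per_meter_per_day(total_bytes, meters, days):
    """Average data volume produced by one meter in one day."""
    return total_bytes / (meters * days)


def annualized_tb(total_tb, days):
    """Scale a volume observed over `days` to a 365-day year."""
    return total_tb * 365 / days


per_meter = bytes_per_meter_per_day(40 * TB, 30_000_000, 90)
yearly_tb = annualized_tb(40, 90)
print(f"{per_meter / 1000:.1f} KB per meter per day, {yearly_tb:.0f} TB per year")
```

That works out to roughly 15 KB per meter per day and a little over 160 TB per year across the fleet. Each device produces very little; the volume comes from the number of devices and the retention period.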

The emergence of the social web and, in general, web 2.0 feeds into the organizational desire to perform sentiment analysis in addition to the now commonplace textual and voice / speech analytics. This in itself can result in large data volumes (as exemplified by Google) and has led to the emergence of new tools such as Hadoop. The current focus is how best to unify the process of analysis across unstructured and structured data in a manner that is seamless and inconspicuous to the overall discovery.

So to answer the question posed at the outset, ‘Yes, you should care about the data volumes increasing exponentially’. The caveat is that storage of the data itself is the least of your worries. More importantly, the overarching approach to analysis and decision making needs to be fundamentally re-examined in the context of such large data volumes. This is where most organizations will struggle…

    • #analytics
    • #bigdata
    • #petabyte
    • #technology
  • 2 years ago

About

Timely articles on shifting paradigms in the world and their impact on business models, technology and social ramifications...

Me, Elsewhere

  • @naumannoor on Twitter
  • ubercynical on Flickr
  • naumannoor on Foursquare