Musings from the World of Consulting


Self-Service goes mainstream in Business Intelligence and Analytics

EMC’s recent acquisition of Pivotal Labs coincided with the release of Greenplum Chorus. The acquisition seems to be driven by the need to inject talent and leadership around Agile delivery into EMC’s internal software organization; large software shops need a bit of cultural change from time to time to enhance productivity and remain nimble. For Pivotal, its distribution and marketing prowess just got a shot in the arm - EMC’s sales and marketing, coupled with its penetration in F1000 companies, will help it compete better against the likes of Rally just as Agile delivery is starting to gain traction.

The main story, though, is around the notion of self-service. Just a few years ago, the idea of business users writing their own queries would have given their IT counterparts the shivers. IT has long had a centralized, locked-down mentality when it came to business intelligence and analytics. IT felt that only it was intimately familiar with the optimization of data access patterns and that it understood the ramifications of data distribution better than the data’s owners. Times are changing, though: most analytically oriented databases can manage ad hoc workloads better through smarter query optimization and processing, better-suited designs (e.g., columnar databases) and new infrastructure capabilities (e.g., SSD, in-memory architecture). In addition, the tools used by business users, such as MicroStrategy or Cognos, are getting better at pushing down the various operations to the database, allowing for better use of processing power.
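
To make the push-down idea concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for an analytical database. The `sales` table, its columns and the helper function names are hypothetical and chosen only for illustration; this is not how MicroStrategy or Cognos generate their SQL. The point is just the contrast between pulling every row to the client and letting the database do the aggregation.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with a small, made-up sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    rows = [("east", 100.0), ("east", 250.0), ("west", 75.0),
            ("west", 125.0), ("north", 300.0)]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

def totals_client_side(conn):
    """Fetch every row, then aggregate inside the client tool."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] = totals.get(region, 0.0) + amount
    return totals

def totals_pushed_down(conn):
    """Let the database aggregate; only one row per region comes back."""
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    return dict(conn.execute(query).fetchall())

conn = build_demo_db()
client_totals = totals_client_side(conn)
pushed_totals = totals_pushed_down(conn)
print(client_totals == pushed_totals)
```

Both functions arrive at the same totals, but the pushed-down version moves one row per region instead of one row per sale, which is where the saving comes from on large tables.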

In the last year or so, the notion of self-service has expanded beyond queries. Chorus is a prime example of this trend. Some of its capabilities are foundational, like a federated metadata repository and search. On those dimensions, it addresses areas where Greenplum was lagging, and with this release Greenplum is closer to par with some of its competitors.

The more notable and leading capabilities are self-service provisioning (or ‘spinning out’) of data sets for studies and the ability for users to integrate their own data sources via REST or by uploading common file formats. If the underlying data is stored appropriately, this allows data scientists to be self-sufficient for most of their daily activities, without relying on IT support. In addition, it accelerates the integration of third-party data sources in the investigative phase, enhancing overall organizational learning productivity.

Finally, a few of the capabilities are an implicit acknowledgement that researchers and data analysts are not the most organized bunch - the concept of shared libraries and code seems like a marketing euphemism for code management tools. For most developers these are second nature; for analytics departments that need to scale as they grow, they become a necessity to preserve intrinsic knowledge and manage rapid iterations across a large team.

Like EMC, Microsoft has been focusing more heavily on the accessibility and integration of third-party data (versus ‘spinning out’ of data marts). Its first foray was the launch of the Data Marketplace on Azure, which offers applications in addition to data sets. Supposedly, building on SQL Server 2012, there will be a ‘private’ version of the marketplace for use within the organization that would allow users to collaborate on queries, data sets and visualizations.

It will be interesting to see how the database vendors (e.g., Teradata) react and how the vertically integrated offerings (e.g., Cognos + DB2 from IBM) evolve to address the growing awareness that social collaboration is key to unlocking the information potential of corporate data assets.

    • #EMC
    • #Greenplum
    • #Chorus
    • #self-service
    • #BI
    • #Analytics
    • #Microsoft
  • 1 year ago
[Embedded video: Livestream clip from the gigaombigdata channel, http://cdn.livestream.com/embed/gigaombigdata]

This was a great discussion on how established leaders in their markets (e.g., Allstate in the insurance pricing arena) are using competition platforms (in this case Kaggle) to bring new algorithmic concepts and analytical approaches to classic problems.

Allstate’s competition was discussed and some interesting perspectives shared.

Eric (from Allstate) mentioned that some of the winning algorithms would have been too complex to explain to Allstate’s customers and thus may not have been suitable for practical implementation. That statement is only partially true, as most insurance carriers already use GLMs and derivative models, whereas the public’s general comprehension ends with simple regression. While the causes of higher premiums may be explained in generic terms, the actual contributors are usually deemed trade secrets. The ‘disconnect from the model’ or ‘black box model’ perception is going to increase over time as complex algorithms (sometimes referred to as ‘machine learning’) begin to gain prominence in more consumer-facing interactions.

Another point that was tangentially discussed was the inadvertent over-fitting of the submitted models to the sample data set. This is not surprising given the competitive personalities engaged on Kaggle, as Jeremy (Kaggle’s Chief Scientist) noted. The takeaway is that data preparation is paramount and the criteria used to benchmark submissions need further scrutiny. After all, the participating teams are astute, and their deep technical expertise will be focused on winning as the competition defines it. In some cases, that definition may unintentionally deviate from the perceived objectives.
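
Over-fitting to a sample is easy to reproduce on a toy problem. The sketch below is purely illustrative and has nothing to do with the actual Allstate data or the Kaggle submissions: the data is synthetic (a straight line plus noise), and the two 'models' (a least-squares line and a nearest-neighbour memorizer) and all function names are assumptions picked to make the effect visible with only the standard library.

```python
import random

def make_data(n, rng):
    """Noisy samples of a simple linear relationship y = 2x + noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0.0, 1.0)) for x in xs]

def fit_line(train):
    """Ordinary least squares for a single predictor (closed form)."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_memorizer(train):
    """1-nearest-neighbour: reproduces the training sample, noise included."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def mse(model, data):
    """Mean squared error of a model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(42)
train, holdout = make_data(200, rng), make_data(200, rng)
line, memorizer = fit_line(train), fit_memorizer(train)

for name, model in [("line", line), ("memorizer", memorizer)]:
    print(f"{name}: train MSE={mse(model, train):.3f}, "
          f"holdout MSE={mse(model, holdout):.3f}")
```

The memorizer scores perfectly on the sample it was fitted to and noticeably worse on data it has not seen, which is why a benchmark computed on a known sample says little on its own about real-world performance.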

Finally, Jeremy raised an interesting point on how the best models come from those outside the subject domain. Existing benchmarks are set by people familiar with the subject domain and thus well versed in the conventional ways. A ‘game changing shift’ needs a radical approach, which generally emerges when patterns from other domains or fields are applied.

It is going to be interesting to watch as new algorithms and tools are leveraged in established industries. Though I am not a big fan of ‘payday loan companies’, the founder of ZestCash was interviewed at the same conference about how his company is using thousands of attributes coalesced into 10 models to better rate and underwrite loans to the underbanked. This is quite revolutionary in a field where the ‘industry benchmark’ is a regression model developed in the 70s using 12-15 criteria and commonly referred to as FICO.

    • #bigdata
    • #analytics
    • #GLM
    • #insurance
    • #Allstate
    • #machine learning
  • 1 year ago

Use of R as an embedded analytics engine

Oracle has made an interesting move in leveraging the open source language R to fill a gap in its portfolio when competing against the likes of IBM (which acquired SPSS) and ‘best of breed’ players like SAS. 

Similar to SAS’s data manipulation extensions for common databases (pushing down the manipulation, filtering etc. to the database), though more like Revolution Analytics’ integration with Netezza (now part of IBM), Oracle has converted base R scripts for execution within the database. Though this is really a first release (e.g., lacking fine-grained workload management), it is a great move in burnishing its credentials in a space gaining prominence in organizations.

Some thoughts to ponder:

  1. Use of R eases adoption by the upcoming generation of statisticians and actuaries (generally referred to as ‘data scientists’). This is a great advantage in terms of dislodging incumbents such as SAS and SPSS and increasing Oracle’s share of software license fees. In this model, greater use of R-based analytics in operations results in a larger number of processors (technically cores) being licensed from Oracle, while organizations remain free to use open source R for desktop-based modeling and development.
  2. This addresses the biggest shortcoming of R to date, which has been its lack of scalability. Though there are commercial implementations such as those from Revolution Analytics, the ubiquitous nature of Oracle in enterprise computing environments makes it so much easier to scale ideas from the innovation lab into commercial reality.
  3. Though there is some integration with Oracle’s reporting suites, it is a bit of a kludge. Longer term, it will not be surprising to see the CRAN suite of visualizations natively implemented in OBIEE and leveraged by default when accessing R script output. Tighter integration will aid visualization development, which in today’s world mimics business presentation prep in the early 1990s - anyone remember Harvard Graphics on DOS?
  4. Commercial support for R should aid in its adoption and usage. The latest Redmonk language survey indicates it is the sixth fastest growing language (currently placed in the Tier 2 cluster along with MATLAB and Scala). It will be interesting to see if growth accelerates sufficiently for it to lead the pack when it comes to high-level languages for analysis and general computation.

Next on my wish list is an IDE that can aid R scripting in a way that IntelliJ aids Java and Scala (RStudio is just very first gen, IMHO). Thoughts?

    • #R
    • #analytics
    • #oracle
    • #Revolution Analytics
    • #visualization
  • 1 year ago

Is R going to displace Excel as the de facto modeling tool in modern business?

Five to six years seems like an eternity. Though many may not realize it, huge strides have been made in the technological horsepower available at our fingertips.

For instance, back in 2005, this would have been deemed a high-end workstation:

Pentium 4 @ 3.2GHz w/2GB of DDR RAM (and, if necessary, a 10k rpm HDD)

In fact, the predominant usage of such machines would have been in the graphics department, where Adobe’s suite would make use of all the floating point and multiple cores it could get.

Oddly enough, in those days all analytics, leveraging the likes of SAS or SPSS, would be run on UNIX servers or, in the case of large clients, on old iron (aka mainframes). Sure, the vendors had PC versions of their flagship programs, though those would be used for algorithm development (akin to expensive IDEs), while serious work or production usage happened on traditional servers.

So what has changed? 

A few things:

  1. Computing power has come a long way. A presentation by EMC at one of its keynotes captured how far things have progressed: since 2005, there has been a 20-fold (that is, twenty!) increase in computing power. Admittedly, one has to be using software optimized to make use of the cores through parallelism and other tricks.
  2. The availability of open source solutions like R, and their use in universities, has fueled the usage of advanced statistics and mathematical constructs like never before. In most corporate environments, the analytics tool tends to be Microsoft Excel and the odd plug-in. Now, for experimentation and investigation, all one needs to do is download and use R. Though there are gestapo-like IT policies in force in most corporations, R is fairly self-contained and does not require the user to have administration privileges. In certain analytics-heavy professions such as actuarial science, there are textbooks on the usage of R for common problems.
  3. The advent of cheap consumer-grade SSDs has changed the equation more than people realize. Popularized by Apple through the MacBook Air, they put IO throughput in one’s lap that most enterprise SANs struggle to achieve (aside from critical Tier 1 applications, most NAS shares for office use are slow…). This aids the integration and loading of disparate data sets, and speeds up prototyping by reducing the time for analysis runs. In the initial development phase, most analytics-related jobs are bound by IO time rather than CPU time.
  4. Through the efforts of Amazon and others, IaaS (Infrastructure as a Service) has matured to a point where users, when the time is right, can tap into HPC power and run applications like R or Datameer on a scale that most corporations would struggle to match given their internal resource constraints. And given the sporadic need for such resources, the business rationale for owning them would be questionable, especially if one can lease them from a third party like Rackspace or Amazon on an as-needed basis.
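
The caveat in the first point above, that software has to be written to use the extra cores, can be quantified with Amdahl's law: if a fraction p of a job can run in parallel across n cores, the best-case speedup is 1 / ((1 - p) + p / n). A minimal sketch follows; the parallel fractions and the core count of 20 are illustrative values chosen to echo the 20-fold figure, not numbers taken from the EMC presentation.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Best-case speedup under Amdahl's law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

for p in (0.5, 0.95, 1.0):
    print(f"parallel fraction {p:.2f}: "
          f"{amdahl_speedup(p, 20):.2f}x on 20 cores")
```

Only a perfectly parallel job gets the full 20x; a job that is half serial barely doubles in speed, however many cores are thrown at it.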

Nowadays, most desktops and laptops easily have 4GB of RAM, if not 8GB. Coupled with SSDs, one can easily build fairly comprehensive prototypes using tools like R. Once ready to deploy, it is only a matter of leveraging Amazon EC2-like infrastructure to gain scale in terms of computational power.

All this allows one to use their desktop or high end laptop to build and validate complex models using data sets that a few years ago would have needed formal IT support and infrastructure. 

It will not be surprising to see, in the next few years, increasing sophistication and advancement in how we use data to support business decisions, with models serving as lenses that shape our perspectives. This would also imply higher expectations of those providing such organizations with advice and expert opinions. Clients will begin to expect analytics infrastructure to support comprehensive data sets that consulting firms can leverage to enrich their internal information and develop insights and models for ongoing inference. Tools like R will become the norm in businesses, just as Excel is today the de facto lingua franca for business models.

Are you ready?

    • #bigdata
    • #analytics
    • #R
    • #IaaS
    • #Actuary
  • 1 year ago

Advanced Analytics and Transaction Monitoring in Money Laundering Enforcement

I am currently attending the annual ABA / ABA Money Laundering Enforcement Conference in Washington DC.

An interesting theme this year is the validation of the monitoring systems and ‘models’. It is interesting that regulators are building the talent and wherewithal to validate models developed by financial institutions (mostly banks) to detect anomalous behaviour warranting further investigation.

Though most such models are really fancy means of pattern classification and peer benchmarking, it will not be surprising to see some institutions having to address weaknesses. Ironically, the weakness is not going to be some complex Bayesian inference logic, but a more mundane (though much more difficult to resolve) issue such as the data from source systems.
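
For readers unfamiliar with peer benchmarking, here is a deliberately simplified sketch: each account's activity is compared with that of its peer group, and accounts sitting far above the group average are flagged for review. The account names, the volumes, the z-score rule and the threshold of 2.0 are all assumptions made for illustration; they do not describe any vendor's product or any regulator's expectations.

```python
from statistics import mean, stdev

def flag_against_peers(volumes, threshold=2.0):
    """Return accounts whose volume sits more than `threshold` sample
    standard deviations above the mean of their peer group."""
    values = list(volumes.values())
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # every peer behaves identically; nothing stands out
    return sorted(acct for acct, v in volumes.items()
                  if (v - mu) / sigma > threshold)

# Hypothetical monthly transaction volumes for one peer group.
peer_group = {
    "acct_01": 9000, "acct_02": 9500, "acct_03": 10000,
    "acct_04": 10500, "acct_05": 11000, "acct_06": 9800,
    "acct_07": 10200, "acct_08": 9900, "acct_09": 10100,
    "acct_10": 95000,
}

flagged = flag_against_peers(peer_group)
print(flagged)
```

Even in this toy version, the result depends entirely on the volumes and peer group assignments being right, which is the data sourcing problem noted above.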

In addition, the regulatory bar is rising, with regulators expecting testing to consider the black box nature of some of the vendor offerings and seeking details on the logic behind the rules (for the ‘white box’ versions).

It will not be surprising to see supporting IT departments being expected to build processes and enhance their methodology to test, validate and enhance models.

In addition, once the foundational issues of data sourcing and quality are surmounted, it would not be surprising to see the sophistication of models increase. The evolution would be similar to that witnessed in the insurance marketplace with pricing and rating models.

Interesting times, indeed.

    • #money laundering
    • #validation
    • #BigData
    • #analytics
  • 1 year ago

What is “Big Data” and Should You Care…

If there is a phrase on the hype trajectory these days (a bit like HTML5 is for software developers), it is Big Data. So what is Big Data, really? It is the notion of having to analyze and sift through petabytes of data to gain some competitive edge.

The reality is that in most enterprises, the core financial systems are still dealing with data volumes in the gigabytes range. So where is all this additional information coming from? How can a company have petabytes of data to sift through? Are we all going the route of Google, which apparently processes 20 petabytes daily?

In short, Big Data has two primary drivers: “computer generated data” and the incorporation of “unstructured” data.

Computer (aka machine) generated data has long included data such as system logs and readings from measurement devices. Lately, though, innovative technologies such as RFID and location information from mobile devices have created a preponderance of data that organizations wish to keep and mine for trends. For instance, it is estimated that ~30 million intelligent meters (part of another much touted concept, the ‘smart grid’) can generate up to 40TB of data in a 90-day period. As the volume and retention periods extend, it is easy to see the volumes of data rising very rapidly. In fact, some utility companies are leveraging data warehouse technologies to store the raw smart meter data prior to detailed analysis.
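
A quick back-of-the-envelope calculation puts the smart meter estimate in perspective. It takes the figures quoted above at face value and assumes decimal terabytes (10^12 bytes) and a steady rate of data generation; both are simplifying assumptions, and the 40TB figure is an upper bound.

```python
TB = 10 ** 12  # decimal terabyte, in bytes (an assumption)

meters = 30_000_000      # ~30 million intelligent meters
total_bytes = 40 * TB    # up to 40TB ...
days = 90                # ... over a 90-day period

bytes_per_meter_per_day = total_bytes / (meters * days)
tb_per_year = total_bytes / TB * 365 / days

print(f"{bytes_per_meter_per_day / 1000:.1f} kB per meter per day")
print(f"{tb_per_year:.0f} TB per year for the whole fleet")
```

Each meter contributes only about 15 kB a day; it is the number of devices and the retention period that push the total into the hundreds of terabytes per year.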

The emergence of the social web, and web 2.0 in general, feeds into organizational desires to perform sentiment analysis in addition to the now commonplace textual and voice / speech analytics. This in itself can result in large data volumes (as exemplified by Google) and has led to the emergence of new tools such as Hadoop. The current focus is how best to unify the process of analysis across unstructured and structured data in a manner that is seamless and inconspicuous to the overall discovery.

So, to answer the question posed at the outset: ‘Yes, you should care about data volumes increasing exponentially’. The caveat is that storage of the data itself is the least of your worries. More importantly, the overarching approach to analysis and decision making needs to be fundamentally re-examined in the context of such large data volumes. This is where most organizations will struggle…

    • #analytics
    • #bigdata
    • #petabyte
    • #technology
  • 2 years ago

Benchmarking Performance of Analytics Environments

Though meant for HPC clusters, it is applicable and very apropos for complex analytics environments, which today may comprise a multitude of platforms, such as:

1. Data appliance (à la Teradata or IBM’s Netezza)

2. Compute devices (such as a server running R)

3. Data staging area (e.g., a NAS hosted file share)

Of course, if one were to use Greenplum instead of a data appliance, there would be a bit more complexity in the storage subsystem. That would require additional tools to understand whether the design is tuned to the IO profile of the anticipated analytics.

Still a great starting point.

    • #analytics
    • #bigdata
    • #hpc
    • #teradata
    • #netezza
    • #greenplum
  • 2 years ago

About

Timely articles on shifting paradigms in the world and their impact on business models, technology and social ramifications...

Me, Elsewhere

  • @naumannoor on Twitter
  • ubercynical on Flickr
  • naumannoor on Foursquare
  • Google
  • Linkedin Profile
