
A 360-degree View of the Customer – Finding Marketing Zen

In this post, we’ll explain what we mean by a 360-degree customer view, what it enables, and how we went about doing this at Blueshift. This post is the first in a multi-part series that looks at key innovations in the Blueshift platform.


The 90/20 Reality of Marketing and the Single Customer View

David Raab of the CDP Institute cites a recent survey showing that 90% of marketers think a unified multi-channel customer view is important, yet only 20% of them have such a view. Other studies, such as one performed by Gartner, find that even fewer brands (10%) have a 360-degree customer view.

For those of us on the “technical side” of the food chain (building software for marketers), this is not surprising; if anything, the real numbers are likely even lower than the stats suggest. Creating a Single Customer View is a hard problem to solve — people have been trying for a while, and often promising more than they can deliver. The fundamental goal is to provide customers with a unified and relevant experience across all channels, and to achieve this you need a 360-degree view of your customers in real time.

Traditional approaches to solving this problem, such as data warehouses and, later, data lakes, have come up short because they have not been able to either (a) collect the data or (b) organize it effectively in real time.


What is a 360-degree view of the customer?

The term 360-degree view of the customer is a catchy phrase. And the problem with catchy phrases is they are used as buzzwords, and once that happens, you really have to look carefully under the covers and beyond the hype.

Figure 1: 360-degree view of a customer

Sometimes referred to as a Single Customer View (SCV), a true 360-degree view of the customer is built on having several important types of information about customers/prospects for use in real-time:

Customer Submitted Data (typically captured in a CRM)

  • Customer attributes & demographics such as Name, Gender, Location, Birthday, etc. The data may be submitted by the customer using online forms or collected through other requests for information.
  • Opt-in and other communication choices.
  • Preference Centers built for a user to indicate preferences for brands, colors, categories, genres, and more.

Customer Transactions

  • Transactional data including purchase records, course completions, and lead submissions along with changes in transactions such as cancellations.
  • Subscription data such as enrollment, upgrades, downgrades, and cancellations.
  • Customer service data including trouble tickets submitted, resolved, and still outstanding.

Product Interactions and Behavioral Data (Observed data, gleaned by collecting the customer’s behavior)

  • Web and mobile behavioral data including page views, swipes, clicks, likes, and “add-to-list” actions.
  • Marketing interactions such as opens or clicks of emails or push notifications, and views and responses to ads from multiple channels.

Derived Information (gathered by analyzing the “metadata”/patterns of customer interactions across channels)

  • “Identity” of anonymous visitors to websites or apps inferred using web cookies or device IDs, combined with login or opt-in.
  • Location information inferred by mapping IP address or latitude/longitude data.
  • User affinity towards a category or brand, inferred through browsing and buying behaviors (beyond stated preferences).
  • Stage in customer journey derived from customer activity.
  • Lifetime attributes such as orders, visits, sessions etc.
  • The propensity to convert based on recent and lifetime activity.
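To make the four data types concrete, here is a toy Ruby sketch of what a single unified profile record might look like once everything is merged. All field names and values are hypothetical, not Blueshift's actual schema:

```ruby
# Illustrative 360-degree view of one customer, merging the four
# data types above into a single queryable record (hypothetical schema).
profile = {
  # Customer submitted data
  customer_id:  "cust_123",
  name:         "Jane Doe",
  email_opt_in: true,
  preferences:  { genres: ["mystery", "sci-fi"] },
  # Customer transactions
  orders: [{ id: "ord_9", total: 42.50, status: "completed" }],
  # Product interactions and behavioral data
  events: [{ type: "page_view", item: "sku_77", at: "2018-01-15T10:02:00Z" }],
  # Derived information
  derived: {
    category_affinity:     { "books" => 0.8 },
    journey_stage:         "repeat_buyer",
    conversion_propensity: 0.63
  }
}

# The value comes from being able to query this in real time, e.g.:
high_intent = profile[:derived][:conversion_propensity] > 0.5
puts high_intent
```

The point of the sketch is that submitted, transactional, behavioral, and derived data all hang off one identity, so a single lookup answers questions that would otherwise span several systems.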

It’s important to note that in today’s online world, the real value of this 360-degree view can only be realized if all these data types are indexed and query-able for use in real-time. The data must be usable.


Why does this matter?

Paraphrasing another quote from David Raab: quality data, and more specifically an accurate 360-degree view of the customer, is the fuel that drives effective marketing and provides customers with the best experiences. For organizations today, it provides the foundation for all customer-facing activities.

Figure 2: Foundational benefits of a 360-degree view of a customer

There are six essential benefits of having an accurate 360-degree view of the customer:

  1. Single Source of Truth
    Providing data access and integrity is fundamental to any organization’s success because it gives a single source of truth about your customers.
  2. Personalization and Segmentation
    Enabling dynamic personalization and segmentation of campaigns using multiple behavioral attributes collected in real-time makes campaigns more effective and relevant.
  3. Data-Driven Triggers
    With data-driven triggered events, companies automatically interact with customers in real-time to influence their decisions.
  4. Cross-Channel Engagement
    Simplifying the orchestration of cross-channel campaigns across multiple systems yields consistent and relevant engagement across all marketing channels.
  5. Compliance and Security
    A single source of truth makes it much easier to comply with rapidly changing regulations and practices around personally identifiable information and its protection, such as GDPR.
  6. Accurate Reporting
    Facilitating consistent and accurate reporting of activities and results.


Imperatives to building our 360-degree customer view

Even before Blueshift started building our 360-degree view, we stipulated the following key principles that were necessary for our view of the data to solve problems for the marketer:

Near real-time updates
Our customer view has to be updated almost instantaneously after any new interaction. We stipulated that this had to happen in near real-time because many marketing activities, such as campaign journeys, are triggered by customer activities, and personalization is far more effective in the context of recent activity.

Unified cross-channel identity
The data has to be query-able with various forms of identity ranging from customer ids, email addresses and Facebook IDs to mobile device tokens and cookies.

Open data schema
We recognized that every business has a different way of looking at data, and we needed an open schema to more easily ingest and work with multiple forms of data coming from multiple sources.

Flexibility in modeling the data
Each piece of data may have something to tell us about how the customer interacted with the brand, and our system needed to model this data into the 360-degree view. For instance, for a client in the hospitality industry, a customer might have multiple “events” corresponding to the same booking (book, check-in, check-out, survey completion), and additional events relating to other bookings. Our 360-degree view had to capture and store all of these events in the same context. Similarly, in a media business, customers might interact with content in different categories or from different authors. Here we had to model all of these interactions relative to the “catalog” of content or products for the media business.
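The hospitality example can be sketched in a few lines of Ruby. The event shapes are hypothetical, but they show how events stay grouped in their booking context:

```ruby
# Hypothetical events for one hospitality customer: several events
# belong to the same booking, plus one event for a second booking.
events = [
  { booking_id: "bk_1", type: "book",      at: "2018-03-01" },
  { booking_id: "bk_1", type: "check_in",  at: "2018-03-10" },
  { booking_id: "bk_1", type: "check_out", at: "2018-03-12" },
  { booking_id: "bk_1", type: "survey",    at: "2018-03-13" },
  { booking_id: "bk_2", type: "book",      at: "2018-04-02" }
]

# Model the events relative to their booking context
bookings = events.group_by { |e| e[:booking_id] }
puts bookings["bk_1"].map { |e| e[:type] }.join(" -> ")
```

Keeping all events for one booking in the same context is what lets a campaign reason about the whole stay (for example, only send a survey after check-out) instead of treating each event in isolation.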

Our technical challenges in building Blueshift’s 360-degree view while adhering to these core principles were in these four important areas:

  1. Gathering all the different pieces of disparate data about an individual from dozens of input sources and hundreds of events in each session.
  2. Resolving identity and stitching together all this loosely structured data in order to get an accurate view of the behavior of each individual.
  3. Building a single customer view from (1) and (2) in “real time” so that customer behavior can drive personalization and interactions across multiple channels.
  4. Maintaining data integrity and consistency across different systems (search, user store, data warehouse, data science, analytics).



The unified 360-degree view of a customer is a key foundational element needed to more effectively market to and interact with customers using artificial intelligence (AI) techniques. In our next post in this series, we will discuss how we went about building this single customer view in the Blueshift platform and the challenges we encountered.

For More Information
Read more about AI-powered marketing in our resources section.

This post was made possible through joint collaboration with Atri Chatterjee, Anuraj Pandey, and Cibin George.

Watch this rare webinar, hosted by Blueshift, with analysts from Forrester Research and VentureBeat about getting the most out of your customer data with AI.

Real-Time Segment

Behind-the-Scenes: Real-time segments with Blueshift

(Here is a behind the scenes look at the segmentation engine that powers Programmatic CRM.)

Real-time segmentation matters: Customers expect messages based on their most recent activity. Customers do not want reminders for products they may have already purchased or messages based on transient past behaviors that are no longer relevant.

However, real-time segmentation is hard: it means processing large amounts of behavioral data quickly, which requires a technology stack that can:

  • Process event & user attributes immediately, as they occur on your website or mobile apps
  • Track 360-degree customer profiles and deal with data fragmentation challenges
  • Scale underlying data stores to process billions of customer actions and support high write and read throughput.
  • Avoid time-consuming data modeling steps that require human curation and slow down on-boarding

Marketers use Blueshift to reach each customer as a segment-of-one, and deliver highly personalized messages across every marketing channel using Blueshift’s Programmatic CRM capabilities. Unlike previous-generation CRM platforms, segments in Blueshift are always fresh and updated in real-time, enabling marketers to respond to the perpetually connected customer in a timely manner. Marketers use the intuitive, easy-to-use segmentation builder to define their own custom segments by mixing and matching filters across numerous dimensions, including event behavioral data, demographic attributes, predictive scores, lifetime aggregates, catalog interactions, CRM attributes, and channel engagement metrics, among others.


Segments support complex filter conditions across numerous dimensions


Behind the scenes, Blueshift builds a continually changing graph of users and items in the catalog. The edges in the graph come from users’ behavior (or implied behavior); we call this the “interaction graph”. The interaction graph is further enriched by machine-learning models that add predicted edges and scores to the graph (if you liked item X, you may also like item Y) and expand user attributes through 3rd-party data sources (for example, given the first name “John”, we can infer with reasonable confidence that the gender is male).

Blueshift interaction graph

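A minimal Ruby sketch of such an interaction graph, with observed edges from behavior and a predicted edge added by a stand-in model. The structures, labels, and scores are illustrative, not Blueshift's internals:

```ruby
# Adjacency-list interaction graph: each user maps to a list of
# scored, labeled edges pointing at catalog items.
graph = Hash.new { |h, k| h[k] = [] }

# Helper to add a user -> item edge with an interaction label
add_edge = ->(user, item, label, score) {
  graph[user] << { item: item, label: label, score: score }
}

# Observed edges, derived from actual user behavior
add_edge.call("user_1", "item_X", :purchased, 1.0)
add_edge.call("user_1", "item_Y", :viewed,    0.4)

# Enrichment: a stand-in ML model adds a predicted edge
# ("if you liked item X, you may also like item Z")
add_edge.call("user_1", "item_Z", :predicted_like, 0.7)

recommended = graph["user_1"].select { |e| e[:label] == :predicted_like }
puts recommended.map { |e| e[:item] }.inspect
```

Because observed and predicted edges live in the same structure, a segment query can mix behavioral filters ("viewed item X") with model-derived ones ("likely to buy item Z") without touching a separate system.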

The segment service can run complex queries against the “interaction graph” like: “Female users that viewed ‘Handbags’ over $500 in the last 90 days, with lifetime purchases over $1,000, who have not used the mobile apps recently and have a high churn probability”, and return those users within a few seconds to a couple of minutes.

360-degree user profiles

For every user on your site/mobile app, Blueshift creates a user profile that tracks anonymous user behavior and merges it with their logged-in activities across devices. These rich user profiles combine CRM data, aggregate lifetime statistics, catalog-related activity, predictive attributes, campaign & channel activity and website / mobile app activity. The unified user profiles form the basis for segmentation. A segment query matches these 360 degree user profiles against the segment definition to identify the target set of users.


360-degree user profiles in Blueshift
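The matching step can be sketched as a set of predicates evaluated against profiles. The filters, fields, and thresholds below are made up for illustration:

```ruby
# A segment definition as named predicates over profile fields
# (hypothetical fields and thresholds, echoing the handbag example).
segment = {
  gender:             ->(p) { p[:gender] == "female" },
  lifetime_purchases: ->(p) { p[:lifetime][:purchases] > 1000 },
  churn_probability:  ->(p) { p[:predicted][:churn] > 0.7 }
}

# Two toy 360-degree profiles
profiles = [
  { id: 1, gender: "female",
    lifetime: { purchases: 1500 }, predicted: { churn: 0.8 } },
  { id: 2, gender: "female",
    lifetime: { purchases: 200 },  predicted: { churn: 0.9 } }
]

# A profile matches the segment when every predicate holds
matches = profiles.select { |p| segment.values.all? { |pred| pred.call(p) } }
puts matches.map { |p| p[:id] }.inspect
```

In the real engine each predicate is answered by the data store best suited to it (timeseries, counters, reverse index), but the logical shape of the query is the same.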

Multiple data stores (no one store to rule them all)
The segmentation engine is powered by several different data stores. A given user action or attribute that hits the event API is replicated across these data stores, including: timeseries stores for events, a relational database for metadata, in-memory stores for aggregated data & counters, key-value stores for user lookups, and a reverse index to search across any event or user attribute quickly. The segmentation engine is tuned for fast retrieval of complex segment definitions, compared to a general-purpose SQL-style database where joins across tables could take hours to return results. The segmentation engine leverages data across all these data stores to pull the right set of target users that match the segment definition.

Real-time event processing

Website & mobile apps send data to Blueshift’s event APIs via SDKs and tag managers. The events are received by API end-points and written to in-memory queues. The event queues are processed continuously and in order, and updates are made across multiple data stores (as described above). The user profiles and event attributes are updated continuously with respect to the incoming event stream. Campaigns pull the audience data just-in-time for messaging, which results in segments that are continuously updated and always fresh. Marketers do not have to worry about out-of-date segment definitions, and avoid the “list pull hell” of data-warehouse-style segmentation.
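A toy, single-threaded Ruby stand-in for this pipeline (the real system is distributed across multiple data stores): events land on an in-memory queue and are processed in order, continuously updating user profiles. Event shapes are illustrative:

```ruby
# In-memory event queue and a profile store with per-user defaults
event_queue = Queue.new
profiles    = Hash.new { |h, k| h[k] = { event_count: 0, last_event: nil } }

# Events arrive from the API endpoints and are enqueued
event_queue << { user: "u1", type: "page_view" }
event_queue << { user: "u1", type: "add_to_cart" }
event_queue << { user: "u2", type: "page_view" }

# Process the queue continuously, in arrival order
until event_queue.empty?
  event   = event_queue.pop
  profile = profiles[event[:user]]
  profile[:event_count] += 1
  profile[:last_event]   = event[:type]
end

# Campaigns read the profiles just-in-time, so they are always fresh
puts profiles["u1"][:event_count]
puts profiles["u1"][:last_event]
```

The key property is that profiles are updated as events stream in, so any campaign that reads a profile sees the customer's most recent behavior rather than a stale batch extract.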

Dynamic attribute binding

The segmentation engine further simplifies onboarding user or event attributes by removing the need to model (or declare) attribute types ahead of time. The segmentation engine dynamically assesses the type of each new attribute based on sample usage in real-time. For instance, an attribute called “loyalty_points” with a value of “450” would be interpreted as a number (and show related numeric operators for segmentation), an attribute like “membership_level” with a value of “gold” would be interpreted as a string (and show related string comparison operators for segmentation), and an attribute like “redemption_at” with a value like “2016-09-23” would be interpreted as a timestamp (and show relative time operators).
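A hedged sketch of that inference step: the real engine is more involved, and these regexes are purely illustrative of how a sample value can suggest a type:

```ruby
# Infer an attribute's type from a sample string value
# (toy heuristic, not Blueshift's actual type-binding logic)
def infer_type(value)
  case value
  when /\A-?\d+(\.\d+)?\z/   then :number     # e.g. "450", "3.14"
  when /\A\d{4}-\d{2}-\d{2}/ then :timestamp  # e.g. "2016-09-23"
  else                            :string     # e.g. "gold"
  end
end

puts infer_type("450")         # number
puts infer_type("gold")        # string
puts infer_type("2016-09-23")  # timestamp
```

Once the type is bound, the UI can show the matching operators (numeric comparisons, string matches, relative-time windows) without any upfront schema declaration.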

Several Blueshift customers have thousands of CRM & event attributes, and are able to use these attributes without any data modeling or declaring their data upfront, saving them numerous days of implementing data schemas in SQL-based implementations.

The combination of 360-degree user profiles, real-time event processing, multiple specialized data stores, and dynamic attribute binding empowers marketers to create always-fresh, continuously updated segments.


Passing named arguments to Ruby Rake tasks using docopt for data science pipelines


Ever considered using rake for running tasks, but got stuck with the unnatural way rake passes in arguments? Or have you seen the fancy argument parsing that docopt and its kin can do for you? This article describes how we integrated docopt with the rake command so we can launch our data science pipelines using commands like this:

$ bundle exec rake stats:multi -- --sites=foo.com,bar.com \
--aggregates=pageloads,clicks \
--days-ago=7 --days-ago-end=0

This command would, for instance, launch daily aggregate computations for clicks and pageloads for each of the sites foo.com and bar.com, for each individual day in the last 7 days.

Not only can you launch your tasks using descriptive arguments, you get automated argument validation on top of it. Suppose we launch the task using the following command:

$ bundle exec rake stats:multi

Then the task would fail with the following help message:

Usage: stats:multi -- --sites=<sites> \
[--aggregates=<aggregates>] \
[ (--start-date=<date> --end-date=<date>) | \
(--days-ago=<n> [--days-ago-end=<n>]) ]
stats:multi -- --help

It will display the mandatory and/or optional arguments and the possible combinations (e.g. mutually exclusive arguments). And best of all, all you have to do to obtain this advanced validation is specify a string just like the one you see here: indeed, docopt uses your specification of the help message to parse and validate your arguments!

The remainder of this post will explain how to set this up yourself and how to use it. This guide assumes you have successfully configured your system to use Ruby, Rails, and the bundle command. There are guides available on how to set up RVM and get started with Rails.

Configuring your Rails project to use docopt

docopt is an argument parsing library available for many different languages. For more details, on what it does, have a look at the documentation here. We use it as a Ruby gem. You can simply add it to your project by editing your Gemfile in your project root by adding:

gem 'docopt', '0.5.0'

Then run

$ bundle install

in your project directory. This should be sufficient to make your project capable of using the docopt features.

Anatomy of the argument specification string

First, we should elaborate a bit on how docopt knows what to expect and how to parse/validate your input. To make this work, you are expected to present docopt with a string that follows certain rules. As mentioned above, this is also the string that is shown as the help text. More specifically, it expects a string that follows this schema:

Usage: #{program_name} -- #{argument_spec}
#{program_name} -- --help

where program_name is the name of the command being run, -- (double dash) is required not by docopt but by rake (more on that in a moment), and argument_spec can be anything you want to put there.

Let’s look at the aforementioned example:

Usage: stats:multi -- --sites=<sites> \
[--aggregates=<aggregates>] \
[ (--start-date=<date> --end-date=<date>) | \
(--days-ago=<n> [--days-ago-end=<n>]) ]

Here, the program_name is stats:multi (the actual namespace and task name of our rake task), followed by the solo ‘--’, and the argument_spec is "--sites=<sites> [--aggregates=<aggregates>] [ (--start-date=<date> --end-date=<date>) | (--days-ago=<n> [--days-ago-end=<n>]) ]"

Now, let’s go into details of the argument_spec (split over multiple lines for readability):

--sites=<sites> \
[--aggregates=<aggregates>] \
[ (--start-date=<date> --end-date=<date>) | \
(--days-ago=<n> [--days-ago-end=<n>]) ]

Basic rules

docopt considers arguments mandatory unless they are enclosed in brackets [], in which case they are optional. So in this example, only --sites is required. It also requires a value, given that it is followed by =. The placeholder <sites> could be anything, and is used to give the user an idea of the expected type of argument. If you enter --sites on the input without specifying a value, docopt will return an error that the value is missing. No effort needed on your end!

Optional arguments

The next argument, [--aggregates=<aggregates>], follows the same pattern, except that this one is fully optional. In our code, we will handle the case where it is not specified and fall back to default values.

Grouping and mutual exclusion

The last, optional, argument group is used to specify the dates we want to run our computation for, and we want three ways of doing this:

  • EITHER by specifically telling the start-date AND end-date
  • OR by specifying the number of days-ago before the time of the command, taking the date the command is run as the end date (e.g. 7 days ago until now)
  • OR by specifying the number of days-ago until days-ago-end (e.g. to backfill something between 14 days ago and 7 days ago).

Here is where complicated things can be achieved in a simple manner. The format we used for this is in fact:

[ ( a AND b ) | ( c [ d ] ) ]

docopt requires all arguments in a group (...) to be present on the input. If only a or only b is given, it will return and inform us about the missing argument.

Similarly, a logical OR can be added via | (pipe). This will make either of the options a valid input.

Furthermore, you can combine optional arguments within a group, like we did with ( c [ d ] ). This makes the parser aware that d ([--days-ago-end=<n>] in the real example above) is only valid when c (--days-ago=<n> in the example) has been provided. Trying to use this parameter with --start-date will result in an error.

Note that this whole complex group is optional; again, we will provide defaults in the code that handles the parsed arguments.


Lastly, it’s noteworthy that flags (i.e. arguments that don’t take a value), such as --force, will result in a true/false value after parsing.

For more information and examples, consult the docopt documentation here. The explanation above should already get you a long way, however.


Now that you understand how the argument string defines how your input will be parsed (or errored out), we can write a class that wraps all of this functionality together, exposing only the specification string and returning a map with the parsed values.

To this end, we wrote the RakeTaskArguments.parse_arguments method:

class RakeTaskArguments
  def self.parse_arguments(task_name, argument_spec, args)
    # Set up the docopt string that will be used to parse the input
    doc = <<DOCOPT
Usage: #{task_name} -- #{argument_spec}
       #{task_name} -- --help
DOCOPT
    # Prepare the return value
    arguments = {}
    # Because the new version of rake passes the '--' along in the
    # args variable, we need to filter it out if it's present
    args.delete_at(1) if args.length >= 2 and args.second == '--'
    begin
      # Have docopt parse the provided args (via :argv) against
      # the doc spec
      Docopt::docopt(doc, {:argv => args}).each do |key, value|
        # Store key/value, converting '--key' into 'key' for
        # accessibility. Per the docopt spec, the key '--' contains
        # the actual task name as its value, so we label it accordingly
        arguments[key == "--" ? 'task_name' : key.gsub('--', '')] = value
      end
    rescue Docopt::Exit => e
      # Invalid or missing arguments: print the usage/help text
      abort(e.message)
    end
    return arguments
  end
end
The method takes 3 arguments:

  • the rake task_name that we want to execute
  • the argument_spec we discussed before
  • the args, i.e. the actual input that was provided when launching the task and that should be validated.

Parsing and validation happen magically via

Docopt::docopt(doc, {:argv => args})

which returns a map of the keys and values for our input arguments. We iterate over the key-value pairs and strip the leading -- (double dash, e.g. --sites) from the keys so we can access them later via their plain name (e.g. ...['sites'] instead of ...['--sites']), which is just more practical to deal with.

The solo -- (double dash) that keeps coming back

We keep seeing this solo -- floating around in the strings, like stats:multi -- --sites=<sites>. As was pointed out on StackOverflow, this is needed to make the actual rake command stop parsing arguments. Indeed, without adding this -- immediately after the rake task you want to execute, rake would consider the subsequent arguments to be related to rake itself. Therefore, we also have it in our docopt spec

#{task_name} -- #{argument_spec}

so that the library knows to expect it. It is inconvenient, but once you get used to it, the benefits far outweigh the quirk.

WARNING: In rake version 10.3.x, this -- was not passed along in the ARGV list, but newer versions (10.4.x) DO pass it along. Therefore we added the following code:

args.delete_at(1) if args.length >= 2 and args.second == '--'

which removes this item from the list before we pass it to docopt. Note that this line removes the second element of the list, as the first element is always the program name.

Rake task with named arguments

Once you have the docopt gem installed and the RakeTaskArguments class (from RakeTaskArguments.rb) available in your project, we can specify the following demo rake task:

namespace :stats do
  desc "Compute pageload statistics for a list of sites and a "\
       "given window of time"
  task :pageloads, [:params] => :environment do |t, args|
    # Parse the arguments, either from ARGV in case of direct
    # invocation or from args[:params] in case it was called
    # from another rake task
    parameters = RakeTaskArguments.parse_arguments(t.name,
      "--sites=<sites> [ (--start-date=<date> --end-date=<date>) | "\
      "(--days-ago=<n> [--days-ago-end=<n>]) ]",
      args[:params].nil? ? ARGV : args[:params])
    # Get the list of sites
    sites = parameters["sites"]
    # Validate and process the start and end date input
    start_date, end_date = RakeTaskArguments.get_dates_start_end(
      parameters["start-date"], parameters["end-date"],
      parameters["days-ago"], 0, parameters["days-ago-end"])
    # For each of the sites
    sites.split(',').each do |site|
      # Pretend to do something meaningful
      puts "Computing pageload stats for site='#{site}' "\
           "for dates between #{start_date} and #{end_date}"
    end # End site loop
  end
end
This basic rake task follows a simple and straightforward template. First, however, we need to understand how this task gets its input arguments. As briefly mentioned before, this task receives its input in Ruby’s ARGV variable. However, when a rake task calls another rake task, this variable might not contain the correct information. Therefore we enable parameter passing into the task by defining the following header:

task :pageloads, [:params] => :environment do |t, args|

This way, IF this task was called from another task, args will contain a field called :params holding the arguments that the parent task passed along. A detailed example follows later on. This matters because we decide at runtime what input to pass to the argument validation. So, to pass the input for validation, we just call

parameters = RakeTaskArguments.parse_arguments(t.name,
  "--sites=<sites> [ (--start-date=<date> --end-date=<date>) | "\
  "(--days-ago=<n> [--days-ago-end=<n>]) ]",
  args[:params].nil? ? ARGV : args[:params])

This command passes the task_name (via t.name), the argument specification and the input (either via ARGV or args[:params]) for validation to docopt. At this point, you are guaranteed that the parameters return value contains everything according to the schema you specified, or your code has already errored out at this point.

If you then want to access some of the variables, you can simply use

sites = parameters["sites"]
start_date, end_date = RakeTaskArguments.get_dates_start_end(
parameters["start-date"], parameters["end-date"],
parameters["days-ago"], 0, parameters["days-ago-end"])

This last call sets up a start date and an end date based on validation and/or defaults implemented in a method that is not covered in this article. The code is available on GitHub, though.

Rake task calling other rake tasks

Finally, we cover the case where a meta-task invokes other tasks (in case you want to group certain computations). As mentioned above, this affects how the arguments get passed into the task. Consider the following meta-task:

desc "Run multiple aggregate computations for a given "\
     "list of sites and a given window of time"
task :multi, [:params] => :environment do |t, args|
  # Parse the arguments, either from ARGV in case of direct
  # invocation or from args[:params] in case it was called
  # from another rake task
  parameters = RakeTaskArguments.parse_arguments(t.name,
    "--sites=<sites> [--aggregates=<aggregates>] "\
    "[ (--start-date=<date> --end-date=<date>) "\
    "| (--days-ago=<n> [--days-ago-end=<n>]) ]",
    args[:params].nil? ? ARGV : args[:params])
  # Get the list of sites
  sites = parameters["sites"]
  # Just for demo purposes; you would normally fetch this elsewhere
  available_aggregates = ["pageloads", "clicks"]
  # Fetch the list of aggregates, falling back to all available
  # aggregates when none were specified
  aggregates = parameters["aggregates"].nil? ?
    available_aggregates.join(",") :
    parameters["aggregates"]
  # Validate and process the start and end date input
  start_date, end_date = RakeTaskArguments.get_dates_start_end(
    parameters["start-date"], parameters["end-date"],
    parameters["days-ago"], 0, parameters["days-ago-end"])
  # For each of the sites
  sites.split(',').each do |site|
    # For each of the aggregates
    aggregates.split(',').each do |aggregate|
      # Prepare an array with values to pass to the sub rake-tasks
      parameters = []
      parameters.push("--sites=#{site}") # just one single site
      self.execute_rake("stats", aggregate, parameters)
    end # End aggregate loop
  end # End site loop
end

The template used in this task is very similar to that of a simple rake task. The main difference is the added list of aggregates you can specify on the input, which can be validated against certain allowed values (again, out of the scope of this article). The :multi task then calls the appropriate tasks with the given parameters.

What’s new here is the way the meta-task calls the other tasks:

# Prepare an array with values to pass to the sub rake-tasks
parameters = []
parameters.push("--sites=#{site}") # just one single site
self.execute_rake("stats", aggregate, parameters)

Basically, we construct a list of arguments that emulates the input as if it had been provided on the command line. We then call the other rake task using the following helper function:

# Helper method for invoking and re-enabling rake tasks
def self.execute_rake(namespace, task_name, parameters)
  # Invoke the actual rake task with the given arguments
  Rake::Task["#{namespace}:#{task_name}"].invoke(parameters)
  # Re-enable the rake task in case it is being invoked again
  # with different parameters (e.g. in a loop)
  Rake::Task["#{namespace}:#{task_name}"].reenable
end
As a rake task is generally intended to run only once, invoking it again would have no effect. But since we launch the same task with different parameters, we re-enable the task for execution. This helper function shields us from these technicalities: we just call it with the namespace, the task name, and the parameters. When Ruby calls .invoke(parameters) on a rake task, these parameters end up in the args[:params] field we discussed before.


So, that concludes our extensive article on adding flexibility to the arguments you provide to rake. In the end, we covered:

  • How you can easily add docopt to a Rails project
  • What docopt argument specification strings look like and how they work
  • How you could write a wrapper class that encapsulates all that functionality
  • How you can plug this into simple rake tasks
  • How you can run meta rake tasks that call other tasks while keeping the flexibility of your input arguments

The full code and working examples of this article are available here on GitHub.

We hope this article helps you to get something like this set up for your own stacks as well, and that it increases your productivity. If you have any comments, questions or suggestions, feel free to let us know!


* The code for this article can be found on GitHub
* docopt.rb documentation
* rake documentation
* Brief mention of the double dash issue with Rake


How do you know if your model is going to work? Part 4: Cross-validation techniques

Concluding our guest post series!

Authors: John Mount (more articles) and Nina Zumel (more articles).

In this article we conclude our four part series on basic model testing.

When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this concluding Part 4 of our four part mini-series “How do you know if your model is going to work?” we demonstrate cross-validation techniques.

Previously we worked on:

Cross-validation techniques

Cross-validation techniques attempt to improve statistical efficiency by repeatedly splitting data into train and test sets and re-performing the model fit and model evaluation.

For example, the variation called k-fold cross-validation splits the original data into k roughly equal-sized sets. To score each set, we build a model on all data not in the set and then apply that model to the set. This means we build k different models (none of which is our final model, which is traditionally trained on all of the data).

Notional 3-fold cross validation (solid arrows are model construction/training, dashed arrows are model evaluation).

This is statistically efficient as each model is trained on a 1-1/k fraction of the data, so for k=20 we are using 95% of the data for training.
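The k-fold scheme described above can be sketched in a few lines. This is a minimal Python illustration (not code from the article); the `fit` and `score` callables are hypothetical stand-ins for whatever modeling procedure and evaluation metric you use:

```python
import random

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k roughly equal-sized folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, k, fit, score):
    """Score each fold with a model trained on all the other folds."""
    scores = []
    for fold in kfold_indices(len(data), k):
        holdout = set(fold)
        train = [data[i] for i in range(len(data)) if i not in holdout]
        test = [data[i] for i in fold]
        model = fit(train)   # one of k models; none of them is the final model
        scores.append(score(model, test))
    return scores
```

Note that leave-one-out is just the special case k = len(data).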

Another variation, called “leave one out” (essentially jackknife resampling), is very statistically efficient, as each datum is scored by a unique model built using all other data. However, it is very computationally inefficient, as you must construct a very large number of models (except in special cases such as the PRESS statistic for linear regression).

Statisticians tend to prefer cross-validation techniques to test/train split as cross-validation techniques are more statistically efficient and can give sampling distribution style distributional estimates (instead of mere point estimates). However, remember cross validation techniques are measuring facts about the fitting procedure and not about the actual model in hand (so they are answering a different question than test/train split).

There is some attraction to actually scoring the model you are going to turn in (as is done with in-sample methods and test/train split, but not with cross-validation). The way to remember this is: bosses are essentially frequentist (they want to know that their team and procedure tend to produce good models) and employees are essentially Bayesian (they want to know that the actual model they are turning in is likely good; see here for how the nature of the question you are trying to answer controls whether you are in a Bayesian or frequentist situation).

Read more.



How do you know if your model is going to work? Part 3: Out of sample procedures

Continuing our guest post series!

Authors: John Mount (more articles) and Nina Zumel (more articles).

When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this Part 3 of our four part mini-series “How do you know if your model is going to work?” we develop out of sample procedures.

Previously we worked on:

  • Part 1: Defining the scoring problem
  • Part 2: In-training set measures

Out of sample procedures

Let’s try working “out of sample” or with data not seen during training or construction of our model. The attraction of these procedures is they represent a principled attempt at simulating the arrival of new data in the future.

Hold-out tests

Hold-out tests are a staple for data scientists. You reserve a fraction of your data (say 10%) for evaluation and don’t use that data in any way during model construction and calibration. There is the issue that the test data is often used to choose between models, but that should not cause too much data leakage in practice. However, there are procedures to systematically abuse easy access to test performance in contests such as Kaggle (see Blum and Hardt, “The Ladder: A Reliable Leaderboard for Machine Learning Competitions”).

Notional train/test split (first 4 rows are training set, last 2 rows are the test set).
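The hold-out procedure can be sketched as follows. This is a minimal Python illustration (not code from the article); the 10% test fraction and the seed are arbitrary choices:

```python
import random

def train_test_split(data, test_fraction=0.1, seed=42):
    """Reserve a fraction of rows for evaluation; never touch them while fitting."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(round(len(data) * test_fraction)))
    test_idx = set(idx[:n_test])
    train = [row for i, row in enumerate(data) if i not in test_idx]
    test = [row for i, row in enumerate(data) if i in test_idx]
    return train, test
```

The key discipline is that the returned test rows are only ever passed to the scoring step, never to model construction or calibration.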

The results of a test/train split produce graphs like the following:



The training panels are the same as we have seen before. We have now added the upper test panels. These are where the models are evaluated on data not used during construction.

Notice that on the test graphs random forest is the worst (for this data set, with this set of columns, and this set of random forest parameters) of the non-trivial machine learning algorithms on the test data. Since the test data is the best simulation of future data we have seen so far, we should not select random forest as our one true model in this case, but instead consider GAM logistic regression.

We have definitely learned something about how these models will perform on future data, but why should we settle for a mere point estimate? Let’s get some estimates of the likely distribution of future model behavior.

Read more.



How do you know if your model is going to work? Part 2: In-training set measures

Continuing our guest post series!

Authors: John Mount (more articles) and Nina Zumel (more articles).

When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this Part 2 of our four part mini-series “How do you know if your model is going to work?” we develop in-training set measures.

Previously we worked on:

  • Part 1: Defining the scoring problem

In-training set measures

The most tempting procedure is to score your model on the data used to train it. The attraction is this avoids the statistical inefficiency of denying some of your data to the training procedure.

Run it once procedure

A common way to assess score quality is to run your scoring function on the data used to build your model. We might try comparing several models scored by AUC or deviance (normalized to factor out sample size) on their own training data as shown below.



What we have done is take five popular machine learning techniques (random forest, logistic regression, gbm, GAM logistic regression, and elastic net logistic regression) and plotted their performance in terms of AUC and normalized deviance on their own training data. For AUC, larger numbers are better; for deviance, smaller numbers are better. Because we have evaluated multiple models, we are starting to get a sense of scale. We should suspect an AUC of 0.7 on training data is good (though random forest achieved a training AUC of almost 1.0), and we should be acutely aware that evaluating models on their own training data has an upward bias (the model has seen the training data, so it has a good chance of doing well on it; put another way, training data is not exchangeable with future data for the purpose of estimating model performance).
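For concreteness, AUC can be computed directly from labels and scores via the Mann-Whitney formulation: the probability that a randomly chosen positive example outscores a randomly chosen negative one, counting ties as half. A minimal Python sketch (not code from the article; assumes binary 0/1 labels and numeric scores):

```python
def auc(labels, scores):
    """P(random positive outscores random negative), ties counting half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A constant score gives AUC 0.5 (the null model), and perfect separation of positives from negatives gives AUC 1.0.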

There are two more Gedankenexperiment models that any data scientist should always have in mind:

  1. The null model (on the graph as “null model”). This is the performance of the best constant model (a model that returns the same answer for all datums). In this case it is a model that scores each and every row as having an identical 7% chance of churning. This is an important model that you want to do better than. It is also a model you are often competing against as a data scientist, as it is the “what if we treat everything in this group the same” option (often the business process you are trying to replace). The data scientist should always compare their work to the null model on deviance (null model AUC is trivially 0.5), and packages like logistic regression routinely report this statistic.
  2. The best single variable model (on the graph as “best single variable model”). This is the best model built using only one variable or column (in this case using a GAM logistic regression as the modeling method). This is another model the data scientist wants to outperform, as it represents the “maybe one of the columns is already the answer” case (if so, that would be very good for the business, as they could get good predictions without modeling infrastructure). The data scientist should definitely compare their model to the best single variable model. Until you significantly outperform the best single variable model, you have not outperformed what an analyst can find with a single pivot table.
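To make the null-model comparison concrete, here is a minimal Python sketch of normalized deviance for a constant-probability model, using the 7% churn rate from the example above (the data here is synthetic, purely for illustration):

```python
import math

def normalized_deviance(labels, probs):
    """Mean binomial deviance: -2/n times the Bernoulli log-likelihood."""
    ll = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
             for y, p in zip(labels, probs))
    return -2.0 * ll / len(labels)

# Synthetic outcomes with a 7% positive (churn) rate.
labels = [1] * 7 + [0] * 93
base_rate = sum(labels) / len(labels)  # 0.07

# The null model assigns every row the same base rate.
null_dev = normalized_deviance(labels, [base_rate] * len(labels))
```

Any candidate model should achieve a deviance noticeably below this null-model baseline before it is worth further consideration.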

At this point it would be tempting to pick the random forest model as the winner as it performed best on the training data. There are at least two things wrong with this idea:

Read more.



How do you know if your model is going to work? Part 1: The problem

This month we have a guest post series from our dear friend and advisor, John Mount, on building reliable predictive models. We are honored to share his hard won learnings with the world.

Authors: John Mount (more articles) and Nina Zumel (more articles) of Win-Vector LLC.

“Essentially, all models are wrong, but some are useful.”
George Box

Here’s a caricature of a data science project: your company or client needs information (usually to make a decision). Your job is to build a model to predict that information. You fit a model, perhaps several, to available data and evaluate them to find the best. Then you cross your fingers that your chosen model doesn’t crash and burn in the real world.

We’ve discussed detecting if your data has a signal. Now: how do you know that your model is good? And how sure are you that it’s better than the models that you rejected?

Bartolomeu Velho 1568
Geocentric illustration Bartolomeu Velho, 1568 (Bibliothèque Nationale, Paris)


Notice the Sun in the 4th revolution about the earth. A very pretty, but not entirely reliable model.

In this latest “Statistics as it should be” series, we will systematically look at what to worry about and what to check. This is standard material, but presented in a “data science” oriented manner, meaning that we are going to consider scoring-system utility in terms of service to a negotiable business goal (one of the many ways data science differs from pure machine learning).

To organize the ideas into digestible chunks, we are presenting this article as a four part series (to be finished over the next 3 Tuesdays). This part (part 1) sets up the specific problem.

Read more.