Fun with Decision Trees using R and SQL Server

As those who have been reading this blog know, I’ve recently been spending a lot of time doing statistical analysis with R and SQL Server, and I’ve really been enjoying digging into the various bits of functionality that are offered.

One thing I’ve learned over the years is that I am a very “classification-oriented” individual when it comes to working with data. I work best with data sets that I understand and that I’ve been able to sort into the various buckets that make sense to me based on what I’m trying to accomplish with the data.

The problem with this need for nice and tidy data classification is that it doesn’t work well when you don’t really have a complete understanding of the data set you’re working with. This especially becomes an issue if you’re trying to mine the data you have in order to predict future outcomes based on past performance (a requirement that is becoming more and more important to “Data Scientists” as more and more organizations make the shift to true data-driven processes).

Understanding Data

If you read my previous post on creating a large SQL Server database, you’ve seen some of the data that I am playing around with. Obviously there is a lot of interesting information locked within the mortgage loan application data stored in the HMDA database. One specific use case for this data might be to look at some of the deciding factors related to home loans being granted and create some predictive models based on the data. For example, I might want to predict the outcome of a home loan application based on factors like the purpose of the loan, the income of the applicant, and other factors such as race. This could be very useful for a mortgage company deciding who to target for an ad campaign, or for researching questions such as whether race or sex have any correlation to the amount of a home loan granted. In order to answer this use case, we need to have a good understanding of the data we’re working with and see what it looks like when shaped into a “Decision Tree” (for more information on what exactly a decision tree is, take a look at: http://en.wikipedia.org/wiki/Decision_tree ).

Creating a Data Set to Work With

If we wanted to use the home mortgage disclosure data to look at the past performance of home loans, the first thing we need to do is create a manageable data set from which to build a model. Since I live in El Paso County, Colorado, I’ll create a table of data that just details information on home loans in this area. Given the database that was created earlier (see previous post), we can create a subset of the data with the following query:

SELECT
    loan_purpose AS [purpose]
    ,loan_amount AS [amount]
    ,CASE WHEN applicant_race = 'White' THEN applicant_race ELSE 'Non-White' END AS [race]
    ,applicant_sex AS [sex]
    ,applicant_income AS [income]
    ,CASE WHEN denial_reason = 'Approved' THEN denial_reason ELSE 'Denied' END AS [status]
INTO
    tblElPasoCountyLoanStatus
FROM
    vColoradoLoans
WHERE
    county = 'El Paso County'


 



What this will do is create a simple table that has 6 columns. I’ve simplified the data slightly so that some of the factors (race, denial reason) are two-valued categories rather than the full range of values found in the source data. While this is not a necessary transformation for the most part, it helps simplify the output for the purposes of this discussion.



Loading the Appropriate Packages and Libraries in R



Once I’ve created the table above in SQL Server, I can load the data into R and begin the analysis. Before we get too deep into the creation of the decision tree, I should mention that I am going to use a package called “rpart” as well as a package called “rpart.plot”. If you are really curious and would like to know the science behind the rpart package, there is a very dry document here that explains the algorithm in detail: http://www.mayo.edu/hsr/techrpt/61.pdf



The rpart package can be found here: http://cran.stat.ucla.edu/web/packages/rpart/index.html



and rpart.plot can be found here: http://cran.stat.ucla.edu/web/packages/rpart.plot/index.html 



Of course you don’t really need to know exactly where the packages live; you can install them directly from the R console as follows:



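Something along these lines does the trick:

# Install the rpart and rpart.plot packages from CRAN
install.packages("rpart")
install.packages("rpart.plot")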



When you execute the above commands in the R console, you will be prompted for the mirror site you wish to use, and the packages will be downloaded, unpacked and installed. Once the packages are installed, you can load them into your R environment as follows:



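Loading them is just a matter of a library() call for each package:

# Load the decision tree and plotting packages
library(rpart)
library(rpart.plot)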





Connecting to SQL Server Data





Once the libraries are loaded, we need to obtain the data we’re going to work with. If you haven’t worked with SQL Server data in R before, you might want to read my previous post on connecting R to SQL Server via the ODBC library. First we need to set up an ODBC channel to connect to the table we created above. This is done via the following command:



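Assuming the RODBC package from that earlier post is installed and loaded, opening the connection looks roughly like this (I’m calling the connection object “channel” here):

# Open an ODBC connection using the HMDAData DSN
library(RODBC)
channel <- odbcConnect("HMDAData")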



(Remember that “HMDAData” is the name of the ODBC DSN I created to connect to my database)



Once the channel is created, I can load the data from the table via the sqlFetch command in R as follows:



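sqlFetch takes the open channel and a table name and returns the contents of the table as a data frame, so the call is roughly:

# Pull the El Paso County loan table into a data frame named loanstatus
loanstatus <- sqlFetch(channel, "tblElPasoCountyLoanStatus")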



Examining the Data



This loads the data from the table into the R variable “loanstatus”. You can view a summary of the loanstatus data frame as follows:



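The built-in summary() function does the job:

# Summarize every column: counts for the categorical factors,
# quartiles and means for the numeric amount and income columns
summary(loanstatus)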







Already you can see that we’ve extracted some good information out of the database. (As a side note here, I think this is VERY COOL! Think of all the SQL queries I’d have to run to get this same information from the table I created earlier.)



Now that we have the data in an R variable, we can start to create a decision tree. Since we ultimately want to use this data to predict the amount of a loan based on certain factors such as race, sex and income, we’ll create a regression tree shaped to those parameters. We will use the rpart function to create the tree. rpart is a very simple function that accepts the following parameters:




  • formula – The formula is specified in the following format: outcome ~ predictor1 + predictor2 + predictor3 etc.


  • data – The specific data frame to use


  • method – “class” for a classification tree or “anova” for a regression tree



Remember that within R, you can always type ? <function> to get a full description of a particular function.



In our case, the rpart command would look like this:



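Roughly speaking, the call uses amount as the outcome and the other columns as predictors, with method="anova" since we are predicting a continuous value (the exact predictor list is the part you would tune):

# Build a regression tree that predicts loan amount
tree <- rpart(amount ~ purpose + race + sex + income,
              data = loanstatus,
              method = "anova")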



Once we’ve created the tree variable (this can be named anything; I just kept it simple and named it “tree” here), we can look at a summary and determine what it looks like:



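summary() works on rpart objects too, printing each node along with its split and complexity parameter:

# Print the full tree: splits, node statistics and cp values
summary(tree)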



(there are a total of 13 nodes in my specific example, and I can’t paste the entire tree here in text form)



One big factor in determining how useful a particular decision tree is going to be is to examine the “complexity parameter” (“cp” for short) for each of the nodes. The cp is used to cross-validate the data and make sure that the tree is pruned to remove the portions of the tree that are “overfitted” (you can read about the terminology here: http://www.statmethods.net/advstats/cart.html ). Since I am a very visual person, I like to see this cross-validation data in chart form as follows:



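The rpart package includes plotcp() for exactly this:

# Plot the cross-validated error against the complexity parameter
plotcp(tree)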



This generates a nice graph showing me the relative size of the tree, the cross-validated error that is generated, and the resulting cp:



[Plot: cross-validated relative error versus complexity parameter (cp) and tree size]



Since the tree that I am working with is relatively small, I am not going to worry about pruning it here and removing the “bad” cp values. To plot the decision tree, use the following command:



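This is where the rpart.plot package earns its keep; its plotting function takes the tree plus the display options described below:

# Draw the tree: type=4 labels all nodes, extra=1 adds the
# number of observations falling into each node
rpart.plot(tree, type = 4, extra = 1)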



(Remember that you can use the ? command to get a full listing of all options for a given function. In this case, by using type=4 I am instructing R to generate a plot containing all nodes and all information, and by using extra=1 I am instructing R to include the number of observations in each node.)



[Decision tree plot: root split on income at roughly $90K, with further splits on loan purpose and income]



With this visual of the tree, I can see that income seems to be a deciding factor, and it splits at approximately 90,000. Following the tree to the left, for those with less than 90K income, we see a split for Home Improvement loans versus Home Purchase and refinance. For the purchase and refinance, we see another split at approximately 52K income. Back on the right side of the tree we see a split at approximately 208K income, with the same split for home improvement loans versus purchase and refinance.



Conclusion



Being the data geek that I am, I could continually refine this model and graph it to start finding the patterns and determining just exactly how the data is shaped. At that point I could feed additional data into the model and use it to predict outcomes based on past performance. There are many things that can be done with Decision Trees and I’ve only scratched the surface here with this post. I’ll be writing many more of these posts in the future as I continue to explore the magic of data mining with R.
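As a quick teaser of where that goes: scoring new data against an rpart model is just a call to predict(). A minimal sketch, reusing a row of the existing data as a stand-in for a new loan application:

# Take the predictor columns from an existing row as a stand-in for a
# brand new application, then predict the loan amount for it
newloan <- loanstatus[1, c("purpose", "race", "sex", "income")]
predict(tree, newdata = newloan)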

2011 eDiscovery Year End Wrap-up

It has certainly been a banner year in eDiscovery.  Judge Scheindlin kicked things off with a bang with her decision in National Day Laborer Organizing Network v. U.S. Immigration and Customs Enf. Agency,[1] holding that the federal government must include metadata in Freedom of Information Act (FOIA) productions because certain key metadata fields are an integral part of public records.  This ruling struck fear into every government agency and would have created the need for massive changes to the way they kept and produced records.  However, Judge Scheindlin withdrew the opinion in June, explaining that, “as subsequent submissions have shown, that decision was not based on a full and developed record.”  She further stated that “[b]y withdrawing the [previous] decision, it is the intent of this Court that the decision shall have no precedential value in this lawsuit or any other lawsuit.”  I guess we are left to draw our own conclusions from that statement.

2011 also saw the rise in importance of machine-based classification and coding.  This was emphasized by the keynote speech given by Judge Andrew Peck at the Carmel Valley eDiscovery Retreat in July.  He gave what is considered one of the clearest statements yet by a judge that the use of new technology like predictive coding is an acceptable way to conduct search in appropriate civil litigation cases.  He memorialized his thoughts in a well-written article a few months later.[2]

Issues about managing eDiscovery in the cloud and for social media gained attention due to cases such as Katiroll Co., Inc. v. Kati Roll and Platters, Inc.[3] (the court ordered an individual defendant to re-post a previous Facebook profile picture to allow the plaintiff to print any posts it felt were relevant), Offenback v. L.M. Bowman, Inc.[4] (the judge ordered production of photographs and postings on Facebook), Piccolo v. Patterson[5] (a motion to compel the plaintiff to accept a defense attorney’s “friend request” was denied) and Zimmerman v. Weis Markets, Inc.[6] (the plaintiff was ordered to provide all passwords and usernames for Facebook and MySpace to the defendant and to preserve all existing information). The solutions to address these challenges will undoubtedly gain mindshare in the new year.

Some of the dominant points of contention in 2011 were around cost shifting and taxation of eDiscovery costs. The courts did not settle on a unanimous viewpoint regarding these concerns; however, a number of cases seem to indicate a shift toward taxing eDiscovery costs to the losing party.[7]

The top concerns in 2011, however, were around preservation and proportionality.  Those concerns were highlighted in the recent case of Pippins v. KPMG LLP.[8]  In this case, the plaintiffs, potential members of a nationwide FLSA collective and/or a putative New York State class, challenged KPMG’s treatment of accountants in its audit practice.  KPMG tried to narrow the scope of its preservation obligation to a random sample of 100 hard drives from among those it had already preserved for this and other litigation. Alternatively, KPMG sought to shift the cost of any preservation beyond the scope it had suggested because there were more than 7,500 potential opt-in plaintiffs to the FLSA collective nationwide. Not only did the Magistrate Judge order that KPMG must preserve the hard drives of computers used by anyone who might join in the action, he also denied its request for cost shifting.

Perhaps it is because of cases like these that the single biggest news this year is that the Advisory Committee on Civil Rules is considering amending the Federal Rules of Civil Procedure (FRCP) once again.   The process has energized many in the legal community to provide their input and influence in various forums to make sure their point of view is considered.[9] Some decision should come out of the committee early next year.

In all, with 2011 being such a bumper year for eDiscovery, 2012 will undoubtedly provide many new developments to keep eDiscovery professionals on their toes.


[1] 2011 WL 38162510 (S.D.N.Y. Feb. 7, 2011)

[3] 2011 WL 3583408 (D.N.J. Aug. 3, 2011)

[4] 2011 WL 2491371 (M.D. Pa. June 22, 2011)

[5] Case No. 2009-04979 (Pa. Com. Pl. Order of May 5, 2011)

[6] 2011 WL 2065410 (Pa. Com. Pl. May 19, 2011)

[7] Race Tires Am., Inc. v. Hoosier Racing Tire Corp., 2011 WL 1748620 (W.D. Pa. May 6, 2011); Jardin v. DATAllegro, Inc., 2011 WL 4835742 (S.D. Cal. Oct. 12, 2011); Promote Innovation LLC v. Roche Diagnostics Corp., 2011 WL 3490005 (S.D. Ind. Aug. 9, 2011)

[8] No. 11 Civ. 0377 (CM)(JLC), 2011 WL 4701849 (S.D.N.Y. Oct. 7, 2011)

[9] For more on this topic see: http://bit.ly/vDGhO3


To The Cloud and Back Again! – SQL Saturday # 104

For those who are not aware, the Professional Association for SQL Server (PASS) has chapters throughout the world that put on one-day events called “SQL Saturday”. As the name implies, these events take place on a Saturday and are generally a full day of targeted learning for those who want to know more about SQL Server and SQL Server technologies.

This year, the first US SQL Saturday event (there is also an event in Bangalore that same day, and given the time zones, I’d say they qualify as the “first” one of the year!) is happening right here in Colorado Springs! SQL Saturday #104 has a very distinguished list of speakers, including people like Jason Horner, TJ Belt, Chris Shaw, Thomas LaRock, Karen Lopez and a whole host of other very impressive speakers. There are going to be 5 simultaneous tracks, and somehow they even invited me to speak as well, so I’ll be speaking at 0830 in room #4 on “To the Cloud and Back Again!”.

Session Description

In this session, I’ll be introducing some basic cloud computing patterns and will talk about some specific cloud computing security concerns. I’ll then talk about some of the specific technologies that accompany the Windows Azure and SQL Azure platforms that enable a hybrid approach to cloud computing. I’ll demonstrate how Windows Azure roles can be “domain joined”, which then allows Azure-based applications to use SQL Server trusted connections to connect to on-premises SQL Server databases. All in all, I hope it will be a very informative session on cloud computing technologies. Hope to see you there!


Hartford SQL Server Meetup!

Chuck Boyce has created a SQL Server meetup.  He spun up the Philadelphia SSUG, and I’m hoping he can weave his magic for Hartford!

Our first event will be on January 24th at 7pm at the EMC office in Rocky Hill.  I’ll be presenting on storage basics for DBAs, including techniques to analyze and improve storage performance and availability, as well as various techniques for disaster recovery and application failover.  We’ll have a few giveaways and some food.

Make sure to RSVP here at meetup if you’d like to attend so we can make sure you get information about how to get there and into the building.  We look forward to seeing you there!

A New Year’s Resolution for Records Managers

President Obama recently challenged every agency and organization of the federal government to define what they were going to do to improve their records management programs to ensure federal records are retained, preserved and protected for later discovery and response to public inquiries.

I for one believe that the President’s challenge should be embraced by every professional records manager in the United States, if not the world.  Here is one national leader who openly acknowledges that records management within the US federal government is ripe for improvement.  Similar improvements are called for in the private, state and local government sectors of this country.  By and large, this profession is due for a significant shot in the arm, and this Presidential challenge is that opportunity, so let’s make the most of it.

We have all seen New Year’s resolutions for most everything one can imagine, so why haven’t we seen one for records management programs?

Well, no longer; let’s try this one:

“In 2012, I vow to work towards improving my organization’s ability to effectively and efficiently retain, preserve, protect and disposition our organization’s records in accordance with our approved retention schedule.”  I will accomplish this by employing some of the following steps:

  1. Expose what it is costing my organization today to do business without effective records management;
  2. Perform a prescriptive self-assessment of my organization’s current records management program against a commonly accepted set of criteria, such as ARMA International’s GARP (Generally Accepted Recordkeeping Principles);
  3. Develop a return-on-investment (ROI) analysis, business case and total-cost-of-ownership (TCO) estimate based upon the information uncovered in the first two steps, and develop an executive-level presentation which clearly and concisely articulates those savings for my organization’s executives;
  4. Work to build out our organization’s information governance program based upon where we discover we are today (following the self-assessment), not where we want to be.  Enhancements to our information governance program can and should take place as our program matures, and execution is facilitated by automation when and where appropriate;
  5. Begin immediately to limit the publication and processing of hard copy documents and presume that information should progress through its lifecycle in electronic form rather than paper, whenever practical and permitted by law; and
  6. Employ a tool like contextual analytics / file intelligence to analyze, index and categorize our organization’s information in various high-priority repositories so that decision makers know specifics about what information we have, in order to make the right decisions about what to do with it.  Decision makers must have this kind of detail in order to determine:
  • what information is a business record or not; if it is not a business record, we can likely defensively dispose of it in-place;
  • if it is a business record, and we already have a copy of it, we do not need to keep it and can defensively dispose of the duplicates in-place; and
  • if the business record is not a duplicate but has already satisfied its retention requirements, we can then defensively dispose of it in-place.

If a discovered business record is currently in an inadequately secured repository, we will then take this opportunity to move the business record to a secure repository, so that no unauthorized personnel (including the owner of the record) can change or delete the record until such time as it has satisfied its retention period.

So, think about this.  Maybe everything here is too much for you and your organization to take on in one year.  If that is the case, start at the top of the list and do what you can.  Just remember that old saying, “A decision not to decide is still a decision,” and one that has little to show for it.

Challenge yourself; that is the only way you are going to be able to improve things, records management-wise, within your organization.  Don’t forget ROI and TCO; they are your keys to successfully defending your efforts and the resources you will be asking your management to spend.

HAPPY NEW YEAR, and let me know if I can help you (I’ve successfully done this seven times in my 30+ years implementing records management programs in both the public and private sectors).