# Calculating mode in data mining projects

## Data mining

### Looking for something else?

Data Mining: Know it All

In this chapter from Data Mining: Know it All, learn learn how to calculate the average value or the number of records that represent certain events during the data rollup process.

### 2.5 Rollup with Sums, Averages, and Counts

In addition to finding the sum of a value variable during the rollup, it may also be more meaningful sometimes to calculate average value or the number of records that represent certain events -- for example, number of deposits, number of withdrawals, or number of mailings a customer received responding to an offer.

An introduction to data mining

Fielded applications of data mining and machine learning

The difference between machine learning and statistics in data mining

Information and examples on data mining and ethics

Data acquisition and integration techniques

What is a data rollup?

Calculating mode in data mining projects

Using data merging and concatenation techniques to integrate data

In our rollup macro, these requirements would alter only the middle part of our code, where we calculated the cumulative value of the Value variable. The following code segment would modify the macro to calculate the average value and the number of transactions for each account type instead of the total:

Step 2
data Temp1;
retain TOT 0;
retain NT 0;
set & TDS;
by & IDVar & TimeVar & TypeVar;
if fi rst. & TypeVar then TOT = 0;
TOT = TOT + & Value;
if & Value ne . then NT = NT + 1;
if last. & TypeVar then
do;
AVG = TOT/ NT;
output;
NT = 0;
end;
drop & Value;
run;

Furthermore, the code inside the %do loop should also reflect our interest in transposing the values of the average variable, _AVG. Therefore, the code will be as follows:

Step 4
%do i = 1 %to & N;
0proc transpose
data = Temp1
out = R & i
prefi x = %substr( & & Cat & i, 1, & Nchars) ;
by & IDVar & TypeVar;
ID & TimeVar;
var AVG;
where & TypeVar = " & & Cat & i " ;
run;
%end;

The complete code for the modified code to roll up the average value is included in the macro ABRollup() .

### 2.6 Calculation of the Mode

Another useful summary statistic is the mode, which is used in both the rollup stage and the event-driven architecture (EDA). The mode is the most common category of transaction. The mode for nominal variables is equivalent to the use of the average or the sum for the continuous case. For example, when customers use different payment methods, it may be beneficial to identify the payment method most frequently used by each customer.

The computation of the mode on the mining view entity level from a transaction dataset is a demanding task because we need to search for the frequencies of the different categories for each unique value of the entity variable. The macro shown in Table 2.5 is based on a classic SQL query for finding the mode on the entity level from a transaction table. The variable being searched is XVar, and the entity level is identified through the unique value of the variable IDVar:

%macro VarMode(TransDS, XVar, IDVar, OutDS);
/ * A classic implementation of the mode of transactional
data using SQL * /
proc sql noprint;
create table & OutDS as

Table 2.5 Parameters of VarMode ( ) Macro

 Header Parameter VarMode (TransDS, XVar, IDVar, OutDS) Description TransDS Input transaction dataset XVar Variable for which the mode is to be calculated IDVar ID variable OutDS The output dataset with the mode for unique IDs

SELECT & IDVar , MIN( & XVar ) AS mode
FROM (
SELECT & IDVar, & XVar
FROM & TransDS p1
GROUP BY & IDVar, & XVar
HAVING COUNT( * ) =
(SELECT MAX(CNT )
FROM (SELECT COUNT( * ) AS CNT
FROM & TransDS p2
WHERE p2. & IDVar = p1. & IDVar
GROUP BY p2. & XVar
) AS p3
)
) AS p
GROUP BY p. & IDVar;
quit;
%mend;

The query works by calculating a list holding the frequency of the XVar categories, identified as CNT, then using the maximum of these counts as the mode. The query then creates a new table containing IDVar and XVar where the XVar category frequency is equal to the maximum count, that is, the mode.

#### More on data mining:

The preceding compound SELECT statement is computationally demanding because of the use of several layers of GROUP BY and HAVING clauses. Indexing should always be considered when dealing with large datasets. Sometimes it is even necessary to partition the transaction dataset into smaller datasets before applying such a query to overcome memory limitations.

This was last published in May 2009

## Content

Find more PRO+ content and other member only offers, here.

#### Start the conversation

Send me notifications when other members comment.

## SearchDataManagement

• ### Microsoft SQL Server 2016 relational DBMS overview

Microsoft SQL Server 2016 for Windows comes in four editions, with updates that include a new stretch database feature, Polybase,...

• ### Amazon ushers in pay-by-query pricing with Athena data engine

Amazon's Athena data engine brings interactive SQL queries to S3. It's based on an open source framework called Presto that ...

• ### A look at the EnterpriseDB Postgres data management platform

The EDB Postgres platform, which is based on the PostgreSQL open source relational database, is offered as a subscription service...

## SearchAWS

• ### New AWS developer tools offer better app insights, control

New AWS developer tools and data efforts will help users who want more hands-on control, but enterprises should also expect a ...

• ### AWS bets on Chef automation tool with OpsWorks update

Amazon refreshed its Chef-based AWS OpsWorks offering and also has made a minority investment in Chef Software as part of the ...

• ### AWS enhancements beckon customers to Amazon cloud platform

AWS intends to make it easier than ever for an enterprise to use its platform -- on or off premises.

## SearchContentManagement

• ### Gemological Institute of America gives content automation high grades

The Gemological Institute of America used Quark to automate content, cutting the process by a third.

• ### Five steps for designing an effective BI dashboard

The BI dashboard can be a very useful tool for data engagement if the dashboard is designed and used effectively. Here are five ...

• ### Examining the top offerings in the WCM platform marketplace

Once you've decided that a WCM platform is right for your business, it's time to decide which to choose. Expert Geoffrey Bock ...

## SearchCRM

• ### AI, IoT, intelligent systems take center stage in 2017 technology trends

Experts held forth on the promise and pitfalls of technologies that are changing today's environment at the Gilbane conference.

• ### What contact center software to buy in 2017

If your contact center software needs updating, this tip will help you focus on some of the must-have technology to buy in 2017.

• ### Insightly adds CRM security to its product

With smaller security budgets and IT teams than enterprise businesses, small and medium-sized companies are more vulnerable to ...

## SearchOracle

• ### Oracle buys Dyn DNS house to up its cloud game

To flesh out its cloud product line, Oracle is adding Dyn's DNS services to offer similar internet capabilities as rivals AWS and...

• ### Database Performance Analyzer pumps up Bodybuilding.com's performance

Sean Scott, DBA for Bodybuilding.com, wanted a performance tool that ran like it was designed by a DBA. He chose SolarWinds DPA ...

• ### Oracle cloud architecture push spawns new tools, issues for users

The cloud is now Oracle's top strategic priority, and users have to decide if they're ready to migrate. This guide offers a ...

## SearchSAP

• ### Does SAP ONE Support Launchpad make SAP support any easier to use?

The new Fiori user experience makes it easier to access applications and support services, but product-specific support still ...

• ### At SME Summit, SAP Anywhere growing pains revealed

Attendees at an SAP SME Summit lauded the e-commerce front end, but some said back-end integration and payment-processing ...

• ### Planning a HANA big data strategy with SAP HANA Vora

SAP has worked hard to position HANA as a big data platform. To formulate a viable big data strategy, you need to know the tools,...

## SearchSQLServer

• ### How to get the most out of virtual SQL Server with Microsoft Hyper-V

SQL Server is a CPU-intensive technology, which can make it tricky to run in a virtualized environment. Keep your SQL Server ...

• ### Microsoft previews SQL Server on Linux, opens features across editions

Microsoft looks to broaden the horizons of SQL Server, as it moves some Enterprise features to Standard Edition and issues the ...

• ### Cask framework aims to speed Azure HDInsight data pipeline builds

A link between Cask Data's CDAP application and integration environment and Azure HDInsight, Microsoft's Hadoop cloud service, is...