Manage Learn to apply best practices and optimize your operations.

Calculating mode in data mining projects

Data mining

Looking for something else?

Data Mining: Know it All

In this chapter from Data Mining: Know it All, learn learn how to calculate the average value or the number of records that represent certain events during the data rollup process.

2.5 Rollup with Sums, Averages, and Counts

In addition to finding the sum of a value variable during the rollup, it may also be more meaningful sometimes to calculate average value or the number of records that represent certain events -- for example, number of deposits, number of withdrawals, or number of mailings a customer received responding to an offer.

An introduction to data mining

Fielded applications of data mining and machine learning

The difference between machine learning and statistics in data mining

Information and examples on data mining and ethics

Data acquisition and integration techniques

What is a data rollup?

Calculating mode in data mining projects

Using data merging and concatenation techniques to integrate data

In our rollup macro, these requirements would alter only the middle part of our code, where we calculated the cumulative value of the Value variable. The following code segment would modify the macro to calculate the average value and the number of transactions for each account type instead of the total:

Step 2
data Temp1;
retain TOT 0;
retain NT 0;
set & TDS;
by & IDVar & TimeVar & TypeVar;
if fi rst. & TypeVar then TOT = 0;
TOT = TOT + & Value;
if & Value ne . then NT = NT + 1;
if last. & TypeVar then
do;
AVG = TOT/ NT;
output;
NT = 0;
end;
drop & Value;
run;

Furthermore, the code inside the %do loop should also reflect our interest in transposing the values of the average variable, _AVG. Therefore, the code will be as follows:

Step 4
%do i = 1 %to & N;
0proc transpose
data = Temp1
out = R & i
prefi x = %substr( & & Cat & i, 1, & Nchars) ;
by & IDVar & TypeVar;
ID & TimeVar;
var AVG;
where & TypeVar = " & & Cat & i " ;
run;
%end;

The complete code for the modified code to roll up the average value is included in the macro ABRollup() .

2.6 Calculation of the Mode

Another useful summary statistic is the mode, which is used in both the rollup stage and the event-driven architecture (EDA). The mode is the most common category of transaction. The mode for nominal variables is equivalent to the use of the average or the sum for the continuous case. For example, when customers use different payment methods, it may be beneficial to identify the payment method most frequently used by each customer.

The computation of the mode on the mining view entity level from a transaction dataset is a demanding task because we need to search for the frequencies of the different categories for each unique value of the entity variable. The macro shown in Table 2.5 is based on a classic SQL query for finding the mode on the entity level from a transaction table. The variable being searched is XVar, and the entity level is identified through the unique value of the variable IDVar:

%macro VarMode(TransDS, XVar, IDVar, OutDS);
/ * A classic implementation of the mode of transactional
data using SQL * /
proc sql noprint;
create table & OutDS as

Table 2.5 Parameters of VarMode ( ) Macro

 Header Parameter VarMode (TransDS, XVar, IDVar, OutDS) Description TransDS Input transaction dataset XVar Variable for which the mode is to be calculated IDVar ID variable OutDS The output dataset with the mode for unique IDs

SELECT & IDVar , MIN( & XVar ) AS mode
FROM (
SELECT & IDVar, & XVar
FROM & TransDS p1
GROUP BY & IDVar, & XVar
HAVING COUNT( * ) =
(SELECT MAX(CNT )
FROM (SELECT COUNT( * ) AS CNT
FROM & TransDS p2
WHERE p2. & IDVar = p1. & IDVar
GROUP BY p2. & XVar
) AS p3
)
) AS p
GROUP BY p. & IDVar;
quit;
%mend;

The query works by calculating a list holding the frequency of the XVar categories, identified as CNT, then using the maximum of these counts as the mode. The query then creates a new table containing IDVar and XVar where the XVar category frequency is equal to the maximum count, that is, the mode.

More on data mining:

The preceding compound SELECT statement is computationally demanding because of the use of several layers of GROUP BY and HAVING clauses. Indexing should always be considered when dealing with large datasets. Sometimes it is even necessary to partition the transaction dataset into smaller datasets before applying such a query to overcome memory limitations.

This was last published in May 2009

Content

Find more PRO+ content and other member only offers, here.

Start the conversation

Send me notifications when other members comment.

SearchDataManagement

• Streaming tool from StreamSets eyes data in motion for GDPR

StreamSets software for inspecting big data brings governance to data in motion. Such capabilities may find more use as the ...

• Data expert: GDPR deadline is an opportunity, not a burden

There is stress as the EU's General Data Protection Regulation compliance deadline nears, but the GDPR privacy movement is a good...

In big data news, we find Google TPUs, or Tensor Processing Units, offered as a cloud service, while LinkedIn is open sourcing a ...

SearchAWS

• Try this data science experiment for deep learning insights

Deep learning professionals require a specific set of skills and tools. Follow this simple example of a data science project to ...

• Zero trust model boosts AWS infrastructure security

Security worries slowed early cloud adoption, and while many organizations eventually migrated, they failed to lock down access. ...

• VMware cloud services tidy up AWS migration path

VMware Cloud on AWS has improved migrations, cost controls and availability, as VMware pitches customers on the value of hosting ...

SearchContentManagement

• Scrivito unveils serverless CMS product

By building the CMS with ReactJS, Scrivito gained attraction with development community, according to an analyst.

• Content personalization tools sharpen focus on customers

Content personalization isn't new; Amazon weaponized it, and Jeff Bezos is the world's richest man. New tools are putting it ...

• Leading brands see the need for personalized content

Content personalization continues to expand within companies as maturing technologies make it a viable marketing option for ...

SearchCRM

• Social media customer care important to brands, experts say

In this Pipeline podcast, we discuss what was a primary topic at Social Media Marketing World: how companies should interact with...

• Inside sales strategy, millennials highlight Accelerate 2018

Learn what happens when a SearchCRM editor joins a panel as the paid skeptic to discuss whether AI technologies provide real ...

• Digital sales transformation may require shadow IT dark ops

Sales tech initiatives culminating with flipping the AI switch require operational and IT prep involving data aggregation, ...

SearchOracle

• Oracle cloud services aim to pull users out of IT comfort zone

Oracle has gone all-in on cloud computing and expanded its line of cloud services in the hopes users will make the jump. But ...

• Seven to-do items for managing an Oracle cloud migration

Migrating Oracle systems to the cloud is a big initiative for any organization. Make sure you put a solid migration plan in place...

• Oracle sees Autonomous Database changing DBA roles for the good

The job duties of database administrators will likely change in organizations that use Oracle Autonomous Database Cloud, but in ...

SearchSAP

• SAP Ariba Live focuses on procurement for purpose

SAP Ariba Live 2018 focused on the idea that businesses can use procurement technology to do good in the world; for example, by ...

• SAP debuts consumption-based pricing model for SAP Cloud

SAP Cloud Platform is now available as a consumption-based model, an alternative to the subscription model. SAP also updated the ...

SAP paid \$2.4 billion to acquire lead-to-money vendor CallidusCloud, and analysts agree that the significant price may be worth ...

SearchSQLServer

SQL Operations Studio simplifies routine administration of SQL Server and Azure SQL databases, making database development and ...

• Meltdown and Spectre fixes eyed for SQL Server performance issues

Microsoft has responded to the Spectre and Meltdown chip vulnerabilities with patches and other fixes. But IT teams need to sort ...

• Five SQL Server maintenance steps you should take -- ASAP

Putting off SQL Server administration tasks can lead to database problems. Enact these often-neglected maintenance items to help ...

SearchSalesforce

• Salesforce small business tools get major refresh

Emphasizing the 'S' in 'SMB,' Salesforce goes after companies with 20 employees or less with Salesforce Essentials -- with hooks ...

• Salesforce e-commerce gets B2B boost with CloudCraze purchase

To improve its e-commerce offerings, the CRM company bought B2B commerce product CloudCraze for an undisclosed amount. The ...

• Salesforce GDPR marketing compliance involves rethinking lead gen

The bigger -- and more complicated -- your marketing tech stack, the more complex your Salesforce GDPR compliance will be, ...

Close