[S] Call for Participation: KDD-CUP-98

Ismail Parsa (iparsa@epsilon.com)
Mon, 6 Jul 1998 14:51:46 -0400

| |
| KDD-CUP-98 |
| |
| The Second International Knowledge Discovery and |
| Data Mining Tools Competition |
| |
| Held in Conjunction with KDD-98 |
| |
| The Fourth International Conference on Knowledge |
| Discovery and Data Mining |
| [www.kdnuggets.com] or |
| [www-aig.jpl.nasa.gov/kdd98] or |
| [www.aaai.org/Conferences/KDD/1998] |
| |
| Sponsored by the |
| |
| American Association for Artificial Intelligence (AAAI) |
| Epsilon Data Mining Laboratory |
| Paralyzed Veterans of America (PVA) |

KDD-CUP is a knowledge discovery and data mining (KDDM) tools
competition held in conjunction with the International Conference on
Knowledge Discovery and Data Mining.

Last year, the CUP drew worldwide participation from 45 data mining
tools. The Gold Miner award was jointly shared by UCSD's BNB (Boosted
Naive Bayes Classifier) software and Urban Science's GainSmarts
software. SGI's MineSet was the runner-up and earned the Bronze
Miner award. For more information on KDD-CUP-97, please refer to the
URL: www.epsilon.com/new. Some of the highlights from last year's
competition are as follows:

o The success of the Naive Bayes algorithm (used by 2 of the top 3 tools)

o No clear evidence backing the hypothesis that there are "real"
returns to incremental data preprocessing activity.

KDD-CUP-98 builds on the success of last year's competition. The
CUP is again open to all KDDM tool vendors, academics with research
prototypes and corporations with significant applications. Attendance
at the KDD-98 conference is not required to participate in the CUP.

| KDD-CUP Process and Important Dates |

o Registration and signing of the NDA (Non-Disclosure Agreement)
July 1-15, 1998

o Release of the datasets (learning and validation), related
documentation and the KDD-CUP questionnaire
July 16, 1998

o Return of the results and the KDD-CUP questionnaire
August 14, 1998

o KDD-CUP Committee evaluation of the results
August 15-25, 1998

o Individual performance evaluations sent to the participants
August 25, 1998

o Public announcement of the winners and awards presentation during
KDD-98 in New York City
August 29, 1998

| KDD-CUP Data Set |

The data set for this year's Cup has been generously provided by the
Paralyzed Veterans of America (PVA). PVA is a not-for-profit
organization that provides programs and services for US veterans with
spinal cord injuries or disease. With an in-house database of over 13
million donors, PVA is also one of the largest direct mail fund
raisers in the country.

Participants in the CUP will demonstrate the performance of their tool
by analyzing the results of one of PVA's recent fund raising appeals.
This mailing was dropped in June 1997 to a total of 3.5 million PVA
donors. It included a gift "premium" of personalized name & address
labels plus an assortment of 10 note cards and envelopes. All of the
donors who received this mailing were acquired by PVA through
premium-oriented appeals like this.

The analysis data set will include:

o A subset of the 3.5 million donors sent this appeal

o A flag to indicate respondents to the appeal and the dollar amount
of their donation

o PVA promotion and giving history

o Overlay demographics, including a mix of household and area level data

Unlike last year, all available information about the fields will be
made available in the project documentation.

The objective of the analysis will be to identify responders to this
mailing -- a classification or discrimination problem.

| Performance Evaluation Criteria |

The CUP is aimed at recognizing the most accurate, innovative,
efficient and methodologically advanced data mining tools in the field.

The participants will again be evaluated based on the performance of
their algorithms on the validation or hold-out data set. The KDD-CUP
program committee will consider the following metrics in its evaluation:

o Lift curve or gains table analysis listing the cumulative percent of
targets recovered in the top quantiles of the file

o Receiver operating characteristics (ROC) curve analysis and the area
under the ROC curve

o Several statistical tests to ensure the robustness of the results.
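The CUP materials do not prescribe how these curves are computed, but
participants checking their own numbers can sketch the two curve-based
metrics in plain Python as follows. All data, function names, and the
ranking convention here are illustrative assumptions, not part of the
official evaluation code:

```python
# Illustrative sketch of cumulative gains and ROC AUC; nothing here is
# taken from the official KDD-CUP evaluation code.

def cumulative_gains(scores, labels, fraction):
    """Cumulative percent of all responders captured in the top
    `fraction` of the file when ranked by descending model score."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    cutoff = int(len(ranked) * fraction)
    captured = sum(label for _, label in ranked[:cutoff])
    return captured / sum(labels)

def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-sum identity:
    AUC = P(random positive outscores random negative),
    counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scored file: 10 records, 3 responders.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(cumulative_gains(scores, labels, 0.4))  # 1.0: top 40% holds all 3
print(roc_auc(scores, labels))
```

The rank-sum form of AUC avoids building the ROC curve explicitly; it
agrees with trapezoidal integration of the curve when ties are split.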

Last year, the performance in the top 10 percent of the file was
considered as a measure of precision while the performance in the top
40 percent of the file was considered as a measure of stability and
marketing coverage. The average performance up to the 40th percentile
was also looked at as a measure of overall performance.
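All three of last year's summary figures can be read off a simple
decile gains table. The helper and toy data below are illustrative
assumptions, not the committee's actual evaluation code:

```python
# Sketch of the three summary figures computed from a decile gains table.
# The data and the decile_gains helper are illustrative only.

def decile_gains(scores, labels):
    """Cumulative percent of responders captured at each decile cut,
    ranking records by descending model score."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda p: -p[0])]
    total, n = sum(ranked), len(ranked)
    return [sum(ranked[: round(n * d / 10)]) / total for d in range(1, 11)]

scores = [i / 20 for i in range(20, 0, -1)]          # 20 ranked records
labels = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0] + [0] * 10   # 5 responders

g = decile_gains(scores, labels)
precision_top10 = g[0]       # precision: gains in the top decile
coverage_top40 = g[3]        # stability/coverage: gains through decile 4
overall = sum(g[:4]) / 4     # overall: average gains up to the 40th pct.
print(precision_top10, coverage_top40, overall)
```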

| KDD-CUP-98 Program Committee |

o Vasant Dhar, New York University, New York, NY
o Tom Fawcett, Bell Atlantic, New York, NY
o Georges Grinstein, University of Massachusetts, Lowell, MA
o Ismail Parsa, Epsilon, Burlington, MA
o Gregory Piatetsky-Shapiro, Knowledge Stream Partners, Boston, MA
o Foster Provost, Bell Atlantic, New York, NY
o Kyuseok Shim, Bell Laboratories, Murray Hill, NJ


All participants are required to complete the application form below
and send it in plain ASCII format to (e-mail preferred):

| Ismail Parsa |
| |
| Epsilon |
| 50 Cambridge Street |
| Burlington MA 01803 USA |
| |
| E-MAIL: iparsa@epsilon.com |
| V-MAIL: (781) 273-0250*6734 |
| FAX: (781) 272-8604 |

The participants will receive the NDA (non-disclosure agreement)
before the July 15, 1998 deadline. Please contact Ismail Parsa if you
did not receive the NDA before July 15.

Last year, the KDD-CUP program committee publicly announced the names
of only the top 3 performing tools. The names of the 45 participants
were not released. This year, although we will again announce only
the names of the top 3 performing tools, we will make the list of
participants publicly available UNLESS THE PARTICIPANTS INDICATE
OTHERWISE IN THE REGISTRATION BROCHURE. We think it is fair for
everyone to know who they are competing with.

-------------------------------- cut ---------------------------------


Registration Brochure

Name of software/product/tool/research prototype:_____________________

Name of vendor/institution:___________________________________________

The KDD-CUP program committee will announce only the names of the top
3 performing tools. However, we intend to make the list of
participants publicly available according to the box checked below.
Please check the appropriate box:

(_) List my tool's name as a participant
(_) Do not list my tool's name as a participant. I wish to stay
anonymous.

Status of software/product/tool/research prototype:

(_) Alpha (_) Beta (_) Production

Release date of software/product/tool/research prototype (in YYMM or
year/month format):___________________________________________________

Platform availability (check all that apply):

(_) PC (_) UNIX (_) Mainframe (_) Parallel (SMP/MPP) (_) Other

Systems architecture (check all that apply):

(_) Client/Server (_) PC client only (_) UNIX Client only
(_) PC/UNIX server only

Built-in knowledge discovery and data mining methodology/technology
(check all that apply):

(_) Graphical User Interface (GUI)
(_) Data Access to RDBMSs
(_) Data Management (data processing, SQL, merge, summarize,
aggregate, sorting, ranking, etc.)
(_) Data Selection (random sampling, Nth selection, etc.)
(_) Data Preprocessing (missing value/outlier treatment, symbol
mapping, binning/discretization, normalization, etc.)
(_) Exploratory Data Analysis (descriptive statistics, data/
knowledge visualization, etc.)
(_) Collinearity Screening/Redundancy Elimination
(_) Variable Subset Selection
(_) Link Analysis (Associations, Sequences, etc.)
(_) Clustering or Segmentation (K-means, Kohonen clustering, etc.)
(_) Time Series Analysis
(_) Classification or Discrimination (for categorical/symbolic targets)
(_) Prediction or Regression (for continuous/numeric targets)
(_) Multiple Learned or Combined Models (boosting, arching,
bagging, etc.)
(_) Data Postprocessing (model deployment/scoring, modeling
project manager, model performance tracking, link to
business process etc.)
(_) Other, specify:______________________________________________

Data mining algorithms (check all that apply and specify the
algorithm(s) in the space provided):

(_) Supervised Neural Networks (MLP, RBF, etc.):_________________
(_) Statistical Methods (Logistic, OLS, MARS, PPR, GAM, Nearest
Neighbors, etc.):____________________________________________
(_) Decision Trees and Rules (ID3, C4.5, CHAID, CART,
etc.):_______________________________________________
(_) Hybrid Systems (Neuro-fuzzy systems, GA optimized neural/
decision tree systems, etc.):________________________________
(_) Case-Based Reasoning
(_) Other Supervised Methods (Bayesian methods, decision tables,
etc.):_______________________________________________
(_) Unsupervised Algorithms (Kohonen networks, K-means
clustering, SOM, etc.):______________________________________
(_) Associations and Sequence Discovery:_________________________
(_) Other, specify: _____________________________________________

Note: The numbers requested below will only be used to compute
participant summary statistics and serve no other purpose.

Is your software/product/tool/research prototype:

A freeware: (_) Yes (_) No
Commercially available for purchase: (_) Yes (_) No
If 'yes' to above, Price (in US$):___________________________
Number of sites installed:_______________________________________

Other relevant information:___________________________________________


E-mail Address................:
Phone Number..................:
FAX Number....................:
Name of Company/Institution...:

Mailing Address...............:


E-mail Address................:
Phone Number..................:
Name of Company/Institution...:

Mailing Address...............:

---------------------------------- cut ---------------------------------

This message was distributed by s-news@wubios.wustl.edu. To unsubscribe
send e-mail to s-news-request@wubios.wustl.edu with the BODY of the
message: unsubscribe s-news