Data Mining, Quant, Statistics, Computer Science: Jobs, Resumes, Directory

These are difficult mathematical questions that arise from real applications such as fraud detection, arbitrage and scoring systems. If you have an interesting answer to any of them, feel free to email us your comments or solution. The best answers will be published here. Companies and organizations interested in submitting problems should email us.

Approximate Solutions to Linear Regression Problems

Here we assume that we have a first-order solution to a regression problem, in the form

Y = Σ w_i R_i,
where Y is the response, w_i are the regression coefficients, and R_i are the independent variables. The number of variables is very large, and the independent variables are highly correlated. We want to improve the model by considering a second-order regression of the form

Y = Σ w_i R_i + Σ w_ij c_ij m_ij R_i R_j,
where

c_ij = correlation between R_i and R_j
w_ij = |w_i w_j|^0.5 × sgn(w_i w_j)
m_ij are arbitrary constants
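
As an illustration of these definitions, here is a minimal Python sketch that evaluates the second-order term for a single observation. The function names `pair_weight` and `second_order_term` are hypothetical, and the sum is assumed to run over pairs i < j, as in the cluster sums; the text does not state the range of the general sum explicitly.

```python
import math


def pair_weight(w_i, w_j):
    """Return w_ij = |w_i w_j|^0.5 * sgn(w_i w_j)."""
    product = w_i * w_j
    if product == 0:
        return 0.0  # sgn(0) = 0, so the pair contributes nothing
    return math.copysign(math.sqrt(abs(product)), product)


def second_order_term(w, c, m, r):
    """Return the sum of w_ij * c_ij * m_ij * R_i * R_j over pairs i < j.

    w: first-order coefficients w_i
    c: correlation matrix c_ij (list of lists)
    m: constants m_ij (list of lists)
    r: values R_i of the independent variables for one observation
    Assumption: the sum is restricted to i < j.
    """
    total = 0.0
    n = len(w)
    for i in range(n):
        for j in range(i + 1, n):
            total += pair_weight(w[i], w[j]) * c[i][j] * m[i][j] * r[i] * r[j]
    return total
```

Adding this value to Σ w_i R_i gives the second-order prediction for that observation.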
In practice, some of the R_i are highly correlated and form clusters. These clusters can be identified by applying a clustering algorithm to the correlations c_ij. For example, one could consider a model with two clusters A and B, such as

Y = Σ w_i R_i + m_A Σ_A w_ij c_ij R_i R_j + m_B Σ_B w_ij c_ij R_i R_j
where

Σ_A (resp. Σ_B) are taken over all i < j belonging to A (resp. B)
m_ij = m_A (constant) if i, j belong to cluster A
m_ij = m_B (constant) if i, j belong to cluster B
An interesting case occurs when the cluster structure is so strong that

|c_ij| = 1 if i and j belong to the same cluster (either A or B)
c_ij = 0 otherwise
This particular case results in

m_A = 4 / [1 + (1+8k_A)^0.5]
m_B = 4 / [1 + (1+8k_B)^0.5]
where k_A= Σ_A |c_ij| and k_B= Σ_B |c_ij|.
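
A quick numerical check of this formula, as a sketch. When |c_ij| = 1 for every pair in a cluster of n variables (n is a symbol introduced here, not in the text above), k equals the number of pairs, n(n-1)/2. Then 1 + 8k = (2n-1)^2, so the formula reduces to m = 2/n. The function names are hypothetical.

```python
import math


def cluster_multiplier(k):
    """Return m = 4 / [1 + (1 + 8k)^0.5]."""
    return 4.0 / (1.0 + math.sqrt(1.0 + 8.0 * k))


def pair_count(size):
    """Number of pairs i < j in a cluster of `size` variables."""
    return size * (size - 1) // 2


for size in (2, 3, 5, 10):
    k = pair_count(size)  # equals the sum of |c_ij| when every |c_ij| = 1
    # The last two printed columns should agree up to rounding.
    print(size, k, cluster_multiplier(k), 2.0 / size)
```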

Question

If the cluster structure is only moderately strong, with the correlations c_ij close to 1, -1 or 0, how accurate are the above formulas for m_A and m_B in terms of k_A and k_B? Here we assume that the w_i are known or approximated. Typically, w_i is a constant or a simple function of the correlation between Y and R_i.

Alternate Approach

Let us consider a simplified model involving one cluster, with m_ij = constant = m. For instance, the single cluster could consist of all variables i, j with |c_ij| > 0.70. The model can be written as

Y = Σ w_i R_i + m Σ w_ij c_ij R_i R_j.
We want to find the value of m that gives the largest improvement over the first-order model in terms of residual error. The first-order model corresponds to m = 0. Let us introduce the following notation:

W = Σ w_ij c_ij R_i R_j,
V = W - u, where u = average(W), so that V is W centered to mean 0,
S = Σ w_i R_i, with average(S) = average(Y) by construction.
Without loss of generality, let us consider the slightly modified (centered) model

Y = S + m V.
Minimizing the sum of squared residuals then gives

m = [ V^T (Y - S) ] / [ V^T V ],

where V^T is the transpose of V, the vectors Y, S, and V each have n rows, and n is the number of observations.
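
The closed-form expression for m can be sketched in plain Python as follows. The function name `best_multiplier` and the toy data are illustrative assumptions. The inputs are the per-observation values of Y, S and W, and W is centered inside the function to obtain V.

```python
def best_multiplier(y, s, w):
    """Least-squares m for the centered model Y = S + m V, with V = W - average(W).

    y, s, w: equal-length lists holding Y, S and W for each observation.
    """
    u = sum(w) / len(w)
    v = [w_k - u for w_k in w]  # centered W
    numerator = sum(v_k * (y_k - s_k) for v_k, y_k, s_k in zip(v, y, s))
    denominator = sum(v_k * v_k for v_k in v)
    if denominator == 0:
        raise ValueError("W is constant across observations, so m is undefined")
    return numerator / denominator


# Toy data built so that Y = S + 0.5 V exactly.
s = [1.0, 2.0, 3.0, 4.0]
w = [2.0, 4.0, 6.0, 8.0]
y = [-0.5, 1.5, 3.5, 5.5]
print(best_multiplier(y, s, w))
```

If the result is close to 0, the cluster adds little beyond the first-order model.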

Further Improvements

The alternate approach could be incorporated into an iterative algorithm in which a new cluster is added at each step. Each step would repeat the same computation for m, minimizing the residual error of

Y = S + m V.
This time, however, S would contain all the clusters added during the previous steps, and V would contain the new cluster being added to the model.
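
One possible reading of this iterative scheme, as a hedged sketch: each cluster contributes its own W vector, and the multiplier for a new cluster is fitted against the residual left by S plus all previously added clusters. The function name `fit_clusters_stagewise`, the order in which clusters are added, and the choice to set m to 0 for a constant W are assumptions made for illustration only.

```python
def fit_clusters_stagewise(y, s, cluster_terms):
    """Add clusters one at a time, fitting one multiplier per step.

    y: observed responses Y
    s: first-order predictions S
    cluster_terms: list of W vectors, one per cluster, in the order added
    Returns (multipliers, fitted), where fitted is the final model output.
    """
    fitted = list(s)  # plays the role of S at the current step
    multipliers = []
    for w in cluster_terms:
        u = sum(w) / len(w)
        v = [w_k - u for w_k in w]  # centered term for the new cluster
        denominator = sum(v_k * v_k for v_k in v)
        if denominator == 0:
            m = 0.0  # assumption: a constant W adds nothing to the model
        else:
            m = sum(v_k * (y_k - f_k)
                    for v_k, y_k, f_k in zip(v, y, fitted)) / denominator
        multipliers.append(m)
        fitted = [f_k + m * v_k for f_k, v_k in zip(fitted, v)]
    return multipliers, fitted
```

Multipliers fitted in earlier steps are not revisited in this sketch, so the result can depend on the order in which clusters are added.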
