# Starting from dimensionality reduction

Feature selection is one technique for *data dimensionality reduction*.
According to the book `Data Mining: Concepts and Techniques`, the most ubiquitous methods are:

- wavelet transforms
- principal components analysis (PCA)
- attribute subset selection (or feature selection)

It is worth mentioning that **PCA**, **Exploratory Factor Analysis (EFA)**, **SVD**, etc. are all methods that reconstruct our original attributes. PCA essentially creates new variables that are linear combinations of the original variables.
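To make the "linear combinations" point concrete, here is a minimal sketch using scikit-learn's `PCA` on random data (the data and variable names are illustrative assumptions, not part of the original post). It checks that the projected data equals the centered input multiplied by the component matrix, so every new variable mixes all of the original columns.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative random data: 100 samples, 4 original attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)

# Each new variable is a weighted sum of the centered original columns;
# the weights are the rows of pca.components_.
Z_manual = (X - pca.mean_) @ pca.components_.T

same = np.allclose(Z, Z_manual)
print(same)
print(pca.components_.shape)  # (n_components, n_original_attributes)
```

Because every row of `components_` generally has nonzero weights on all attributes, none of the original columns survives unchanged.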

However, if we want to preserve the original attributes, then take a look at **feature selection**.

# Overview of Feature Selection

There are several ways to categorize feature selection techniques: 1 2

From a problem-solving perspective, however, I divide the techniques as follows:

- **Supervised (regression)**: LASSO, RFE, Autoencoder, etc. The regression area has been investigated extensively.
- **Unsupervised**: principal feature analysis (PFA)
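As a small illustration of the supervised route, the sketch below uses scikit-learn's `Lasso` on synthetic data. The data, the `alpha` value, and the choice of which columns are informative are all assumptions made for the example. Features whose coefficient is shrunk to exactly zero are treated as dropped.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic regression problem: only columns 0 and 3 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

# alpha controls the strength of the L1 penalty (assumed value).
lasso = Lasso(alpha=0.1).fit(X, y)

# Indices of the features with a nonzero coefficient.
selected = np.flatnonzero(lasso.coef_)
print(selected)
```

With a penalty this strong relative to the noise, the uninformative columns should end up with zero coefficients, while the kept columns are still original attributes rather than combinations of them.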

# Concepts of the unsupervised method (PFA)

Steps:

- Compute the sample covariance matrix.
- Compute the principal components and eigenvalues of the covariance/correlation matrix **A**.
- Choose the subspace dimension **n**; we get a new matrix **A_n**, and the vectors **Vi** are the rows of **A_n**.
- Cluster the vectors **|Vi|** using K-Means.
- For each cluster, find the corresponding vector **Vi** which is closest to the mean of the cluster.
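To build some intuition for why clustering the rows **Vi** helps, here is a NumPy-only sketch of the first three steps on synthetic data (the data, the variable names, and the choice `n = 3` are assumptions for illustration). Column 2 is a near copy of column 0, and the sketch compares the distance between their row vectors with the distance between two unrelated features.

```python
import numpy as np

# Synthetic data: column 2 is almost a duplicate of column 0.
rng = np.random.default_rng(0)
n_samples = 500
x0 = rng.normal(size=n_samples)
x1 = rng.normal(size=n_samples)
x3 = rng.normal(size=n_samples)
x2 = x0 + 0.01 * rng.normal(size=n_samples)
X = np.column_stack([x0, x1, x2, x3])

# Step 1: covariance of the standardized data (close to the correlation matrix).
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(Xs, rowvar=False)

# Step 2: eigen-decomposition, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvecs = eigvecs[:, order]

# Step 3: keep the first n components; row i of A_n is the vector V_i.
n = 3
A_n = eigvecs[:, :n]

dist_redundant = np.linalg.norm(A_n[0] - A_n[2])  # features 0 and 2
dist_unrelated = np.linalg.norm(A_n[0] - A_n[1])  # features 0 and 1
print(dist_redundant, dist_unrelated)
```

The rows for the two redundant features sit almost on top of each other, so K-Means tends to place them in the same cluster and only one of them is kept.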

# Code

```
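from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler


class PFA(object):
    def __init__(self, n_features, q=None):
        self.q = q                    # number of principal components to keep
        self.n_features = n_features  # number of original features to select

    def fit(self, X):
        # Default to keeping as many components as there are features.
        if not self.q:
            self.q = X.shape[1]

        # Standardize so that PCA works on the correlation structure.
        sc = StandardScaler()
        X = sc.fit_transform(X)

        # Row i of A_q is the vector V_i for feature i.
        pca = PCA(n_components=self.q).fit(X)
        A_q = pca.components_.T

        # Cluster the rows of A_q (raw rows, not their absolute values).
        kmeans = KMeans(n_clusters=self.n_features).fit(A_q)
        clusters = kmeans.predict(A_q)
        cluster_centers = kmeans.cluster_centers_

        # Record each feature's distance to its cluster center.
        dists = defaultdict(list)
        for i, c in enumerate(clusters):
            dist = euclidean_distances([A_q[i, :]], [cluster_centers[c, :]])[0][0]
            dists[c].append((i, dist))

        # Keep the feature closest to the center of each cluster.
        self.indices_ = [sorted(f, key=lambda x: x[1])[0][0] for f in dists.values()]
        self.features_ = X[:, self.indices_]  # standardized columns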
```

Usage:

```
# `dataset` is a 2-D array of shape (n_samples, n_features)
pfa = PFA(n_features=10)
pfa.fit(dataset)
# To get the transformed matrix
x = pfa.features_
# To get the column indices of the kept features
column_indices = pfa.indices_
```

# Conclusion

Next time we’ll take a closer look at supervised methods.