Mathematical Insights#
[1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from optimask import OptiMask
[2]:
def plot(data, rows_to_keep=None, cols_to_keep=None, rows_to_remove=None, cols_to_remove=None, figsize=None, title=None, xticks=None, yticks=None, show=True):
cmap = plt.get_cmap("coolwarm")
cmap.set_bad('grey')
x = data.copy()
m, n = data.shape
if rows_to_keep is not None:
x[[i for i in range(m) if i not in rows_to_keep]] += 1
if cols_to_keep is not None:
x[:, [j for j in range(n) if j not in cols_to_keep]] += 1
if rows_to_remove is not None:
x[rows_to_remove] += 1
if cols_to_remove is not None:
x[:, cols_to_remove] += 1
if figsize:
plt.figure(figsize=figsize)
plt.pcolormesh(np.clip(x, 0, 1), cmap=cmap)
plt.gca().set_aspect('equal')
plt.title(title)
if xticks is not None:
plt.xticks(np.arange(n)+0.5, xticks, rotation=90, size='x-small')
if yticks is not None:
plt.yticks(np.arange(m)+0.5, yticks, size='x-small')
if (xticks is None) and (xticks is None):
plt.gca().axis('off')
if show:
plt.show()
def generate_random(m, n, ratio):
"""Missing at random arrays"""
arr = np.zeros((m, n))
nan_count = int(ratio * m * n)
indices = np.random.choice(m * n, nan_count, replace=False)
arr.flat[indices] = np.nan
return arr
1. OptiMask API#
OptiMask is easy to use. It can be applied to NumPy arrays or pandas DataFrames, returning a subarray or subframe without NaN or the retained rows/columns. Since multiple optimizations are carried out starting from random states, a random_state
parameter is provided for reproducibility.
[3]:
x = generate_random(m=50, n=50, ratio=0.025)
plot(x)
[4]:
%time rows, cols = OptiMask(random_state=0).solve(x)
len(rows), len(cols), np.any(np.isnan(x[rows][:, cols]))
CPU times: total: 0 ns
Wall time: 50.4 ms
[4]:
(31, 35, False)
In red the removed rows and columns, in blue the remaining submatrix without NaN:
[5]:
plot(x, rows_to_keep=rows, cols_to_keep=cols)
OptiMask can directly return the submatrix:
[6]:
xt = OptiMask(random_state=0).solve(x, return_data=True)
xt.shape, np.any(np.isnan(xt))
[6]:
((31, 35), False)
OptiMask offers a verbose mode to inspect the results of the intermediate optimizations:
[7]:
xt = OptiMask(random_state=0, verbose=True).solve(x, return_data=True)
xt.shape, np.any(np.isnan(xt))
Trial 1 : submatrix of size 30x32 (960 elements) found.
Trial 2 : submatrix of size 27x37 (999 elements) found.
Trial 3 : submatrix of size 31x34 (1054 elements) found.
Trial 4 : submatrix of size 31x35 (1085 elements) found.
Trial 5 : submatrix of size 25x39 (975 elements) found.
Trial 6 : submatrix of size 27x37 (999 elements) found.
Trial 7 : submatrix of size 32x31 (992 elements) found.
Trial 8 : submatrix of size 30x33 (990 elements) found.
Trial 9 : submatrix of size 31x33 (1023 elements) found.
Trial 10 : submatrix of size 27x37 (999 elements) found.
Result: the largest submatrix found is of size 31x35 (1085 elements) found.
[7]:
((31, 35), False)
OptiMask provides a n_tries parameter, which enhances the probability of finding a better solution:
[8]:
%time xt = OptiMask(n_tries=35, random_state=0, verbose=True).solve(x, return_data=True)
xt.shape, np.any(np.isnan(xt))
Trial 1 : submatrix of size 30x32 (960 elements) found.
Trial 2 : submatrix of size 27x37 (999 elements) found.
Trial 3 : submatrix of size 31x34 (1054 elements) found.
Trial 4 : submatrix of size 31x35 (1085 elements) found.
Trial 5 : submatrix of size 25x39 (975 elements) found.
Trial 6 : submatrix of size 27x37 (999 elements) found.
Trial 7 : submatrix of size 32x31 (992 elements) found.
Trial 8 : submatrix of size 30x33 (990 elements) found.
Trial 9 : submatrix of size 31x33 (1023 elements) found.
Trial 10 : submatrix of size 27x37 (999 elements) found.
Trial 11 : submatrix of size 32x32 (1024 elements) found.
Trial 12 : submatrix of size 29x34 (986 elements) found.
Trial 13 : submatrix of size 32x33 (1056 elements) found.
Trial 14 : submatrix of size 30x33 (990 elements) found.
Trial 15 : submatrix of size 27x38 (1026 elements) found.
Trial 16 : submatrix of size 29x34 (986 elements) found.
Trial 17 : submatrix of size 32x32 (1024 elements) found.
Trial 18 : submatrix of size 31x32 (992 elements) found.
Trial 19 : submatrix of size 30x31 (930 elements) found.
Trial 20 : submatrix of size 28x35 (980 elements) found.
Trial 21 : submatrix of size 25x40 (1000 elements) found.
Trial 22 : submatrix of size 27x38 (1026 elements) found.
Trial 23 : submatrix of size 29x35 (1015 elements) found.
Trial 24 : submatrix of size 32x33 (1056 elements) found.
Trial 25 : submatrix of size 23x41 (943 elements) found.
Trial 26 : submatrix of size 27x37 (999 elements) found.
Trial 27 : submatrix of size 27x37 (999 elements) found.
Trial 28 : submatrix of size 28x36 (1008 elements) found.
Trial 29 : submatrix of size 32x32 (1024 elements) found.
Trial 30 : submatrix of size 29x34 (986 elements) found.
Trial 31 : submatrix of size 24x40 (960 elements) found.
Trial 32 : submatrix of size 30x34 (1020 elements) found.
Trial 33 : submatrix of size 27x37 (999 elements) found.
Trial 34 : submatrix of size 36x29 (1044 elements) found.
Trial 35 : submatrix of size 28x36 (1008 elements) found.
Result: the largest submatrix found is of size 31x35 (1085 elements) found.
CPU times: total: 46.9 ms
Wall time: 137 ms
[8]:
((31, 35), False)
OptiMask can also handle pandas DataFrames:
[9]:
df = pd.DataFrame(x, columns=[f'feature {k}' for k in range(x.shape[1])])
df.sample(3)
[9]:
feature 0 | feature 1 | feature 2 | feature 3 | feature 4 | feature 5 | feature 6 | feature 7 | feature 8 | feature 9 | ... | feature 40 | feature 41 | feature 42 | feature 43 | feature 44 | feature 45 | feature 46 | feature 47 | feature 48 | feature 49 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
49 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
10 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 rows × 50 columns
[10]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 50 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 feature 0 49 non-null float64
1 feature 1 48 non-null float64
2 feature 2 49 non-null float64
3 feature 3 48 non-null float64
4 feature 4 49 non-null float64
5 feature 5 49 non-null float64
6 feature 6 49 non-null float64
7 feature 7 47 non-null float64
8 feature 8 48 non-null float64
9 feature 9 50 non-null float64
10 feature 10 47 non-null float64
11 feature 11 49 non-null float64
12 feature 12 49 non-null float64
13 feature 13 50 non-null float64
14 feature 14 49 non-null float64
15 feature 15 48 non-null float64
16 feature 16 50 non-null float64
17 feature 17 47 non-null float64
18 feature 18 48 non-null float64
19 feature 19 49 non-null float64
20 feature 20 47 non-null float64
21 feature 21 49 non-null float64
22 feature 22 49 non-null float64
23 feature 23 49 non-null float64
24 feature 24 48 non-null float64
25 feature 25 50 non-null float64
26 feature 26 50 non-null float64
27 feature 27 48 non-null float64
28 feature 28 49 non-null float64
29 feature 29 50 non-null float64
30 feature 30 48 non-null float64
31 feature 31 48 non-null float64
32 feature 32 50 non-null float64
33 feature 33 48 non-null float64
34 feature 34 49 non-null float64
35 feature 35 48 non-null float64
36 feature 36 49 non-null float64
37 feature 37 48 non-null float64
38 feature 38 48 non-null float64
39 feature 39 48 non-null float64
40 feature 40 48 non-null float64
41 feature 41 49 non-null float64
42 feature 42 50 non-null float64
43 feature 43 50 non-null float64
44 feature 44 50 non-null float64
45 feature 45 50 non-null float64
46 feature 46 49 non-null float64
47 feature 47 49 non-null float64
48 feature 48 49 non-null float64
49 feature 49 49 non-null float64
dtypes: float64(50)
memory usage: 19.7 KB
[11]:
dft = OptiMask(n_tries=35).solve(df, return_data=True)
dft.info()
<class 'pandas.core.frame.DataFrame'>
Index: 28 entries, 0 to 49
Data columns (total 38 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 feature 0 28 non-null float64
1 feature 1 28 non-null float64
2 feature 2 28 non-null float64
3 feature 5 28 non-null float64
4 feature 6 28 non-null float64
5 feature 8 28 non-null float64
6 feature 9 28 non-null float64
7 feature 10 28 non-null float64
8 feature 11 28 non-null float64
9 feature 13 28 non-null float64
10 feature 14 28 non-null float64
11 feature 15 28 non-null float64
12 feature 16 28 non-null float64
13 feature 19 28 non-null float64
14 feature 22 28 non-null float64
15 feature 23 28 non-null float64
16 feature 24 28 non-null float64
17 feature 25 28 non-null float64
18 feature 26 28 non-null float64
19 feature 27 28 non-null float64
20 feature 28 28 non-null float64
21 feature 29 28 non-null float64
22 feature 30 28 non-null float64
23 feature 31 28 non-null float64
24 feature 32 28 non-null float64
25 feature 33 28 non-null float64
26 feature 34 28 non-null float64
27 feature 37 28 non-null float64
28 feature 38 28 non-null float64
29 feature 39 28 non-null float64
30 feature 40 28 non-null float64
31 feature 42 28 non-null float64
32 feature 43 28 non-null float64
33 feature 44 28 non-null float64
34 feature 45 28 non-null float64
35 feature 47 28 non-null float64
36 feature 48 28 non-null float64
37 feature 49 28 non-null float64
dtypes: float64(38)
memory usage: 8.5 KB
[12]:
index, columns = OptiMask(n_tries=35).solve(df)
index, columns
[12]:
(Index([ 0, 1, 4, 6, 7, 8, 10, 12, 13, 14, 15, 17, 18, 19, 21, 22, 23, 27,
28, 30, 32, 33, 34, 40, 41, 42, 44, 45, 46, 47, 49],
dtype='int64'),
Index(['feature 0', 'feature 2', 'feature 5', 'feature 6', 'feature 8',
'feature 9', 'feature 10', 'feature 13', 'feature 14', 'feature 15',
'feature 16', 'feature 18', 'feature 19', 'feature 20', 'feature 22',
'feature 23', 'feature 25', 'feature 26', 'feature 28', 'feature 29',
'feature 31', 'feature 32', 'feature 33', 'feature 34', 'feature 36',
'feature 37', 'feature 38', 'feature 39', 'feature 40', 'feature 42',
'feature 43', 'feature 44', 'feature 45', 'feature 47', 'feature 49'],
dtype='object'))
2. Understanding the problem at hand#
In the context of a matrix containing a lone NaN cell, the central challenge emerges: determining whether to eliminate the corresponding row or column. This decision is readily resolved by examining the matrix’s shape. Specifically, if the matrix has a greater number of rows than columns, the optimal approach involves removing the associated row. Conversely, when the matrix has more columns than rows, the most effective course of action is to eliminate the corresponding column:
[13]:
x = np.zeros((12, 5))
x[7, 1] = np.nan
plt.figure()
plt.subplot(1, 3, 1)
plot(x, show=False, title='original data')
plt.subplot(1, 3, 2)
plot(x, rows_to_remove=[7], show=False, title='remove row')
plt.subplot(1, 3, 3)
plot(x, cols_to_remove=[1], show=False, title='remove column')
plt.show()
Eliminating the highlighted row yields the most extensive submatrix devoid of NaN. This solution remains optimal even when additional NaN values are present within the same row:
[14]:
x = np.zeros((12, 5))
x[7, [1, 3]] = np.nan
plt.figure()
plt.subplot(1, 3, 1)
plot(x, show=False, title='original data')
plt.subplot(1, 3, 2)
plot(x, rows_to_remove=[7], show=False, title='remove row')
plt.subplot(1, 3, 3)
plot(x, cols_to_remove=[1, 3], show=False, title='remove columns')
plt.show()
But what about more complex cases ?
[15]:
m, n = 60, 20
ratio = 0.03
x = generate_random(m, n, ratio)
plot(x)
3. The Problem from an Optimization Perspective#
This problem can be formalized as a binary optimization problem in two ways.
3.1. Linear programming#
Source : this mathematica.stackexchange answer, posted by unlikely.
Given:
Matrix \(A\) of shape \(m \times n\) and elements \(a_{i,j}\)
The goal is to determine the values of variables \((i \in [1~..~m]\), \(j \in [1~..~n])\):
\(e_{i,j} \in \{0,1\}\) (1 if element \((i,j)\) should be removed, 0 otherwise)
\(r_i \in \{0,1\}\) (1 if row \(i\) should be removed, 0 otherwise)
\(c_j \in \{0,1\}\) (1 if column \(j\) should be removed, 0 otherwise)
Subject to:
\(e_{i,j} = 1\) for every \((i,j)\) such that \(a_{i,j}\) is a NaN
\(r_i + c_j \geq e_{i,j}\), meaning if \(e_{i,j} = 1\) then either \(r_i = 1\) or \(c_j = 1\), or both
\(e_{i,j} \geq r_i\), indicating if \(r_i = 1\) then \(e_{i,j}\) must be 1
\(e_{i,j} \geq c_j\), indicating if \(c_j = 1\) then \(e_{i,j}\) must be 1
The objective is to minimize the total number of deleted cells:
The optimal values of \(r_i\) and \(c_j\) provide us with the specific rows and columns to remove, ensuring that the remaining matrix is efficiently processed and devoid of NaN values. This problem can be solved in Python using the usual tools of linear programming, such as combining Pyomo with GLPK. Its disadvantage is being relatively expensive, as for an \(m \times n\) matrix, more than \(m \times n\) (binary) variables are used.
[16]:
from pyomo.environ import Binary, ConcreteModel, ConstraintList, Objective, SolverFactory, Var, minimize
def linear_programming(matrix):
m, n = matrix.shape
model = ConcreteModel()
# Variables
model.e = Var(range(1, m + 1), range(1, n + 1), within=Binary)
model.r = Var(range(1, m + 1), within=Binary)
model.c = Var(range(1, n + 1), within=Binary)
# Objective
model.obj = Objective(expr=sum(model.e[i, j] for i in range(1, m + 1) for j in range(1, n + 1)), sense=minimize)
# Constraints
positions = [(i, j) for i in range(1, m + 1) for j in range(1, n + 1)]
nan_positions = [(i, j) for (i, j) in positions if np.isnan(matrix[i - 1, j - 1])]
model.nan_constraints = ConstraintList()
for i, j in nan_positions:
model.nan_constraints.add(model.e[i, j] == 1)
model.other_constraints = ConstraintList()
for i, j in positions:
model.other_constraints.add(model.r[i] + model.c[j] >= model.e[i, j])
model.other_constraints.add(model.e[i, j] >= model.r[i])
model.other_constraints.add(model.e[i, j] >= model.c[j])
# Solve the model
solver = SolverFactory('glpk')
solver.solve(model)
# Extract the results
rows_to_keep = [i-1 for i in range(1, m + 1) if model.r[i].value == 0]
cols_to_keep = [j-1 for j in range(1, n + 1) if model.c[j].value == 0]
return rows_to_keep, cols_to_keep
%time rows, cols = linear_programming(x)
plot(x, rows_to_keep=rows, cols_to_keep=cols)
print(f"The largest submatrix without NaN is of size {len(rows)}x{len(cols)} ({len(rows)*len(cols)} elements among {xt.size})")
CPU times: total: 188 ms
Wall time: 693 ms
The largest submatrix without NaN is of size 38x18 (684 elements among 1085)
Although producing the optimal solution, this approach becomes intractable for large-sized matrices.
3.2. Quadratic Programming#
By employing the variables mentioned earlier, an alternate perspective of the problem involves maximizing
while considering: - \(r_i + c_j \ge 1\) for each \((i,j)\) where \(a_{i,j}\) is a NaN.
Although the number of variables is reduced, the optimization problem becomes more challenging.
4. OptiMask#
4.1. The algorithm#
OptiMask’s algorithm aims to identify the optimal set of rows and columns for removal, maximizing the size of the submatrix without NaN values. Notably, a formal proof of convergence for this algorithm is currently unavailable.
The algorithm iteratively computes permutations of rows and columns until a specific convergence criterion is met. This criterion relies on detecting a contiguous and well-ordered NaN frontier, similar to a Pareto efficiency frontier. Once identified, the problem simplifies to finding the largest contiguous rectangle, a computationally straightforward task. The stored permutations help pinpoint the rows and columns to be removed.
OptiMask’s algorithm is heuristic, introducing uncertainty compared to a linear programming approach. To increase the likelihood of an optimal solution, the algorithm undergoes multiple optimizations through repeated restarts with random permutations. The n_tries
parameter controls the number of restarts, while the random_state parameter ensures reproducibility.
Let x
represent the array with missing values to be processed:
[17]:
x = generate_random(m=50, n=30, ratio=0.05)
plot(x)
The first step is to isolate the rows and columns with at least one NaN value. We know that the rows or columns with zero NaN values will necessarily be part of the sought submatrix. Now we work with a boolean matrix, which is set to True at the NaN cells of the original data. We introduce hx
and hy
, representing the indices of the highest True in each column and the indices of the rightmost True in each row, respectively.
[18]:
def heights(x, axis=0):
return x.shape[axis] - np.argmax(np.flip(x, axis=axis), axis=axis)
[19]:
nan_rows, nan_cols = np.isnan(x).nonzero()
nan_rows, nan_cols = np.unique(nan_rows), np.unique(nan_cols)
xp = np.isnan(x[nan_rows][:, nan_cols])
hx, hy = heights(xp, axis=0), heights(xp, axis=1)
plot(xp, xticks=hx, yticks=hy)
The algorithm’s core involves computing a series of permutations with the objective of ordering hx
and hy
to establish a Pareto frontier of NaN. This implies that both hx
and hy
should exhibit a decreasing trend. Importantly, at each step, the algorithm performs a permutation to order hx
, followed by a permutation to order hy
, and then alternates back and forth between hx
and hy
until both sequences exhibit a decreasing order.
[20]:
hx, hy = heights(xp, axis=0), heights(xp, axis=1)
k = 0
while not OptiMask._is_pareto_ordered(hx, hy):
axis = (k % 2)
p_step = np.argsort(-heights(xp, axis=axis))
if axis == 0:
xp = xp[:, p_step]
if axis == 1:
xp = xp[p_step]
hx, hy = heights(xp, axis=0), heights(xp, axis=1)
plot(xp, xticks=hx, yticks=hy, title=f'Iteration {k+1}')
k += 1
Then, we can now seek the largest contiguous rectangle starting from the upper right of the matrix (without forgetting the rows and columns without NaN, not displayed here). Since we have kept track of the successive permutations, both over rows and columns, the algorithm can return the rows and columns to remove (or keep).
This computation is performed n_tries
times over a copy of the matrix to process, where its rows and columns are each time randomly permuted. Each iteration results in a different NaN frontier and a different set of rows and columns to remove. After the n_tries
computations, the set that leads to the largest submatrix without NaN is retained. Therefore, increasing n_tries leads to better results, and it may even reach the optimal one.
4.2. n_tries
influence on the quality of the solution#
Given an m
xn
matrix with missing values occurring at a random frequency of ratio
, how does increasing n_tries
influence the size of the computed submatrix?
[21]:
def convergence_plot(m, n, ratio, n_tries, n_permutations=1000):
def cummax(arr): return np.maximum.accumulate(arr)
x = generate_random(m, n, ratio)
opt_rows, opt_cols = linear_programming(x)
opt_size = len(opt_rows) * len(opt_cols)
optimask = OptiMask(n_tries=1)
ret = np.zeros(n_tries)
for k in range(n_tries):
rows, cols = optimask.solve(x)
ret[k] = len(rows) * len(cols)
ret_mean = np.zeros_like(ret, dtype=float)
plt.figure(figsize=(8, 3))
for _ in range(n_permutations):
ret_iter = cummax(ret[np.random.permutation(len(ret))])
plt.plot(ret_iter, c='grey', lw=0.05)
ret_mean += ret_iter / n_permutations
plt.plot(ret_mean, c='k', lw=2)
plt.axhline(y=opt_size, c='r', linestyle='dashed')
plt.xlabel('number of random restarts')
plt.ylabel('size of the solution')
plt.title("Effect of Randomized Restarts on Heuristic Solution Quality")
plt.xlim(0, n_tries)
plt.text(x=1.025*n_tries, y=opt_size, s='size of the optimal solution', c='r')
plt.show()
[22]:
convergence_plot(m=40, n=40, ratio=0.025, n_tries=100)
4.3. What about structured NaN patterns ?#
OptiMask is also efficient on structured NaN patterns (the largest submatrix found is in blue, the removed cells are in red):
[23]:
def solve_and_plot(x):
rows, cols = OptiMask().solve(x)
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plot(x, show=False)
plt.subplot(1, 2, 2)
plot(x, rows_to_keep=rows, cols_to_keep=cols)
plt.show()
[24]:
n = 31
[25]:
x = np.zeros((n, n))
iy, ix = np.indices(x.shape)
x[np.mod(iy+ix, 2) == 0] = np.nan
solve_and_plot(x)
[26]:
x = np.zeros((n, n))
iy, ix = np.indices(x.shape)
x[(np.mod(ix, 2) == 0) & (np.mod(iy, 2) == 0)] = np.nan
solve_and_plot(x)
[27]:
x = np.zeros((n, n))
iy, ix = np.indices(x.shape)
x[np.mod(iy+ix, 3) == 0] = np.nan
solve_and_plot(x)
[28]:
x = np.zeros((n, n))
iy, ix = np.indices(x.shape)
x[np.mod(iy+ix, 8) == 0] = np.nan
solve_and_plot(x)
[29]:
x = np.zeros((n, n))
iy, ix = np.indices(x.shape)
x[np.mod(iy+ix, 12) == 0] = np.nan
solve_and_plot(x)
4.4. Dealing with large matrices#
OptiMask
can efficiently handle large matrices within a reasonable amount of time. Its time complexity is proportional to the number of NaN cells, and the number n_tries
.
[30]:
x = generate_random(m=10_000, n=100, ratio=0.02)
%time rows, cols = OptiMask().solve(x)
np.isnan(x[rows][:, cols]).any()
CPU times: total: 46.9 ms
Wall time: 72 ms
[30]:
False
[31]:
x = generate_random(m=1_000, n=1_000, ratio=0.02)
%time rows, cols = OptiMask().solve(x)
np.isnan(x[rows][:, cols]).any()
CPU times: total: 62.5 ms
Wall time: 78.6 ms
[31]:
False
[32]:
x = generate_random(m=100_000, n=1_000, ratio=0.02)
%time rows, cols = OptiMask().solve(x)
np.isnan(x[rows][:, cols]).any()
CPU times: total: 3.73 s
Wall time: 4.73 s
[32]:
False