
OptiMask Documentation

OptiMask is a Python package for handling NaN values in matrices efficiently. It computes the largest submatrix without NaN, where the retained rows and columns need not be contiguous. In contrast to linear programming approaches, which are optimal but computationally expensive, OptiMask uses a heuristic that relies solely on NumPy for speed. In machine learning applications, OptiMask can retain more valid data for model fitting than pandas dropna, which discards every row (or every column) that contains a NaN. It selects which columns (features) and rows (samples) to keep and which to remove, so that the largest possible NaN-free submatrix is available for training models.

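The problem differs from finding the largest rectangle of 1s in a binary matrix, which requires the rectangle to be contiguous and can be tackled with dynamic programming. Here the retained rows and columns may be scattered across the matrix, which requires a novel approach.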
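To make the objective concrete, here is a minimal exhaustive-search sketch for very small matrices. It is not OptiMask's heuristic; the function name `largest_nan_free_submatrix` is hypothetical and exists only for this illustration. For a fixed set of columns, the best choice of rows is every row with no NaN in those columns, so trying all column subsets gives the exact answer, at a cost that grows exponentially with the number of columns.

```python
from itertools import combinations

import numpy as np


def largest_nan_free_submatrix(x):
    """Exact search over all column subsets (illustrative only, exponential cost)."""
    nan_mask = np.isnan(x)
    n_cols = x.shape[1]
    best_rows = np.array([], dtype=int)
    best_cols = np.array([], dtype=int)
    best_area = 0
    for k in range(1, n_cols + 1):
        for subset in combinations(range(n_cols), k):
            cols = np.array(subset)
            # For these columns, keep every row that has no NaN in them
            rows = np.flatnonzero(~nan_mask[:, cols].any(axis=1))
            area = rows.size * cols.size
            if area > best_area:
                best_rows, best_cols, best_area = rows, cols, area
    return best_rows, best_cols


# Tiny 4x4 example with two NaN cells
x = np.ones((4, 4))
x[1, 1] = np.nan
x[2, 2] = np.nan

rows, cols = largest_nan_free_submatrix(x)
print(rows, cols)              # [0 2 3] [0 1 3]
print(rows.size * cols.size)   # 9
```

The retained rows and columns are not adjacent, which is exactly what a contiguous largest-rectangle search cannot return.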

Basic Usage

To use OptiMask, you can create an instance of the OptiMask class and apply the solve method to find the optimal rows and columns for a given 2D array or DataFrame. Here’s a basic example:

from optimask import OptiMask
import numpy as np

# Create a matrix with NaN values
data = np.zeros((60, 25))
data[17, 5:15] = np.nan
data[40:50, 8] = np.nan

# Solve for the largest submatrix without NaN values
rows, cols = OptiMask().solve(data)

# Print the results
print(f"Optimal Rows: {rows}")
print(f"Optimal Columns: {cols}")

The grey cells represent the NaN locations, the blue ones represent the valid data, and the red ones represent the rows and columns removed by the algorithm:

Example Image 1
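As a rough illustration of why mixing row and column removal helps, the sketch below counts the cells retained on the same matrix under three strategies, using NumPy only. The row and column sets in the mixed case are picked by hand for this example (dropping row 17 and column 8) and are an assumption, not output taken from OptiMask.

```python
import numpy as np

# Same matrix as in the basic example
data = np.zeros((60, 25))
data[17, 5:15] = np.nan
data[40:50, 8] = np.nan
nan_mask = np.isnan(data)

# dropna-style, rows only: discard every row that contains a NaN
rows_only = (~nan_mask.any(axis=1)).sum() * data.shape[1]

# dropna-style, columns only: discard every column that contains a NaN
cols_only = data.shape[0] * (~nan_mask.any(axis=0)).sum()

# Mixed removal (hand-picked for illustration): drop row 17 and column 8
rows = np.delete(np.arange(data.shape[0]), 17)
cols = np.delete(np.arange(data.shape[1]), 8)
sub = data[np.ix_(rows, cols)]
mixed = sub.size

print(rows_only, cols_only, mixed)   # 1225 900 1416
print(np.isnan(sub).any())           # False
```

Removing one row and one column keeps 1416 cells, against 1225 when only rows are dropped and 900 when only columns are dropped.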

OptiMask’s algorithm is useful for handling unstructured NaN patterns, as shown in the following example:

Example Image 2

For more detailed information on the parameters and usage, refer to the API reference.

Repository

The source code of the package is available at CyrilJl/OptiMask.

Citation

If you use OptiMask in your research or work, please cite it:

@INPROCEEDINGS{Joly2025-vq,
  title      = "{OptiMask}: Efficiently finding the largest {NaN-free}
                submatrix",
  booktitle  = "Proceedings of the Python in Science Conference",
  author     = "Joly, Cyril",
  abstract   = "OptiMask is a heuristic designed to compute the largest, not
                necessarily contiguous, submatrix of a matrix with missing
                data. It identifies the optimal set of columns and rows to
                remove to maximize the number of retained elements.",
  publisher  = "SciPy",
  pages      = "67--74",
  month      = jul,
  year       = 2025,
  copyright  = "https://creativecommons.org/licenses/by/4.0/",
  conference = "Python in Science Conference, 2025",
  location   = "Tacoma, Washington"
}

This paper is available at https://doi.org/10.25080/uaha7744.