API Reference

class optimask.OptiMask(n_tries=10, max_steps=32, random_state=None, verbose=False)

OptiMask computes which rows and columns of a 2D array or DataFrame to retain so that the remaining submatrix contains no NaN values while preserving as many cells as possible. The class uses a heuristic optimization approach, so the result is not guaranteed to be optimal; increasing n_tries generally improves it and may reach or closely approach the true optimum.

Parameters:

n_tries (int) – The number of optimization attempts. Higher values may lead to better results.
max_steps (int) – The maximum number of steps to perform in each optimization attempt.
random_state (Union[int, None]) – Seed for the random number generator.
verbose (bool) – If True, print verbose information during optimization.

from optimask import OptiMask
import numpy as np

# Create a matrix with NaN values
m = 120
n = 7
data = np.zeros(shape=(m, n))
data[24:72, 3] = np.nan
data[95, :5] = np.nan

# Solve for the largest submatrix without NaN values
rows, cols = OptiMask().solve(data)

# Calculate the fraction of the original cells retained in the result
coverage_ratio = len(rows) * len(cols) / data.size

# Check if there are any NaN values in the selected submatrix
has_nan_values = np.isnan(data[rows][:, cols]).any()

# Print or display the results
print(f"Coverage Ratio: {coverage_ratio:.2f}, Has NaN Values: {has_nan_values}")
# Output: Coverage Ratio: 0.85, Has NaN Values: False
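
Because the solver is heuristic, it can be useful to compare its result against an exact answer on small inputs. The following is a minimal sketch of such a check, not part of optimask and not the algorithm the library uses: for a fixed set of columns the best rows are exactly those without NaN in those columns, so enumerating every column subset gives the true optimum. The function name brute_force_largest_submatrix is hypothetical, and the cost grows as 2**n with the number of columns, so this is only practical for narrow matrices.

```python
from itertools import combinations

import numpy as np


def brute_force_largest_submatrix(data):
    """Exhaustive reference (hypothetical helper, not part of optimask).

    Tries every non-empty column subset and keeps the rows that contain
    no NaN in those columns. Returns the (rows, cols) with the most cells.
    """
    nan_mask = np.isnan(data)
    n_cols = data.shape[1]
    best_rows = np.array([], dtype=int)
    best_cols = np.array([], dtype=int)
    best_cells = 0
    for k in range(1, n_cols + 1):
        for subset in combinations(range(n_cols), k):
            cols = np.array(subset)
            # Rows that are NaN-free within the chosen columns
            rows = np.flatnonzero(~nan_mask[:, cols].any(axis=1))
            cells = len(rows) * len(cols)
            if cells > best_cells:
                best_rows, best_cols, best_cells = rows, cols, cells
    return best_rows, best_cols


# Same matrix as in the example above
data = np.zeros(shape=(120, 7))
data[24:72, 3] = np.nan
data[95, :5] = np.nan

rows, cols = brute_force_largest_submatrix(data)
coverage_ratio = len(rows) * len(cols) / data.size
print(f"{len(rows)} rows x {len(cols)} cols, coverage {coverage_ratio:.2f}")
# Output: 119 rows x 6 cols, coverage 0.85
```

On this matrix the exhaustive optimum drops row 95 and column 3, which matches the 0.85 coverage reported by the example above.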

solve(X: ndarray | DataFrame, return_data: bool = False) → Tuple[ndarray, ndarray] | Tuple[Index, Index]

Finds the rows and columns to retain so that all NaNs are removed from a 2D array or DataFrame while keeping as many cells as possible.

Parameters:

X (Union[np.ndarray, pd.DataFrame]) – The input 2D array or DataFrame with NaN values.
return_data (bool) – If True, returns the resulting data; otherwise, returns the indices.

Returns:

If return_data is True, returns the resulting 2D array or DataFrame; otherwise, returns the indices of rows and columns to retain.

Return type:

Union[Tuple[np.ndarray, np.ndarray], Tuple[pd.Index, pd.Index]]
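
When return_data is False, the returned indices have to be applied to the input by hand. The sketch below shows one way to do that for a NumPy array with np.ix_. To keep it runnable without the library, the rows and cols arrays are hand-built stand-ins for what solve would return on the example matrix. Whether this produces exactly the same object as return_data=True is an assumption, not something this page confirms.

```python
import numpy as np

data = np.zeros(shape=(120, 7))
data[24:72, 3] = np.nan
data[95, :5] = np.nan

# Stand-in for `rows, cols = OptiMask().solve(data)`:
# keep every row except 95 and every column except 3
rows = np.delete(np.arange(data.shape[0]), 95)
cols = np.delete(np.arange(data.shape[1]), 3)

# np.ix_ builds an open mesh so rows and columns are selected in one step
submatrix = data[np.ix_(rows, cols)]

print(submatrix.shape, np.isnan(submatrix).any())
# Output: (119, 6) False
```

For a DataFrame input, where solve returns pd.Index objects, label-based selection such as df.loc[rows, cols] should play the same role.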

Raises:

ValueError – If the input is neither a NumPy array nor a pandas DataFrame, if an input NumPy array does not have ndim == 2, or if the OptiMask algorithm encounters an error during optimization.
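
Since inputs such as nested lists or 1D arrays are rejected with a ValueError, it can help to normalize data before calling solve. The following is a hedged sketch of such a pre-check. The helper as_2d_float_array is hypothetical and not part of optimask; it only mirrors the array-related conditions listed above. DataFrames should be passed to solve directly, since converting them would discard their labels.

```python
import numpy as np


def as_2d_float_array(X):
    """Hypothetical helper (not part of optimask).

    Coerces list-like input to a float ndarray and fails early
    if the result is not two-dimensional.
    """
    arr = np.asarray(X, dtype=float)
    if arr.ndim != 2:
        raise ValueError(f"expected a 2D input, got ndim={arr.ndim}")
    return arr


# A nested list becomes a 2D float array that keeps its NaN entries
nested = [[1.0, float("nan"), 3.0], [4.0, 5.0, 6.0]]
arr = as_2d_float_array(nested)
print(arr.shape, arr.dtype)

# A 1D input is rejected before it ever reaches the solver
try:
    as_2d_float_array([1.0, 2.0, 3.0])
except ValueError as exc:
    print(f"rejected: {exc}")
```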