RANSAC 随机采样一致

RANSAC是Random Sample Consensus（随机采样一致）的缩写。

RANSAC是一种通用算法，可以与其他参数估计方法一起使用，当数据中存在大量离群（outilers，即异常值）时，估计拟合的内群（inliers）的数学模型参数

RANSAC是一种迭代方法，也是一个非确定性算法，在某种意义上说，它会产生一个在一定概率下合理的结果，而更多次的迭代会使这一概率增加。此RANSAC算法在1981年由Fischler和Bolles首次提出，随后产生了许多变体，并在机器视觉、模型估计（Model Fitting）等领域有着巨大影响。

for a better world without outliers...

Introduction 介绍

一般的参数估计算法基于数据点中的噪声（离群outliers）服从零均值的高斯分布，因此，在对所有数据点进行平均的参数优化，就可以对噪声有很大的抑制，线性最小二乘（Linear least squares），便是这样的思想。

然而显而易见，这个假设在存在大量误差时，是不成立的，服从其他分布的噪声、严重的噪声值、采样过程中错误的测量值等等，都可能导致这样的大量误差。这样的噪声在我们进行参数估计时，应该被排除在外，以便我们获得更好的模型参数的估计值。

如下图所示：

Overview 概述

Motivation

RANSAC就是解决这种问题一个算法，或者说一种思想，通过排除可能存在的错误数据，来估计模型参数。

Assumptions

RANSAC假设数据由可以用模型解释的内群（inliers）组成，而离群（outliers）是完全不适合模型的错误数据。因此，在训练模型时使用离群（outliers）会增加我们的最终预测误差，因为它们几乎不包含有关模型的信息。

此外，RANSAC仅用内群（inliers）训练参数模型，而忽略离群（inliers）。

将数据分离为内群（inliers）和离群（outliers）将是一个有力的假设，它也是RANSAC与其他方法的主要区别。另外，即使这个假设对于数据集不成立，也不会影响RANSAC对模型的参数估计，因为在这种情况下，它将整个数据集考虑为内群（inliers），然后估计模型参数。

RANSAC使用一小部分样本来估计模型参数，而不是使用所有的数据，虽然这是一个软性假设（soft assumption），然而对大多数情况都适用，例如拟合一个直线模型，两个数据样本就足够了。

Method

RANSAC均匀随机地选择一个数据样本子集并用它来估计模型参数，然后选择数据集中，在生成模型的容错范围内的样本。

这些样本被认为与生成的模型一致，并被称为所选数据样本的一致集（consensus set），即一致集（consensus set）中的数据样本为内群（inlier），其余为离群（outliers）。

如果一致集（consensus set）中的样本数量足够高，就可以使用来更新模型参数。

重复对这个过程进行多次迭代，更新生成模型参数，最终返回误差最小的模型。

Algorithm 算法

Prerequisites

实际上，RANSAC并不是一个完整的可自我工作算法。实际上，在存在离群（outliers）的情况下，它使用模型及其距离函数，鲁棒地估计模型参数。RANSAC只是在数据集上应用离群点分离，并根据最优化方法和模型的距离函数，来使用内群（inliers）估计模型参数。

因此，在使用RANSAC之前，必须确定模型，距离函数和优化算法。不过，这也使得RANSAC完全模块化，允许它与任何估算模型，任何距离函数和任何优化方法一起工作。从这个角度说，RANSAC也可以被认为是一种思想。

Flow

RANSAC算法流程的一般描述如下所示：

Given:

data - a set of observed data points
model - a model that can be fitted to data points
n - the minimum number of data values required to fit the model
k - the maximum number of iterations allowed in the algorithm
t - a threshold value for determining when a data point fits a model
d - the number of close data values required to assert that a model fits well to data

Return:

bestfit - model parameters which best fit the data (or nil if no good model is found)

RUN

iterations = 0
bestfit = nil
besterr = something really large

while iterations < k
    maybeinliers = n randomly selected values from data
    maybemodel = model parameters fitted to maybeinliers
    alsoinliers = empty set
    
    for every point in data not in maybeinliers
        if point fits maybemodel with an error smaller than t
            add point to alsoinliers
    
    if the number of elements in alsoinliers is > d
    % this implies that we may have found a good model
    % now test how good it is
        bettermodel = model parameters fitted to all points in maybeinliers and alsoinliers
        thiserr = a measure of how well model fits these points
    
        if thiserr < besterr
            bestfit = bettermodel
            besterr = thiserr
    
    increment iterations
return bestfit

The algorithm flow is referenced from Wikipedia (http://en.wikipedia.org/wiki/RANSAC)

Parameters

像在算法流程开始时所描述的那样，RANSAC需要一些预定义参数来运行：

n - 采样子集的数目
t - 容错阈值
d - 一致集的最小数目
k - 迭代次数
w - 数据集中的内群inliers（w）的比例也是很重要。

由于RANSAC是一个随机算法，为了增加找到最优模型的概率，同时仍然保持随机化算法对于穷举确定性算法的计算优势，一个合适的预定义参数是至关重要的。

这里有一些启发式来计算这些参数：

w - 数据集中的内群inliers的比例

虽然RANSAC不是算法中的一个直接参数，但是在算法参数的计算中使用了内群（inliers）的比例，甚至可以隐式地影响算法的复杂度。因此，在运行RANSAC之前，有一些关于数据集中离群（outliers）的比例的信息是有益的。

n - 采样子集的数目

采样子集的大小是RANSAC在每次迭代时对初始模型随机选择的样本数。RANSAC使用定义模型所需的最小样本数作为样本子集大小。

例如为了拟合线性模型，n为2即可，或拟合一个圆形模型，RANSAC选择3个数据样本作为3个点就足以定义一个圆。

$n\equiv minimum\: number\: of\: samples\: to\: define\: model$

t - 容错阈值

RANSAC使用容错阈值来确定数据样本是否与生成模型一致。低于阈值下的样本将形成该模型的一致集，如果找到了正确的模型，那么这将是数据集的内群（inliers）。因此，应根据内群（inliers）的高斯误差（gaussian error）来选择。

$t\approx gaussian\: error\: in\: inliers$

d - 一致集的最小数目

最小一致集阈值是为了生成该迭代的最终模型而被接受为一致集的最小样本数量。最小一致集阈值应等于或小于数据集里的内群（inliers）的数目

$d\approx w\cdot \left | DataSet \right |$

|DataSet|表示数据集中的所有样本数

k - 迭代次数

如下计算以某个概率成功运行的预期迭代次数：

选择一个样本是内群（inliers）的可能性： $P(inlier)= w$

选择的样本子集没有离群（outliers）的概率： $P(subset\: with\: no\: outliers) = w^n$

选择的样本子集有离群（outliers）的概率： $P(subset\: with\: outliers) = 1-w^n$

k次迭代选择的样本子集有离群（outliers）的概率： $P(k subset\: with\: outliers) = (1-w^n)^k$

运行算法失败的概率： $P(fail) = (1-w^n)^k$

运行算法成功的概率： $P(success) = 1-(1-w^n)^k$

预期的迭代次数： $k=\frac{log(1-P(success))}{log(1-w^n)}$

举个例子，直线拟合问题的RANSAC的迭代计算次数：

$For \left | DataSet \right | = 12,\: n = 2\: and\: P(success)=0.99$

$w = 0.95 \Rightarrow k=2$

$w = 0.75 \Rightarrow k=6$

$w = 0.5 \Rightarrow k=17$

如果我们希望每次都可以成功拟合的话：

$k = \left ( \frac{\left | DataSet \right |}{2} \right )=\frac{\left | DataSet \right |\cdot \left | DataSet \right |-1}{2}=66$

Conclusion 总结

优点

非常的鲁棒性
可应用于大多数情形的一般方法
快速处理大量数据
易于实施
模块化
可解释性

缺点

随机性算法
需要有关数据的先验知识
迭代次数以离群百分比为对数增加
参数选择

Python实现

代码见：GitHub

对于RANSAC实现，我们先需要定义一个拟合模型，一个距离函数，一个拟合函数

这里我们拟合一个直线模型，在距离函数上使用绝对距离，拟合函数利用np.linalg.lstsq最小二乘法

class Model(object):
    def fit(self, data):
        raise NotImplementedError
        
    def distance(self, data):
        raise NotImplementedError

class LineModel(Model):
    """
    A 2D line model.
    """
    def __init__(self):
        self.params = None
        self.dists = None
        
    def fit(self, data):
        """
        Fits the model to the data, minimizing the sum of absolute errors.
        """
        X = data[:,0]
        Y = data[:,1]
        A = np.vstack([X, np.ones(len(X))]).T
        k, m = np.linalg.lstsq(A, Y, None)[0]
        self.params = [k, m]
        self.residual = sum(abs(k * X + m - Y))
        
    def distance(self, samples):
        """
        Calculates the vertical distances from the samples to the model.
        """
        X = samples[:,0]
        Y = samples[:,1]
        k = self.params[0]
        m = self.params[1]
        dists = abs(k * X + m - Y)
        
        return dists

接下来，根据算法流程，完成RANSAC算法

def ransac(data, model, min_samples, min_inliers, iterations=100, eps=1e-10, random_seed=42):
    """
    Fits a model to observed data.
    
    Uses the RANSC iterative method of fitting a model to observed data.
    """
    random.seed(random_seed)
    
    if len(data) <= min_samples:
        raise ValueError("Not enough input data to fit the model.")
        
    if 0 < min_inliers and min_inliers < 1:
        min_inliers = int(min_inliers * len(data))
        
    best_params = None
    best_inliers = None
    best_residual = np.inf
    best_iteration = None
    
    for i in range(iterations):
        indices = list(range(len(data)))
        random.shuffle(indices)
        inliers = np.asarray([data[i] for i in indices[:min_samples]])
        shuffled_data = np.asarray([data[i] for i in indices[min_samples:]])
        
        try:
            model.fit(inliers)
            dists = model.distance(shuffled_data)
            more_inliers = shuffled_data[np.where(dists <= eps)[0]]
            inliers = np.concatenate((inliers, more_inliers))
            
            if len(inliers) >= min_inliers:
                model.fit(inliers)
                if model.residual < best_residual:
                    best_params = model.params
                    best_inliers = inliers
                    best_residual = model.residual
                    best_iteration = i
                    
        except ValueError as e:
            print(e)
        
    if best_params is None:
        raise ValueError("RANSAC failed to find a sufficiently good fit for the data.")
    else:
        return (best_params, best_inliers, best_residual, best_iteration)

测试一下，我们实现的RANSAC效果：

import numpy as np
import matplotlib.pyplot as plt

import random
import Ransac
import time

num_iterations = 500
num_samples = 2000
noise_ratio = 0.9
num_noise = int(noise_ratio * num_samples)

def setup():
    global ax1
    X = np.asarray(range(num_samples))
    Y = 0.5 * X
    noise = [random.randint(0, 2 * (num_samples - 1)) for i in range(num_noise)]
    Y[random.sample(range(len(Y)), num_noise)] = noise
    data = np.asarray([X, Y]).T
    model = Ransac.LineModel()
    plt.figure(figsize=(10,5))
    ax1 = plt.subplot(1,2,1)
    plt.plot(X, Y, 'bx')
    
    return data, model

def run(data, model):
    random_seed = random.randint(0,100)
    start_time = time.time()
    (params, inliers, residual, iterations) = Ransac.ransac(data, model, 2, (1 - noise_ratio) * num_samples, num_iterations, eps=0.1, random_seed=random_seed)
    end_time = time.time()
    mean_time = (end_time - start_time) / num_iterations
    return params, residual, mean_time, iterations
    
    
def summary(params, residual, mean_time, iterations):
    print(" Paramters ".center(40, '='))
    print(params)
    print(" Residual ".center(40, '='))
    print(residual)
    print(" Iteration ".center(40, '='))
    print(iterations, "iteration finds the best params")
    print(" Time ".center(40, '='))
    print("%.1f msecs mean time spent per iteration" % (1000 * mean_time))
    X = np.asanyarray([0, num_samples - 1])
    Y = params[0] * X + params[1]
    plt.subplot(1,2,2, sharey = ax1)
    plt.plot(X, Y, 'g-', linewidth=2)
    
    plt.show()
    

data, model = setup()
try:
    params, residual, mean_time, iterations = run(data, model)
    summary(params, residual, mean_time, iterations)
except ValueError as e:
    print(e)

============== Paramters ===============
[0.5, -2.9994712862791754e-15]
=============== Residual ===============
1.0658141036401503e-14
============== Iteration ===============
404 iteration finds the best params
================= Time =================
3.3 msecs mean time spent per iteration

RANSAC

目录