Mr Bun's Blog

Linear Regression with One Variable

Introduction

This week, the machine learning course introduced an example of linear regression with one variable. The example was clear, and it helped me understand several machine learning terms and concepts.

Linear Model

Suppose we have a data set with two training examples, each consisting of one feature value x and one target value y, as shown below:

import numpy as np

# Load our data set
x_train = np.array([1.0, 2.0])   # features
y_train = np.array([300.0, 500.0])   # target values

The model is a linear function of the input:

$$f_{w,b}(x^{(i)}) = wx^{(i)} + b$$

where $w$ and $b$ are the parameters of the model. The goal is to find values of $w$ and $b$ that fit the training data, so that the prediction for each input $x^{(i)}$ in the training set deviates as little as possible from the actual value $y^{(i)}$.
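
To make the idea of "fitting" concrete, here is a minimal sketch that evaluates the model on the training data for two candidate parameter sets. The helper name `predict` and the candidate values of w and b are assumptions chosen for illustration; they are not part of the course code.

```python
import numpy as np

# Training data from the post
x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])

def predict(x, w, b):
    """Hypothetical helper: evaluate f_wb(x) = w * x + b for every input in x."""
    return w * x + b

# Two candidate parameter sets, chosen only for illustration
pred_poor = predict(x_train, w=100.0, b=100.0)
pred_good = predict(x_train, w=200.0, b=100.0)

print(pred_poor - y_train)  # errors of the first candidate
print(pred_good - y_train)  # errors of the second candidate
```

The first candidate misses both targets, while the second reproduces them, which is the sense in which one set of parameters fits the data better than another.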

To achieve this, a cost function is introduced.

Cost Function

$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2$$

Each error (predicted value minus actual value) is squared, and the squared errors are summed over all $m$ training examples. The factor of 2 in the denominator is a convenience: it cancels the 2 that appears when the squared term is differentiated. The goal is to find the $w$ and $b$ that make the cost $J(w,b)$ as small as possible.

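# Function to calculate the cost
def compute_cost(x, y, w, b):
    """Computes the cost J(w,b) for linear regression over m examples."""
    m = x.shape[0]  # number of training examples
    cost = 0

    for i in range(m):
        f_wb = w * x[i] + b              # prediction for example i
        cost = cost + (f_wb - y[i])**2   # accumulate the squared error
    total_cost = 1 / (2 * m) * cost

    return total_cost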
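
As a quick sanity check of the cost formula, the sketch below computes the same quantity with NumPy array operations instead of a loop. The name `compute_cost_vectorized` is hypothetical, and the two parameter settings are picked only because their costs are easy to verify by hand.

```python
import numpy as np

def compute_cost_vectorized(x, y, w, b):
    """Hypothetical vectorized variant of the loop-based cost function."""
    m = x.shape[0]
    errors = w * x + b - y            # prediction minus target, per example
    return np.sum(errors ** 2) / (2 * m)

x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])

cost_at_zero = compute_cost_vectorized(x_train, y_train, w=0.0, b=0.0)
cost_at_fit = compute_cost_vectorized(x_train, y_train, w=200.0, b=100.0)

print(cost_at_zero)  # (300**2 + 500**2) / 4 = 85000.0
print(cost_at_fit)   # 0.0, the line passes through both points
```
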

Gradient Descent

Gradient descent is described as:

$$
\begin{align*}
\text{repeat}&\text{ until convergence:} \; \lbrace \newline
\; w &= w - \alpha \frac{\partial J(w,b)}{\partial w} \; \newline
b &= b - \alpha \frac{\partial J(w,b)}{\partial b} \newline
\rbrace
\end{align*}
$$

where the parameters $w$ and $b$ are updated simultaneously: both partial derivatives are computed from the current values of $w$ and $b$ before either parameter is changed. $\alpha$ is the learning rate, a hyperparameter that controls the size of the step taken in the direction opposite to the gradient. $\frac{\partial J(w,b)}{\partial w}$ and $\frac{\partial J(w,b)}{\partial b}$ are the partial derivatives of the cost function with respect to the parameters $w$ and $b$ respectively.
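
The effect of the learning rate can be seen in a small experiment. The following is a rough sketch, not the course implementation: the helper names `cost`, `step`, and `run` are made up for illustration, the two values of alpha are arbitrary choices, and the gradient expressions inside `step` are the ones given in the next formulas of this post.

```python
import numpy as np

x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])

def cost(w, b):
    """Cost J(w,b) on the training data."""
    return np.mean((w * x_train + b - y_train) ** 2) / 2

def step(w, b, alpha):
    """One simultaneous update: both gradients are computed from the old w and b."""
    error = w * x_train + b - y_train
    dj_dw = np.mean(error * x_train)
    dj_db = np.mean(error)
    return w - alpha * dj_dw, b - alpha * dj_db

def run(alpha, num_steps=20):
    """Take num_steps gradient steps from w = b = 0 and return the final cost."""
    w, b = 0.0, 0.0
    for _ in range(num_steps):
        w, b = step(w, b, alpha)
    return cost(w, b)

initial_cost = cost(0.0, 0.0)
cost_small_alpha = run(alpha=0.01)  # small steps: the cost goes down
cost_large_alpha = run(alpha=0.8)   # steps too large: the cost blows up
print(initial_cost, cost_small_alpha, cost_large_alpha)
```

With the small learning rate the cost ends below its starting value, while with the large one each step overshoots the minimum and the cost grows instead of shrinking.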

The gradient is defined as:

$$
\begin{align}
\frac{\partial J(w,b)}{\partial w} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})x^{(i)}\\
\frac{\partial J(w,b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})
\end{align}
$$
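
For readers who want to see where these expressions come from, here is a short derivation of the first one, obtained by applying the chain rule to the cost function defined earlier. The derivative with respect to $b$ follows the same steps, except that the inner derivative is 1 instead of $x^{(i)}$.

```latex
\begin{align*}
\frac{\partial J(w,b)}{\partial w}
  &= \frac{\partial}{\partial w} \left[ \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (wx^{(i)} + b - y^{(i)})^2 \right] \\
  &= \frac{1}{2m} \sum\limits_{i = 0}^{m-1} 2\,(wx^{(i)} + b - y^{(i)})\,x^{(i)} \\
  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})\,x^{(i)}
\end{align*}
```

The 2 produced by differentiating the square cancels the 2 in the denominator of the cost function, which is why the gradient carries the factor $\frac{1}{m}$ rather than $\frac{1}{2m}$.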

The gradient descent algorithm is used to find the optimal values of the parameters $w$ and $b$ that minimize the cost function $J(w,b)$. compute_gradient implements the gradient computation for linear regression.

def compute_gradient(x, y, w, b):
    """
    Computes the gradient for linear regression
    Args:
      x (ndarray (m,)): Data, m examples
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters
    Returns:
      dj_dw (scalar): The gradient of the cost w.r.t. the parameter w
      dj_db (scalar): The gradient of the cost w.r.t. the parameter b
    """

    # Number of training examples
    m = x.shape[0]
    dj_dw = 0
    dj_db = 0

    for i in range(m):
        f_wb = w * x[i] + b
        dj_dw_i = (f_wb - y[i]) * x[i]
        dj_db_i = f_wb - y[i]
        dj_db += dj_db_i
        dj_dw += dj_dw_i
    dj_dw = dj_dw / m
    dj_db = dj_db / m

    return dj_dw, dj_db
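
One common way to gain confidence in a gradient implementation is to compare it against a finite-difference approximation of the cost. The sketch below does this with central differences. The helper names, the test point (w = 50, b = 10), and the step size `eps` are all assumptions made for this illustration.

```python
import numpy as np

x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])

def cost(w, b):
    """Cost J(w,b) on the training data."""
    return np.sum((w * x_train + b - y_train) ** 2) / (2 * x_train.shape[0])

def analytic_gradient(w, b):
    """Gradient from the closed-form partial derivatives."""
    error = w * x_train + b - y_train
    return np.mean(error * x_train), np.mean(error)

def numerical_gradient(w, b, eps=1e-4):
    """Central-difference approximation of the same two partial derivatives."""
    dj_dw = (cost(w + eps, b) - cost(w - eps, b)) / (2 * eps)
    dj_db = (cost(w, b + eps) - cost(w, b - eps)) / (2 * eps)
    return dj_dw, dj_db

w_test, b_test = 50.0, 10.0  # arbitrary point, chosen only for the check
analytic = analytic_gradient(w_test, b_test)
numeric = numerical_gradient(w_test, b_test)
print(analytic, numeric)
```

If the two results disagree by more than rounding error, the analytic gradient (or the cost function) most likely contains a mistake.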

Then the gradient_descent function is implemented to update the parameters $w$ and $b$ using the gradient descent algorithm.

import copy
import math

def gradient_descent(x, y, w_in, b_in, alpha, num_iters, cost_function, gradient_function):
    """
    Performs gradient descent to fit w,b. Updates w,b by taking
    num_iters gradient steps with learning rate alpha

    Args:
      x (ndarray (m,))  : Data, m examples
      y (ndarray (m,))  : target values
      w_in,b_in (scalar): initial values of model parameters
      alpha (float):     Learning rate
      num_iters (int):   number of iterations to run gradient descent
      cost_function:     function to call to produce cost
      gradient_function: function to call to produce gradient

    Returns:
      w (scalar): Updated value of parameter after running gradient descent
      b (scalar): Updated value of parameter after running gradient descent
      J_history (List): History of cost values
      p_history (list): History of parameters [w,b]
      """

    # Lists to store the cost J and the parameters [w, b] at each iteration, primarily for graphing later
    J_history = []
    p_history = []
    b = b_in
    w = copy.deepcopy(w_in)  # avoid modifying the caller's w_in

    for i in range(num_iters):
        # Calculate the gradient and update the parameters using gradient_function
        dj_dw, dj_db = gradient_function(x, y, w , b)

        # Update parameters using the gradient descent update rule above
        b = b - alpha * dj_db
        w = w - alpha * dj_dw

        # Save cost J at each iteration
        if i<100000:      # prevent resource exhaustion
            J_history.append( cost_function(x, y, w , b))
            p_history.append([w,b])
        # Print the cost at 10 evenly spaced intervals, or at every iteration if num_iters < 10
        if i% math.ceil(num_iters/10) == 0:
            print(f"Iteration {i:4}: Cost {J_history[-1]:0.2e} ",
                  f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e}  ",
                  f"w: {w: 0.3e}, b:{b: 0.5e}")

    return w, b, J_history, p_history  # final parameters plus cost and parameter histories for graphing

# initialize parameters
w_init = 0
b_init = 0
# some gradient descent settings
iterations = 10000
tmp_alpha = 1.0e-2
# run gradient descent
w_final, b_final, J_hist, p_hist = gradient_descent(x_train ,y_train, w_init, b_init, tmp_alpha, iterations, compute_cost, compute_gradient)
print(f"(w,b) found by gradient descent: ({w_final:8.4f},{b_final:8.4f})")

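Finally, gradient descent finds values of the parameters $w$ and $b$ that approximately minimize the cost function $J(w,b)$. For this data set the cost reaches its minimum of 0 at $w = 200$, $b = 100$, because the line $200x + 100$ passes through both training points, so the printed values should be close to these.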
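
As an independent cross-check on the result of gradient descent, the same line can be fitted directly by least squares. This sketch uses NumPy's `np.polyfit`, which is not part of the course code and is used here only as a reference solution.

```python
import numpy as np

x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])

# np.polyfit returns coefficients from highest degree to lowest,
# so for a degree-1 fit the result is [slope, intercept], i.e. [w, b].
w_exact, b_exact = np.polyfit(x_train, y_train, deg=1)
print(f"least-squares fit: w = {w_exact:.4f}, b = {b_exact:.4f}")
```

The parameters found by gradient descent should agree with this direct fit to within a small tolerance; a large gap would suggest too few iterations or an unsuitable learning rate.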

