Week 2 - Logistic Regression as a Neural Network

Binary Classification

  • 1 (cat) vs 0 (non-cat)
  • Example: a cat image
    • the image is stored as three 64 x 64 matrices of pixel intensities, one each for the Red, Green, and Blue channels
    • input feature vector x: unroll all pixel values into a single column, giving dimension n_x = 64 x 64 x 3 = 12288, i.e. a (12288, 1) matrix
  • Notation
    • (x, y) : a single training example, where x is an n_x-dimensional feature vector and y is the label, 0 or 1
    • m training examples : {(x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})}
    • X : the n_x (feature dimension) by m (# of training examples) matrix whose columns are the training examples, stacked as in the sketch after this list
    • Y : the 1 by m matrix [ y^{(1)}, y^{(2)}, ..., y^{(m)} ]
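
A minimal numpy sketch of this unrolling, assuming the images arrive as a hypothetical (m, 64, 64, 3) array; the random values are placeholders and only the shapes matter:

```python
import numpy as np

m = 5
images = np.random.rand(m, 64, 64, 3)   # hypothetical batch of m RGB images

# Unroll each image into an n_x = 64*64*3 = 12288 column vector,
# then stack the columns so X has shape (n_x, m)
X = images.reshape(m, -1).T
print(X.shape)    # (12288, 5)

# Labels Y as a (1, m) row vector of 0/1
Y = np.array([[1, 0, 1, 1, 0]])
print(Y.shape)    # (1, 5)
```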

Logistic Regression

  • Given x, want \hat y = P(y = 1 \mid x)
  • \hat y = \sigma(w^T x + b), where \sigma(z) = \frac{1}{1 + e^{-z}} is the sigmoid function (see the sketch after this list)
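
A minimal sketch of this forward computation in numpy, with hypothetical zero-initialized parameters and random inputs just to show the shapes:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

n_x, m = 12288, 5
w = np.zeros((n_x, 1))        # weight vector, shape (n_x, 1)
b = 0.0                       # bias, a scalar
X = np.random.rand(n_x, m)    # m feature-vector columns (placeholder values)

# y_hat = sigma(w^T X + b) has shape (1, m): one probability per example
y_hat = sigmoid(np.dot(w.T, X) + b)
print(y_hat.shape)            # (1, 5)
```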

Logistic Regression Cost Function

  • Given {(x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})}, want \hat y^{(i)} \approx y^{(i)}
  • Loss (error) function, measured on a single example
    • the squared error L(\hat y, y) = \frac 1 2 (\hat y - y)^2 makes the resulting optimization non-convex, so it is not used
    • instead, use the cross-entropy loss L(\hat y, y) = -(y \log \hat y + (1 - y) \log(1 - \hat y))
  • Cost function, averaged over the whole training set (see the sketch after this list)
    • J(w, b) = \frac 1 m \sum_{i=1}^m L(\hat y^{(i)}, y^{(i)})
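
A minimal sketch of the cost computation under these definitions; the y and y_hat values below are made up for illustration:

```python
import numpy as np

def cross_entropy_cost(y_hat, y):
    # J = (1/m) * sum of -(y*log(y_hat) + (1-y)*log(1-y_hat)) over the m examples
    m = y.shape[1]
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return np.sum(losses) / m

y     = np.array([[1,   0,   1  ]])
y_hat = np.array([[0.9, 0.2, 0.6]])
print(cross_entropy_cost(y_hat, y))   # ~0.28
```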

Gradient Descent

  • Want to find w, b that minimize J(w, b)
  • In the single-parameter case J(w): repeat { w := w - \alpha \frac{dJ(w)}{dw} } until convergence, where \alpha is the learning rate (see the sketch after this list)
  • use \partial instead of d when there is more than one parameter, e.g. w := w - \alpha \frac{\partial J(w,b)}{\partial w} and b := b - \alpha \frac{\partial J(w,b)}{\partial b}
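
A minimal sketch of the update rule on a toy one-parameter function J(w) = (w - 3)^2, chosen only because its derivative is easy to write down; it is not the logistic regression cost:

```python
# Gradient descent on J(w) = (w - 3)**2, whose derivative is dJ/dw = 2*(w - 3)
w = 0.0
alpha = 0.1             # learning rate
for _ in range(100):
    dw = 2 * (w - 3)    # dJ(w)/dw
    w = w - alpha * dw  # w := w - alpha * dJ/dw
print(w)                # converges toward 3, the minimizer of J
```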

Derivatives

  • the derivative of a function at a point is the slope of the function at that point: how much the output changes per unit change in the input
  • obtained analytically by differential calculus, or checked numerically by nudging the input a tiny amount (see the sketch after this list)
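
A minimal numerical check of this idea on f(a) = a^2, whose analytic derivative at a = 3 is 2a = 6; the function and step size are chosen for illustration:

```python
def f(a):
    return a ** 2

a, eps = 3.0, 1e-6
slope = (f(a + eps) - f(a)) / eps   # rise over run for a tiny nudge eps
print(slope)                        # ~6.0, matching the analytic derivative 2a
```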

Derivatives with a Computation Graph

  • back propagation computes the derivative of the final output with respect to each intermediate variable by walking the computation graph from right to left
  • it relies on the chain rule
  • da, db, etc. below are Python variable names standing for dJ/da, dJ/db, etc. (see the sketch after this list)
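
A minimal sketch of a forward and backward pass on a small computation graph, here J = 3(a + bc); the graph and input values are chosen purely to illustrate the dvar naming convention:

```python
# Forward pass through the graph J = 3 * (a + b * c)
a, b, c = 5.0, 3.0, 2.0
u = b * c          # u = bc
v = a + u          # v = a + u
J = 3 * v          # J = 3v

# Backward pass: apply the chain rule from right to left
dv = 3.0           # dJ/dv
da = dv * 1.0      # dJ/da = dJ/dv * dv/da
du = dv * 1.0      # dJ/du = dJ/dv * dv/du
db = du * c        # dJ/db = dJ/du * du/db
dc = du * b        # dJ/dc = dJ/du * du/dc
print(J, da, db, dc)   # 33.0 3.0 6.0 9.0
```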

Logistic Regression Gradient Descent
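
For the cross-entropy loss above, the chain rule gives dL/dz = a - y for a single example (where a = \hat y = \sigma(z)), and then dL/dw_j = x_j dz and dL/db = dz. A minimal numpy sketch with a hypothetical 3-feature example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One training example x (shape (n_x, 1)) with label y; values are made up
x = np.array([[1.0], [2.0], [0.5]])
y = 1
w = np.zeros((3, 1))
b = 0.0

# forward pass
z = np.dot(w.T, x) + b    # shape (1, 1)
a = sigmoid(z)            # a = y_hat

# backward pass: derivatives of the loss L(a, y)
dz = a - y                # dL/dz
dw = x * dz               # dL/dw_j = x_j * dz, shape (n_x, 1)
db = dz                   # dL/db
```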

Gradient Descent on m Examples

  • J(w, b) = \frac 1 m \sum_{i=1}^m L(\hat y^{(i)}, y^{(i)})
  • \frac{\partial}{\partial w_j} J(w, b) = \frac 1 m \sum_{i=1}^m \frac{\partial}{\partial w_j} L(\hat y^{(i)}, y^{(i)}), i.e. the overall gradient is the average of the per-example gradients (see the sketch after this list)
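
A minimal sketch of one gradient-descent step over all m examples; it uses the vectorized form dw = \frac 1 m X (A - Y)^T rather than an explicit for-loop, and the toy data, learning rate, and iteration count are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, b, X, Y, alpha):
    # X: (n_x, m) features, Y: (1, m) labels, w: (n_x, 1), b: scalar
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)   # (1, m) predictions
    dZ = A - Y                        # (1, m) per-example dL/dz
    dw = np.dot(X, dZ.T) / m          # (n_x, 1), average of x^(i) * dz^(i)
    db = np.sum(dZ) / m               # scalar, average of dz^(i)
    return w - alpha * dw, b - alpha * db

# toy data: n_x = 2 features, m = 4 examples
X = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 0.0, 1.0, 0.0]])
Y = np.array([[0, 0, 1, 1]])
w, b = np.zeros((2, 1)), 0.0
for _ in range(1000):
    w, b = gradient_step(w, b, X, Y, alpha=0.5)
```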