# Week 2 - Logistic Regression as a Neural Network

Binary Classification

• y = 1 (cat) vs y = 0 (non-cat)
• example: classify a cat image
• the image is stored as three 64 x 64 pixel matrices, one for each of the Red, Green, and Blue channels
• input feature vector: unroll all pixel values into one column, so the dimension is $n_x = 64 \times 64 \times 3 = 12288$ and x has shape (12288, 1) (see the numpy sketch after this list)
• Notation
• (x, y): a single training example, where x is an $n_x$-dimensional feature vector and y is the label, 0 or 1
• m training examples: $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$
• X: an $n_x \times m$ matrix (feature dimension by number of training examples), built by stacking the training examples as columns
• Y: $[y^{(1)}, y^{(2)}, \dots, y^{(m)}]$, a $1 \times m$ matrix of labels
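A minimal numpy sketch of these shapes, using random values as a stand-in for real images and labels:

```python
import numpy as np

# Toy stand-in for a 64 x 64 RGB image (pixel values in [0, 255]).
image = np.random.randint(0, 256, size=(64, 64, 3))

# Unroll into the (n_x, 1) = (12288, 1) input feature vector x.
x = image.reshape(-1, 1)
print(x.shape)            # (12288, 1)

# Stack m such column vectors side by side to get X of shape (n_x, m),
# and the labels as a (1, m) row vector Y.
m = 5
X = np.random.rand(64 * 64 * 3, m)          # placeholder training inputs
Y = np.random.randint(0, 2, size=(1, m))    # placeholder 0/1 labels
print(X.shape, Y.shape)   # (12288, 5) (1, 5)
```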

Logistic Regression

• Given x, want $\hat y = P(y = 1 \mid x)$
• $\hat y = \sigma(w^T x + b)$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function (sketch below)
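A small numpy sketch of this forward computation, assuming the shapes defined above and random placeholder data:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Predictions for a whole training set at once:
# w: (n_x, 1), b: scalar, X: (n_x, m)  ->  y_hat: (1, m)
n_x, m = 12288, 5
w = np.zeros((n_x, 1))
b = 0.0
X = np.random.rand(n_x, m)      # placeholder inputs
y_hat = sigmoid(w.T @ X + b)
print(y_hat.shape)              # (1, 5); all 0.5 here since w = 0, b = 0
```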

Logistic Regression Cost Function

• Given $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$, want $\hat y^{(i)} \approx y^{(i)}$
• Loss (error) function
• the squared error $L(\hat y, y) = \frac{1}{2}(\hat y - y)^2$ is not used here: with the sigmoid it makes the overall optimization problem non-convex, so gradient descent may get stuck in local optima
• so, use $L(\hat y, y) = -(y \log \hat y + (1 - y) \log(1 - \hat y))$ instead
• Cost function
• $J(w,b) = \frac{1}{m} \sum_{i=1}^m L(\hat y^{(i)}, y^{(i)})$ (numpy sketch after this list)
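A sketch of this cost in numpy, with a tiny hand-made example to sanity-check it:

```python
import numpy as np

def cost(y_hat, y):
    """Cross-entropy cost J, averaged over the m examples.

    y_hat, y: arrays of shape (1, m)."""
    m = y.shape[1]
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return np.sum(losses) / m

# Sanity check: confident, correct predictions give a small cost.
y = np.array([[1, 0]])
y_hat = np.array([[0.9, 0.1]])
print(cost(y_hat, y))   # ~0.105
```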

Gradient Descent

• Want to find w,b that minimize $J(w,b)$
• In the single-parameter case $J(w)$: Repeat { $w := w - \alpha \frac{dJ(w)}{dw}$ }, where $\alpha$ is the learning rate (toy run after this list)
• use $\partial$ instead of $d$ when there is more than one parameter, e.g. $\frac{\partial J(w,b)}{\partial w}$; b gets the analogous update $b := b - \alpha \frac{\partial J(w,b)}{\partial b}$
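A toy 1-D run of the update rule on the simple convex function $J(w) = (w - 3)^2$ (not the logistic-regression cost; it only demonstrates the mechanics):

```python
# Minimize J(w) = (w - 3)**2, whose derivative is dJ/dw = 2 * (w - 3).
alpha = 0.1    # learning rate
w = 0.0        # initial guess

for _ in range(100):
    dw = 2 * (w - 3)       # dJ/dw at the current w
    w = w - alpha * dw     # w := w - alpha * dJ/dw

print(w)   # ~3.0, the minimizer of J
```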

Derivatives

• the derivative of a function at a point is its slope: how much the output moves when you nudge the input a tiny amount
• e.g. for $f(a) = 3a$, $\frac{df(a)}{da} = 3$: nudging a from 2 to 2.001 moves $f(a)$ from 6 to 6.003 (numeric check below)
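A quick numeric check of that slope:

```python
# Estimate the slope of f(a) = 3a at a = 2 with a tiny nudge.
def f(a):
    return 3 * a

a, eps = 2.0, 0.001
slope = (f(a + eps) - f(a)) / eps
print(slope)   # ~3.0: f moved by ~0.003 when a moved by 0.001
```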

Derivatives with a Computation Graph

• back propagation: compute derivatives by walking the computation graph from right to left
• each step applies the chain rule
• da, db, etc. below are Python variable names: by convention, dvar stands for the derivative $\frac{dJ}{d\text{var}}$ (worked example after this list)
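A worked example on a small graph, assuming $J = 3v$, $v = a + u$, $u = bc$ with $a = 5$, $b = 3$, $c = 2$ (the numbers are just for illustration):

```python
# Forward pass through the graph J = 3 * (a + b * c).
a, b, c = 5.0, 3.0, 2.0
u = b * c       # u = 6
v = a + u       # v = 11
J = 3 * v       # J = 33

# Backward pass: apply the chain rule one node at a time.
# As in the notes, the variable dx holds dJ/dx.
dv = 3.0        # dJ/dv                   (J = 3v)
da = dv * 1.0   # dJ/da = dJ/dv * dv/da   (v = a + u)
du = dv * 1.0   # dJ/du = dJ/dv * dv/du
db = du * c     # dJ/db = dJ/du * du/db   (u = b * c)
dc = du * b     # dJ/dc = dJ/du * du/dc
print(da, db, dc)   # 3.0 6.0 9.0
```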

Logistic Regression Gradient Descent
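• forward pass for one example: $z = w^T x + b$, $a = \hat y = \sigma(z)$, then the loss $L(a, y)$
• going backward through this graph with the chain rule gives $dz = \frac{\partial L}{\partial z} = a - y$
• and from $z = w^T x + b$: $dw_j = \frac{\partial L}{\partial w_j} = x_j \, dz$ and $db = \frac{\partial L}{\partial b} = dz$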

Gradient Descent on m Examples

• $J(w,b) = \frac{1}{m} \sum_{i=1}^m L(\hat y^{(i)}, y^{(i)})$
• $\frac{\partial J(w,b)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^m \frac{\partial}{\partial w_j} L(\hat y^{(i)}, y^{(i)})$, i.e. the overall gradient is just the average of the per-example gradients (vectorized sketch below)
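Putting the pieces together, a vectorized numpy sketch of one gradient-descent step over all m examples; the shapes follow the notation above, and the toy data at the bottom is only there to exercise the function:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_step(w, b, X, Y, alpha):
    """One vectorized gradient-descent step over all m examples.

    X: (n_x, m) inputs, Y: (1, m) labels,
    w: (n_x, 1) weights, b: scalar bias, alpha: learning rate."""
    m = X.shape[1]
    A = sigmoid(w.T @ X + b)    # predictions y_hat, shape (1, m)
    dZ = A - Y                  # dL/dz for every example
    dw = (X @ dZ.T) / m         # dJ/dw, shape (n_x, 1)
    db = np.sum(dZ) / m         # dJ/db, scalar
    return w - alpha * dw, b - alpha * db

# Toy run on random data, just to exercise the step.
rng = np.random.default_rng(0)
X = rng.random((4, 10))
Y = rng.integers(0, 2, size=(1, 10))
w, b = np.zeros((4, 1)), 0.0
for _ in range(1000):
    w, b = gradient_descent_step(w, b, X, Y, alpha=0.5)
```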