#ai #mathematic #cheatsheet
# [[Epistemic status]]
#shower-thought #to-digest
# Changelog
```dataview
TABLE WITHOUT ID file.mtime AS "Last Modified" FROM [[#]]
SORT file.mtime DESC
LIMIT 3
```
# Related
# TODO
> [!TODO] TODO
> Computer vision
> [[Reinforcement Learning]]
> Conversational AI
> [[Alignment]]
> graphs https://neptune.ai/blog/graph-neural-network-and-some-of-gnn-applications
# Machine learning
| Category | Name | Formula | Interpretation | |
| --------------------------- | ------------------------------ | ------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --- |
| (Metric) Classification | Confusion matrix | | | |
| (Metric) Classification | Accuracy | $\frac{TP+TN}{TP+TN+FP+FN}$ | Overall performance of the model (see the Python sketch after the table) | |
| (Metric) Classification | Precision | $\frac{TP}{TP+FP}$ | How accurate the positive predictions are | |
| (Metric) Classification | Recall | $\frac{TP}{TP+FN}$ | Coverage of actual positive sample | |
| (Metric) Classification | F1 | $\frac{2TP}{2TP+FP+FN}$ | Hybrid metric useful for unbalanced classes | |
| (Metric) Classification | Specificity | $\frac{TN}{TN+FP}$ | Coverage of actual negative sample | |
| (Metric) Classification | True positive rate | $\frac{TP}{TP+FN}$ | Used to plot the receiver operating characteristic (ROC) curve; equivalent to recall | |
| (Metric) Classification | False positive rate | $\frac{FP}{TN+FP}$ | Used to plot the ROC curve | |
| (Metric) Classification | Area under the curve | | Area under the ROC curve | |
| (Loss) Classification | Cross entropy | | | |
| (Loss) Classification | Negative loglikelihood | | | |
| (Loss) Classification | Hinge loss | $\frac{1}{m}\sum_{i=1}^m\max(0, 1-y_if(x_i))$ | Margin-based loss (used by SVMs); robust to outliers and less sensitive to the distribution of the data | |
| (Loss) Classification | KL/JS divergence | | | |
| (Metric) Regression | Mean squared error | $\frac{1}{D}\sum_{i=1}^{D}(x_i-y_i)^2$ | | |
| (Metric) Regression | Mean absolute error | $\frac{1}{D}\sum_{i=1}^{D}\|x_i-y_i\|$ | | |
| (Metric) Regression | Root mean squared error | | | |
| (Metric) Regression | R square | | | |
| (Metric) Regression | Huber loss | | | |
| (Architecture) Transformer | Scaled dot-product attention | $Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$ | See the Python sketch after the table | |
| (Architecture) Transformer | Attention head | $head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)$ | | |
| (Architecture) Transformer | Multi-head attention | $MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O$ | | |
| (Metric) Similarity | Cosine | | | |
| (Metric) Similarity | Jaccard | | | |
| (Metric) Similarity | Pointwise mutual information | | | |
| (Metric) Ranking | Mean average precision | | | |
| (Metric) Clustering | Normalized mutual information | | | |
| (Metric) Language modelling | Perplexity | $2^{-\frac{1}{N}\sum_{i=1}^{N}\log_{2}(p(w_{i} \| w_{1} \ldots w_{i-1}))}$ | Perplexity is an exponentiation of the [[Entropy]]. It measures how well a language model predicts the next word in a sequence: the exponential of the average negative log-probability of each word given the preceding words (lower is better). | |
| (Loss) Regularization | L1 | | | |
| (Loss) Regularization | L2 | | | |
| Activation | Sigmoid | $\frac{1}{1+e^{-z}}$ | | |
| Activation | Tanh | $\frac{e^z-e^{-z}}{e^{-z}+e^z}$ | | |
| Activation | Softmax | $\frac{e^{z_i}}{\sum_{j=1}^{D}e^{z_j}}$ | | |
| Activation | ReLU | $\max(0,z)$ | | |
| Activation | Leaky ReLU | $\max(0.1z,z)$ | | |
| Activation | GELU | | | |
| Architecture | RNN | | | |
| Architecture | Transformer | | | |
| Architecture | Generative adversarial network | | | |
| Architecture | Variational auto-encoder | | | |
| Architecture | Graph convolutional network | | | |
| Architecture | Batch normalization | ![[Pasted image 20221102103029.png]] | **Batch normalization** is a method used to make training faster and more stable through normalization of the layers' inputs by re-centering and re-scaling | |
| Hack | Pooling | | | |
| Hack | Average pooling | | | |
| Hack | Max pooling | | | |
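A minimal NumPy sketch of the binary-classification metrics in the table (confusion-matrix counts, accuracy, precision, recall, F1, specificity, FPR); the `y_true`/`y_pred` arrays are made-up toy data, not from any real evaluation.

```python
import numpy as np

# Toy binary labels and predictions (illustrative only).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Confusion-matrix counts.
tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

accuracy    = (tp + tn) / (tp + tn + fp + fn)  # overall performance
precision   = tp / (tp + fp)                   # how accurate the positive predictions are
recall      = tp / (tp + fn)                   # coverage of actual positives (= TPR)
f1          = 2 * tp / (2 * tp + fp + fn)      # harmonic mean of precision and recall
specificity = tn / (tn + fp)                   # coverage of actual negatives
fpr         = fp / (tn + fp)                   # used for the ROC curve
```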
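And a minimal NumPy sketch of the scaled dot-product attention formula from the table; the shapes, random inputs, and single-head setting are illustrative assumptions (multi-head attention would run several such heads in parallel, concatenate them, and project with $W^O$).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

# Illustrative shapes: sequence length 4, model dimension 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 8)
```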
## TODO put in the table
### Vectors & Geometry of Space
Distance between two points:
$d = \sqrt{(x_1-x_0)^2 + (y_1-y_0)^2 + (z_1 - z_0)^2}$
Equation of a sphere:
$R^2 = (x_1-x_0)^2 + (y_1-y_0)^2 + (z_1 - z_0)^2$
Unit vector:
$\mathbf{u} = \frac{\bf{a}}{|\bf{a}|}$
Dot Product:
$\bf{a}\cdot\bf{b}=\|\bf{a}\|\ \|\bf{b}\|\cos(\theta)$
Scalar projection of $\bf{a}$ onto $\bf{b}$:
$\text{comp}_{\bf{b}}\bf{a}=|\bf {a} |\cos \theta =|\bf {a} |{\frac {\bf {a} \cdot \bf {b} }{|\bf {a} |\,|\bf {b} |}}={\frac {\bf {a} \cdot \bf {b} }{|\bf {b} |}}$
Vector projection of $\bf{a}$ onto $\bf{b}$:
$\text{proj}_{\bf{b}}\bf{a}=a_{1}\bf {\hat {b}} ={\frac {\bf {a} \cdot \bf {b} }{|\bf {b} |}}{\frac {\bf {b} }{|\bf {b} |}}$
Component of $\bf{a}$ orthogonal to $\bf{b}$ (vector rejection):
$\bf{v} = \bf{a} - \text{proj}_{\bf{b}}\bf{a}$
Magnitude of the cross product:
$\|\bf {a} \times \bf {b}\| =\left\|\bf {a} \right\|\left\|\bf {b} \right\|\sin(\theta )$
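A small NumPy sketch of the vector operations above (unit vector, dot product, projections, rejection, cross product); the vectors `a` and `b` are arbitrary examples.

```python
import numpy as np

# Arbitrary example vectors.
a = np.array([3.0, 4.0, 0.0])
b = np.array([1.0, 0.0, 0.0])

unit_a = a / np.linalg.norm(a)               # unit vector a / |a|
dot = a @ b                                  # a . b = |a||b|cos(theta)
comp_b_a = dot / np.linalg.norm(b)           # scalar projection of a onto b
proj_b_a = (dot / np.linalg.norm(b)**2) * b  # vector projection of a onto b
rejection = a - proj_b_a                     # component of a orthogonal to b
cross = np.cross(a, b)                       # vector with magnitude |a||b|sin(theta)
```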
### Lines & Planes
Vector equation of a line:
$\mathbf{r}(t) = \mathbf{r_0} + t\,\mathbf{v}$
Vector equation of a plane:
$(\bf{r} - \bf{r_0}) \cdot \bf{n} = 0$
Scalar equation of a plane, where $a,b,c$ are components of the normal vector:
$a(x-x_0) + b(y-y_0) + c(z-z_0) = 0$
Distance from the point $P=(x_1,y_1,z_1)$ to the plane $Ax+By+Cz+D=0$:
$d = \frac{|Ax_1+By_1+Cz_1 +D|}{\sqrt{A^2+B^2+C^2}}$
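A small NumPy sketch of the parametric line and the point-to-plane distance above; the plane coefficients and the point `P` are arbitrary examples.

```python
import numpy as np

# Line through r0 with direction v: r(t) = r0 + t*v (arbitrary example values).
r0 = np.array([1.0, 2.0, 3.0])
v = np.array([0.0, 1.0, -1.0])
r = lambda t: r0 + t * v

# Distance from point P to the plane Ax + By + Cz + D = 0.
A, B, C, D = 2.0, -1.0, 2.0, 4.0
P = np.array([1.0, 1.0, 1.0])
n = np.array([A, B, C])                 # normal vector
d = abs(n @ P + D) / np.linalg.norm(n)
```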
%%
etc
https://github.com/DonneyF/MATH217FormulaSheet/blob/master/FormulaSheet.tex
%%
%%
## NN
A Value stores its data, its gradient, its child nodes in the computation graph (a DAG), and a function that computes its local gradient.
Value also implements a backward pass that propagates the gradient to all children by calling each node's internal gradient function in reverse topological order.
When an operation is run on two Values, or on a Value and a number, the result is computed normally and the new node's gradient function is set from the partial derivatives of that operation.
An MLP is made of layers, each made of neurons; a neuron holds one Value for its bias and one Value per weight.
An MLP can return its parameters: the parameters of its layers, which are the parameters of their neurons, i.e. the weights plus the bias.
The loss function, for a single example or a batch, runs the model's forward pass, compares the resulting scores with the given labels, and averages these per-example differences into the final loss.
To limit overfitting we can add L2 regularization: a small penalty, e.g. $10^{-4} \times \sum p \cdot p$ over the model parameters, added to the loss.
The accuracy is the percentage of correctly predicted labels
The training/optimisation loop simply computes the loss, zeroes the model's gradients, runs backpropagation on the loss, and updates the model's weights according to their gradients, as in the sketch below.
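A compressed, micrograd-style sketch of the Value / neuron / training-loop idea described above. The names (`Neuron`, `train_step`), the ReLU activation, the learning rate and the toy data are illustrative choices; the 1e-4 L2 factor is the one mentioned above.

```python
import random

class Value:
    """Scalar value with its gradient, its children in the DAG, and a local backward rule."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = set(children)
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad            # d(a+b)/da = 1
            other.grad += out.grad           # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    __radd__, __rmul__ = __add__, __mul__

    def relu(self):
        out = Value(max(0.0, self.data), (self,))
        def _backward():
            self.grad += (out.data > 0) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the DAG, then apply each node's local gradient rule in reverse.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

class Neuron:
    """One neuron: a bias Value plus one weight Value per input (layers/MLPs just stack these)."""
    def __init__(self, nin):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(0.0)
    def __call__(self, x):
        return sum((wi * xi for wi, xi in zip(self.w, x)), self.b).relu()
    def parameters(self):
        return self.w + [self.b]

def train_step(params, forward, xs, ys, lr=0.05, alpha=1e-4):
    """One loop iteration: loss -> zero grads -> backward -> weight update."""
    scores = [forward(x) for x in xs]
    data_loss = sum(((s + (-y)) * (s + (-y)) for s, y in zip(scores, ys)), Value(0.0)) * (1.0 / len(xs))
    reg_loss = alpha * sum((p * p for p in params), Value(0.0))   # L2 penalty: 1e-4 x sum(p*p)
    loss = data_loss + reg_loss
    for p in params:
        p.grad = 0.0                 # zero the gradients
    loss.backward()                  # backpropagate through the DAG
    for p in params:
        p.data -= lr * p.grad        # gradient descent step
    return loss.data

# Example: a single neuron on 2-D toy inputs (illustrative data).
n = Neuron(2)
xs = [[0.5, -1.0], [1.5, 2.0]]
ys = [1.0, 0.0]
for _ in range(20):
    train_step(n.parameters(), n, xs, ys)
```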
%%
## Calculus & Derivatives
![[Pasted image 20221103112335.png]]
## Python cheatsheet
![[python-cheatsheets.pdf]]
## External links
<iframe width="560" height="315" src="https://www.youtube.com/embed/VMj-3S1tku0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<iframe width="560" height="315" src="https://www.youtube.com/embed/a0-mULz6nhI" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>