machine_learning
#define machine_learning: \
I------------------------------------------\
I------------------------------------------\
I \
I /$$$$$$ /$$$$$$ \
I /$$__ $$|_ $$_/ \
I | $$ \ $$ | $$ \
I | $$$$$$$$ | $$ \
I | $$__ $$ | $$ \
I | $$ | $$ | $$ \
I | $$ | $$ /$$$$$$ \
I |__/ |__/|______/ \
I------------------------------------------\
I------------------------------------------I
# accurate cartoon depiction
https://xkcd.com/1838/
# introductory resource
http://neuralnetworksanddeeplearning.com
# "Machine Learning in C" (from scratch)
https://www.youtube.com/watch?v=PGSba51aRYU&list=PLpM-Dvs8t0VZPZKggcql-MmjaBdZKeDMw
# Machine Learning from scratch in Python
https://github.com/agvxov/neural_network_from_scratch
# music theme for the chapter
https://www.youtube.com/watch?v=j8wHeVbPQI8
" \
  Breakthroughs that lead to the Singularity could be reached by anyone. \
It could happen literally anywhere. If a few geniuses set up shop \
in Theodore Kaczynski’s former log cabin in Montana, \
the Singularity could begin there. \
" - Michell Heisman, Suicide Note, 1708.
• "But bro, when will i ever use matrix operations and calculus?\
Those are such a waste of time!"
Here. At the same time, in fact. Enjoy hell, stalker child.
• think about it this way:
    we are trying to construct a function based on given inputs and outputs;
    most of the time these input-output pairs are partial too,
    meaning there are inputs for which we do not know the output;
    mathematically speaking, we have no bloody clue what we are doing;
    if such a function already exists, we do not know about it;
    for this reason, we wish to deploy some method that approximates
    our desired function as closely as possible;
    we construct a model based on our known input-output pairs:
    we start with a random function and a way to measure how
    well it performs compared to the desired function
    (by comparing its outputs with the desired outputs),
    then, using derivatives, we find in which direction to tweak the
    values of our approximation function to get closer to our
    desired function (sketched below)
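    — a minimal Python sketch of the above (plain Python; the data points and the form of the
       guessed function are my own illustrative choices):
    {
      # known input-output pairs; the "desired function" behind them is unknown to us
      pairs = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

      # start with a guessed function f(x) = w*x + b and bad initial values
      w, b = 0.5, 0.0
      learning_rate = 0.01

      for epoch in range(2000):
          # measure how badly we do (mean squared error) and accumulate
          # the derivatives of that error with respect to w and b
          dw = db = 0.0
          for x, y in pairs:
              error = (w * x + b) - y
              dw += 2 * error * x / len(pairs)
              db += 2 * error     / len(pairs)
          # tweak the parameters in the direction that lowers the error
          w -= learning_rate * dw
          b -= learning_rate * db

      print(w, b)   # converges towards 2 and 1
    }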
NEURONS: NEURONS:
• roughly imitates biological (human) neurons
Perceptron: Perceptron:
https://www.youtube.com/watch?v=4Gac5I64LM4
        • also referred to as a "single-layer neural network"
○ components
1. Inputs
2. Weights
3. Bias
4. Threshold
5. Output
• the original virtual neuron
        • each input is binary; only the weights are fractional
        • every input is multiplied by its corresponding weight;
           the products are then summed;
           the bias is added;
           this sum is judged against a threshold value
        • any minor change in the weights will most likely result in major changes in the output
        • one could set up the weights by hand or brute-force them, but both are tedious
input-1 \
‾‾\ __
input-2 ----| D>---> output
/ ‾‾
input-3 /‾‾
                   /  if Σⱼxⱼwⱼ ≤ threshold then 0
           output {
                   \  if Σⱼxⱼwⱼ > threshold then 1
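        — a minimal Python sketch of a single perceptron; the weights and bias are hand-picked
           illustrative values that make it compute logical AND:
        {
          def perceptron(inputs, weights, bias, threshold=0):
              # weighted sum of the binary inputs, plus the bias
              s = sum(x * w for x, w in zip(inputs, weights)) + bias
              return 0 if s <= threshold else 1

          AND = lambda a, b: perceptron([a, b], weights=[1, 1], bias=-1.5)

          for a in (0, 1):
              for b in (0, 1):
                  print(a, b, '->', AND(a, b))   # 1 only when both inputs are 1
        }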
Logical_neuron: Logical_neuron:
https://en.wikipedia.org/wiki/Activation_function
○ components
1. Inputs
2. Weights
3. Bias
4. Activation function
5. Deactivation function
6. Output
        • with most activation functions, the output is a fraction between 0 and 1
I₁ ____
\ * W₁
‾‾‾‾‾\__ .───┬───.
* W₂ \│ │ '─.
I₂ --------------│ ∑ │ f[] >----
__/│ │ .─'
* W₃/ '───┴───'
I₃ ____/‾‾‾‾‾
— the original activation function is the Sigmoid function:
                  1
             ─────────────
              1 + exp(-x)
• a minor change in a weight only results in a minor change in the output
• the deciding property of an activation function's fitness is its shape
{
1.0 ├─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
│ ___...|
│ _..--``
│ ,' │
│ ,‾
│ - │
│ .
│ - │
│ _'
│ _-' │
│___...--""
0.0 ┼─────────────────┼─────────────────┴
. -5 0 5
}
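        — a minimal Python sketch of a single sigmoid neuron (input and weight values are my own);
           note how a tiny weight change now only nudges the output:
        {
          import math

          def sigmoid(x):
              # squashes any real number into the (0, 1) range
              return 1 / (1 + math.exp(-x))

          def neuron(inputs, weights, bias):
              s = sum(x * w for x, w in zip(inputs, weights)) + bias
              return sigmoid(s)

          print(neuron([0.5, 0.8], [0.40, -0.20], 0.1))   # some fraction between 0 and 1
          print(neuron([0.5, 0.8], [0.41, -0.20], 0.1))   # barely differs from the above
        }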
NEURAL_NETWORKS: NEURAL_NETWORKS:
• "NN"
• this new and revolutionary technology that will give us AGI
within 2 more weeks without a paradigm shift is from 1957
○ components:
1. Neurons
2. Architecture
           3. Loss function
           4. Learning algorithm
○ typical visual representation of the neural network
Physical architecture of single perceptron:
__
input ---| D>---> output
‾‾
Virtualized architecture of single perceptron:
_ _ _
{ }--->{ }--->{ }
‾ ‾ ‾
┃Input┃Hidden┃Output┃
┃layer┃layers┃ ┃
Feedforward Network
━━━━━━━━━━━━━━━━━━━━━▶
Dataflow
Architecture: Architecture:
          • the method by which neurons are logically ordered;
             in practice this means a web formed by piping
             the outputs of neurons into the inputs of others
Layers: Layers:
          • a layer is a group of neurons which do not communicate with each other,
             but do share their input and output neurons
          • the input layer is a virtual layer that corresponds to the input
          • the output layer is a virtual layer that corresponds to the output values
of the last physical layer
• a hidden layer is a layer between the input and the output layers
{
}
          • a network where each layer's outputs are fed as the next layer's inputs,
             but not elsewhere and always in one direction,
             is called a feedforward network
          • a non-feedforward network, where feedback loops are implemented,
             is called a recurrent network
• recurrent networks are more similar to the human brain than feedforward networks
• feedforward networks are easier to work with, so they enjoy the privilege of being
more researched
Network_of_Perceptron_neurons: Network_of_Perceptron_neurons:
XOR_problem: XOR_problem:
• cause of the first AI winter
          • it is said that a perceptron is unable to learn to calculate
             the logical operation exclusive or (XOR)
          • Σⱼxⱼwⱼ is a linear equation, meaning that in the plane it defines a line
• only linearly separable problems are solvable
{
1 │ x x
│
│
│
│
0 │ x x
┼──────────────
0 1
1 │ x x
│ A
│--..
│ ``--..
│ B ``--
0 │ x x
┼──────────────
0 1
1 │ x x 1 │ x '. x 1 │ x '. x
│'. │ '. true │ '. false
│ '. true │ '. │ '.
│ '. │ false '. │ true '.
│false '. │ '. │ '.
0 │ x '. x 0 │ x x 0 │ x x
┼────────────── ┼────────────── ┼──────────────
0 1 0 1 0 1
1 │ x/ x /
│? / ??? /
│ / /
│/ /
│ ?? / ??
0 │ x / x
┼──────────────
0 1
▲
X X │ X
│ X
X A │A
A │ AA X
◀────────┼────────▶
A A│ A X
X │
X │ X
│X
▼
▲ π
│ X
│ A X
│ X
│ A X
│ A X
│ AAA X
┼─────────────▶ R
│ A X
│A X
│
│ A X
│ X
│ X
▼ -π
}
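        — XOR stops being a problem once a second layer is allowed; a minimal Python sketch
           composing hand-picked perceptrons (my own weights) into XOR = AND(OR, NAND):
        {
          def step(inputs, weights, bias):
              # single perceptron: weighted sum, then threshold at 0
              s = sum(x * w for x, w in zip(inputs, weights)) + bias
              return 0 if s <= 0 else 1

          def XOR(a, b):
              # hidden layer: two perceptrons, each linearly separable on its own
              h_or   = step([a, b], [ 1,  1], -0.5)   # OR
              h_nand = step([a, b], [-1, -1],  1.5)   # NAND
              # output layer: AND of the two hidden outputs
              return step([h_or, h_nand], [1, 1], -1.5)

          for a in (0, 1):
              for b in (0, 1):
                  print(a, b, '->', XOR(a, b))   # 0 1 1 0
        }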
Training: Training:
• "learning"/"fitting"
        • the process of reassigning weights with the intent of gaining better outputs
        • overfitting is the phenomenon when a model has adapted to the learning data
           so well that it is unable to perform well on other data
• the more complex the model, the more probable overfitting is
Supervision: Supervision:
Supervised_learning:
• learning data is labeled
Unsupervised_learning:
• learning data is not labeled
• the model forms its own concepts in the form of clusters
• requires significantly more data for effective training
• the resulting models tend to be more creative
{more reliable on data which was not in the learning set;
creates better AI art}
Learning_rate: Learning_rate:
        • once the direction to converge toward has been calculated, the value of the learning
           rate indicates the amount of change that should take effect
        • the learning rate doesn't actually "know" how much to change,
           it's a(n educated) guess
        • the learning rate could cause the model to continuously overshoot the optimal
           values or to converge way too slowly (demonstrated in the sketch below)
— typical value interpretations:
            ...  - 0.01 : small; slow but stable convergence
            0.01 - 0.1  : the usual starting range
            0.1  - ...  : large; risks overshooting the optimum
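        — a minimal Python sketch demonstrating the point on a toy loss (w - 3)²
           (the learning rates and step count are my own illustrative picks):
        {
          def descend(learning_rate, steps=20):
              w = 0.0
              for _ in range(steps):
                  gradient = 2 * (w - 3)           # derivative of the loss (w - 3)^2
                  w -= learning_rate * gradient    # update scaled by the learning rate
              return w

          print(descend(0.001))   # ~0.12  -> converges way too slowly
          print(descend(0.1))     # ~2.97  -> reasonable
          print(descend(1.1))     # ~-112  -> every step overshoots further; diverges
        }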
optimizer:
           • an object or function which is in charge of dynamically changing the learning rate
• consults the loss
— common optimizers:
• "Stochastic Gradient Descent
• "ADAptive Moment estimation"
• "Nonlinear ADAM"
Weight_updating: Weight_updating:
        • traditionally weights are updated once every epoch
        • when weights are updated after each data point, that is called online learning;
           it's often the simplest approach when an epoch's worth of data cannot
           fit into memory at once
Random:
• the brute forcing of weights
• can work ok-ish on very small networks
• basically useless, mostly for demonstration purposes
or to serve as a baseline
Finite_difference:
             Δf(x) = f(x + b) − f(x + a)
           • approximates the derivative: f'(x) ≈ (f(x + ε) − f(x)) / ε for a small ε
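           — a minimal Python sketch of finite-difference weight updating on a one-weight model
              (the toy loss, ε and learning rate are my own illustrative picks):
           {
             def loss(w):
                 # toy loss: how far f(x) = w*x is from the target output 6 at input x = 2
                 return ((w * 2) - 6) ** 2

             w, eps, learning_rate = 0.0, 1e-4, 0.05
             for _ in range(100):
                 # (loss(w + eps) - loss(w)) / eps approximates the derivative at w
                 gradient = (loss(w + eps) - loss(w)) / eps
                 w -= learning_rate * gradient

             print(w)   # close to 3, where the loss is 0
           }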
Backpropagation:
https://towardsdatascience.com/understanding-backpropagation-algorithm-7bb3aa2f95fd
https://neptune.ai/blog/backpropagation-algorithm-in-neural-networks-guide
https://pyimagesearch.com/2021/05/06/backpropagation-from-scratch-with-python/
           • learning algorithm based on gradient descent and utilizing
the Leibniz chain rule
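           — a minimal Python sketch of the chain rule on a single sigmoid neuron
              (inputs, target and learning rate are my own illustrative picks):
           {
             import math

             def sigmoid(z):
                 return 1 / (1 + math.exp(-z))

             x, target = [0.5, -0.3], 1.0
             w, b, learning_rate = [0.1, 0.2], 0.0, 0.5

             for _ in range(1000):
                 # forward pass
                 z = sum(xi * wi for xi, wi in zip(x, w)) + b
                 a = sigmoid(z)
                 # backward pass, chain rule: dloss/dw = dloss/da * da/dz * dz/dw
                 dloss_da = 2 * (a - target)
                 da_dz    = a * (1 - a)               # derivative of the sigmoid
                 delta    = dloss_da * da_dz
                 w = [wi - learning_rate * delta * xi for wi, xi in zip(w, x)]
                 b -= learning_rate * delta

             print(sigmoid(sum(xi * wi for xi, wi in zip(x, w)) + b))   # climbs towards 1.0
           }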
Reinforcement:
        • used when arriving at the right conclusion requires a number of steps
        • the desired result is either unknown or very hard to create a dataset from
        • an agent monitors the environment and guesses the next best action to take
        — a reward policy (reward function) is used to determine when the agent performed well:
           • akin to loss calculation, but no gradient descent is necessary
           • when the policy only rewards the end result,
              it's possible that the agent never stumbles upon it even by pure chance,
              making it unable to optimize its solution
           • when the policy rewards small actions, it's possible that it will overfit
              to partial solutions, coming up with the most retarded of practices
           • it's hard to come up with a good policy for complex problems
• models are NOT portable between environments
• more hyperparameter sensitive than supervised learning
{
observation ┏━━━━━━━┓ action
┌───────────▶┃ Agent ┃────────┐
│ ┗━━━━━━━┛ │
│ ▲ │
│ │reward │
│ │ │
│ ┏━━━━━━━━━━━━━┓ │
└─────────┃ Environment ┃◀────┘
┗━━━━━━━━━━━━━┛
}
Fine_tuning:
• "transfer learning"
• common technique
• an already trained model being adapted to a more specific task
• being given a pretrained model and fine-tuning it is significantly
cheaper and faster than training from scratch
• the fine-tuning can be done on proprietary or obscure data
• full fine-tuning is fine-tuning that uses an identical process
to the initial training
        • partial fine-tuning is fine-tuning where only a select subset
           of the weights is updated, the rest are kept intact;
           usually the outer layers are updated and the intuition
           of the deep layers is reused
• additive fine-tuning is fine-tuning where new parameters are inserted;
sometimes entire layers are added; this helps the model retain its intelligence
similar to full fine-tuning, but is significantly cheaper
• prompt tuning involves preprocessing the user prompt;
the preprocessing is usually done by another,
significantly faster model that appends keywords,
examples to the desired output format, tone or bias
• RAG involves vector searching a document based on the user input
and further prompt tuning with this additional context;
traditionally not considered fine-tuning
Dataset: Dataset:
        • the data available during development time to train/test on
— the data set is usually split:
• training data; fed to the machine while it learns
           • testing data; allocated for testing after learning is finished;
              useful for finding out how well the model does on data that
              it has never seen before but which is of the same quality as the training data
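        — a minimal Python sketch of the split; the 80/20 ratio is only a common convention:
        {
          import random

          dataset = list(range(100))            # stand-in for labeled data points
          random.shuffle(dataset)               # avoid ordering bias before splitting

          cut = int(len(dataset) * 0.8)
          training_data = dataset[:cut]         # fed to the model while it learns
          testing_data  = dataset[cut:]         # only touched after learning is finished

          print(len(training_data), len(testing_data))   # 80 20
        }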
Augmentation:
• the process of generating more training data from the initial training data
• used for avoiding overfitting
— usually done by applying basic transformations to the dataset
• rotation
• zoom
• flipping
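           — a minimal numpy sketch (assumes numpy; the "image" is a stand-in array)
              of squeezing extra samples out of one image:
           {
             import numpy as np

             image = np.arange(9).reshape(3, 3)    # stand-in for a tiny grayscale image

             augmented = [
                 image,
                 np.fliplr(image),                 # horizontal flip
                 np.flipud(image),                 # vertical flip
                 np.rot90(image),                  # 90 degree rotation
             ]

             print(len(augmented), "samples from 1 original")
           }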
PCA: PCA:
• "Pricipal Component Analisys"
• in datasets, often times the same variable is encoded multiple times
• finding and removing redundancy in data
• "reducing dimension while perserving the variance present"
• in the context of NNs, it referes to optimizing the input
for training times
{ downsizing images to the edge of recognizability;
removing noise and color from images;
          stripping one of the height-in-cm / height-in-inches columns of a horse dataset when both are available
}
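        — a minimal numpy sketch (assumes numpy; the heights are made-up values) of PCA via SVD:
           the cm and inch columns are perfectly correlated, so a single principal component
           preserves (practically) all the variance:
        {
          import numpy as np

          height_cm = np.array([150.0, 160.0, 170.0, 180.0])
          data = np.column_stack([height_cm, height_cm / 2.54])   # same variable twice (cm, inch)

          centered = data - data.mean(axis=0)
          # SVD yields the principal directions; the singular values show
          # how much variance each direction carries
          U, S, Vt = np.linalg.svd(centered, full_matrices=False)
          print(S)                      # the second value is ~0: one dimension is redundant

          reduced = centered @ Vt[0]    # project onto the first principal component
          print(reduced.shape)          # (4,)  -- 2 columns reduced to 1
        }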
Batching: Batching:
        • packing the dataset into smaller collections
        • each batch is used independently to adjust weights
        • smaller memory footprint
        • the training data can be arbitrarily large and still processable
        • more frequent weight adjustments (might have a positive effect on
           model performance)
        • less accurate gradient estimation per adjustment (sketched below)
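        — a minimal Python sketch of slicing a dataset into batches:
        {
          def batches(dataset, batch_size):
              # yield one batch at a time so the whole dataset never has to sit in memory at once
              for i in range(0, len(dataset), batch_size):
                  yield dataset[i:i + batch_size]

          data = list(range(10))
          for batch in batches(data, batch_size=4):
              print(batch)   # [0..3], [4..7], [8, 9] -- each used for one weight adjustment
        }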
Tokenization: Tokenization:
• encoding for NNs
        • neural networks can only understand arrays of numbers,
           yet to be useful they need to be fed something more complex
        • any data to be fed to a NN must be encoded
        • in the case of average NNs, this means the data points are rescaled
           to fit the bounds of the activation function and
           flattened so that every data point has its own input neuron,
           in a 1-dimensional manner
{
┌───────┐
│ # # │
│/''--__│
│_---'''│
└───────┘
${GRAYSCALE_VALUE}
────────────────────
255
${ASCII_VALUE}
──────────────────
127
' ' (#32); ' ' (#32); ' ' (#32); '#' (#35); ' ' (#32); '#' (#35); ' ' (#32);
'/' (#47); ''' (#39); ''' (#39); '-' (#45); '-' (#45); '_' (#95); '_' (#95);
'_' (#95); '-' (#45); '-' (#45); '-' (#45); ''' (#39); ''' (#39); ''' (#39);
' ' (#0.25); ' ' (#0.25); ' ' (#0.25); '#' (#0.28); ' ' (#0.25); '#' (#0.28); ' ' (#0.25);
'/' (#0.37); ''' (#0.31); ''' (#0.31); '-' (#0.35); '-' (#0.35); '_' (#0.75); '_' (#0.75);
'_' (#0.75); '-' (#0.35); '-' (#0.35); '-' (#0.35); ''' (#0.31); ''' (#0.31); ''' (#0.31);
' ' (#0.25); ' ' (#0.25); ' ' (#0.25); '#' (#0.28); ' ' (#0.25); '#' (#0.28); ' ' (#0.25); '/' (#0.37); ''' (#0.31); ''' (#0.31); '-' (#0.35); '-' (#0.35); '_' (#0.75); '_' (#0.75); '_' (#0.75); '-' (#0.35); '-' (#0.35); '-' (#0.35); ''' (#0.31); ''' (#0.31); ''' (#0.31);
[0.25, 0.25, 0.25, 0.28, 0.25, 0.28, 0.25, 0.37, 0.31, 0.31, 0.35, 0.35, 0.75, 0.75, 0.75, 0.35, 0.35, 0.35, 0.31, 0.31, 0.31]
}
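        — a minimal Python sketch reproducing the encoding above: each ASCII character becomes
           its code point scaled into [0, 1], and the rows are flattened into one 1-dimensional list:
        {
          # the three rows of the ASCII image above (as in the character lists)
          art = [
              "   # # ",
              "/''--__",
              "_---'''",
          ]

          # every character becomes ASCII_VALUE / 127, flattened row by row
          tokens = [round(ord(c) / 127, 2) for row in art for c in row]
          print(tokens)   # [0.25, 0.25, 0.25, 0.28, 0.25, 0.28, 0.25, 0.37, 0.31, ...]
        }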
Natural_language: Natural_language:
• when tokenizing natural language, per character encoding is usually not the best idea
• tokenizing by words / word segments can yield better results and require smaller networks
        { > # Word tokenization in Tensorflow
          > from tensorflow.keras.preprocessing.text import Tokenizer
          > examples = ['Heyo world!', 'Goodbye cruel world']
          > t = Tokenizer()
          > t.fit_on_texts(examples)   # builds the word index in place
          > print(t.word_index)
{'world': 1, 'heyo': 2, 'goodbye': 3, 'cruel': 4}
}
Autoencoder:
• encodes/decodes
        • translates data to a more efficient representation, then attempts to reconstruct it
Encoding Decoding
_ _
|i|' '|o|
|n| ' _ ' |u|
|p| '|#|' |t|
|u| .|#|. |p|
|t| . ‾ . |u|
| |. A .|t|
‾ | ‾
dense representation
input ~ output
        • it teaches the AI to do its own PCA (see AT ?!; "principal component analysis")
• can be used to remove noise {damaged images} from input
• often used as a component for larger systems (?!)
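        — a minimal Keras sketch (assumes tensorflow; the 784/32 layer sizes are my own picks)
           of an autoencoder squeezing its input through a dense representation:
        {
          import tensorflow as tf
          from tensorflow.keras import layers, Model

          inputs  = layers.Input(shape=(784,))                     # e.g. a flattened 28x28 image
          code    = layers.Dense(32, activation='relu')(inputs)    # encoder: dense representation
          outputs = layers.Dense(784, activation='sigmoid')(code)  # decoder: attempted reconstruction

          autoencoder = Model(inputs, outputs)
          autoencoder.compile(optimizer='adam', loss='mse')        # loss: how far output ~ input
          # autoencoder.fit(x, x, ...)    # note: the input doubles as its own target
          autoencoder.summary()
        }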
FCN:
• "Fully Connected neural Network"
Convolution:
"?!/Image recognition/kernel"
• kernel operation
• ${N} dimensional (usually 2)
        • a CNN or "Convolutional Neural Network" contains at least one convolutional layer
        • retains spatial information
        • generally good at computer vision tasks
        • smaller kernels generally perform better
        • applying a stride results in an output of a different size
○ hyperparameters
• kernel size
• strides (kernel shift amount) (>1 further reduces the output size)
• activation
• padding (dummy border for the input to modify {preserve} output size)
sizeof(output) := sizeof(input) - sizeof(kernel) + 1
ₘ₋₁ ₘ₋₁
y₍ᵢ,ₕ₎ := ∑ ∑ f₍ₖ,ₗ₎ * x₍ᵢ₊ₖ,ₕ₊ₗ₎
ᵏ⁼⁰ ˡ⁼⁰
+--+--+
| 1| 2|
┌──────────+--+--+────────┐
│ | 2| 1| │
│ +--+--+ │
│ │
│ Kernel │
│ │
               #=====#--+--+         #==#--+--+
               I 1| 2I 2| 1|         I13I11| 6|
               I--+--I--+--+         #==#--+--+
               I 3| 2I 1| 0|         |11|11|11|
               #=====#--+--+         +--+--+--+
               | 1| 2| 3| 4|         |12|11|16|
               +--+--+--+--+         +--+--+--+
               | 3| 1| 1| 3|
               +--+--+--+--+
                   Input                Output
— max pooling:
           • simply outputs the max value inside the kernel window
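        — a minimal numpy sketch (assumes numpy) reproducing the worked example above,
           plus a 2x2 max pool over its output:
        {
          import numpy as np

          def convolve2d(x, kernel):
              m = kernel.shape[0]
              out = np.zeros((x.shape[0] - m + 1, x.shape[1] - m + 1))
              for i in range(out.shape[0]):
                  for h in range(out.shape[1]):
                      # y[i,h] = sum over k,l of kernel[k,l] * x[i+k, h+l]
                      out[i, h] = np.sum(kernel * x[i:i+m, h:h+m])
              return out

          x = np.array([[1, 2, 2, 1],
                        [3, 2, 1, 0],
                        [1, 2, 3, 4],
                        [3, 1, 1, 3]])
          kernel = np.array([[1, 2],
                             [2, 1]])

          y = convolve2d(x, kernel)
          print(y)         # rows: 13 11 6 / 11 11 11 / 12 11 16

          # max pooling: output the maximum value inside each (here 2x2, stride 1) window
          pooled = np.array([[y[i:i+2, h:h+2].max() for h in range(y.shape[1] - 1)]
                             for i in range(y.shape[0] - 1)])
          print(pooled)    # rows: 13 11 / 12 16
        }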
RNN:
• "Recurrent Neural Network"
        • neurons are laid out in a self-feeding architecture
• the information flow is recursive
• traditionally used to solve sequence-to-sequence (seq2seq) problems {translation}
Transformers:
https://www.youtube.com/watch?v=iDulhoQ2pro
arXiv:1706.03762
• modified feedforward networks
        • have the advantages of RNNs
• unlike RNNs they can be easily parallelized on a large scale
Multihead_attention:
┌────┴────┐
│ Concat │
└─────────┘
▲
│
┌┼┐
│││
┌───────││┴──────────┐
┌────────│┴──────────┐│
┌─────────┴──────────┐││
│ Scaled Dot-Product ││┘
│ Attention │┘
└─┬───────┬────────┬─┘
┌───┘││ │││ ││└──┐
│┌───┘│ │││ │└──┐│
││┌───┘ │││ └──┐││
┌──││┴───┐ ┌──││┴───┐ ┌──││┴───┐
┌───│┴───┐│ ┌───│┴───┐│ ┌───│┴───┐│
┌────┴───┐│┘┌────┴───┐│┘┌────┴───┐│┘
│ Linear │┘ │ Linear │┘ │ Linear │┘
└────────┘ └────────┘ └────────┘
▲ ▲ ▲
│ │ │
Query Key Value
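        — a minimal numpy sketch (assumes numpy; the token counts and dimension are my own picks)
           of the Scaled Dot-Product Attention block in the figure, softmax(QKᵀ/√dₖ)V;
           each head of multi-head attention first pushes Q, K and V through its own Linear layer:
        {
          import numpy as np

          def softmax(x):
              e = np.exp(x - x.max(axis=-1, keepdims=True))   # numerically stable softmax
              return e / e.sum(axis=-1, keepdims=True)

          def scaled_dot_product_attention(Q, K, V):
              d_k = K.shape[-1]
              scores = Q @ K.T / np.sqrt(d_k)    # how much each query attends to each key
              return softmax(scores) @ V         # weighted mix of the values

          rng = np.random.default_rng(0)
          Q = rng.normal(size=(4, 8))            # 4 query tokens of dimension 8
          K = rng.normal(size=(6, 8))            # 6 key tokens
          V = rng.normal(size=(6, 8))            # one value per key

          print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
        }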
Architecture:
Output Probabilities
▲
│
┌─────────┐
│ Softmax │
└─────────┘
▲
│
┌────────┐
│ Linear │
└────────┘
▲
├─┐
├┐│
┌──────││┼──────────┐
┌───────│┼──────────┐│
┌────────┼──────────┐││
│ ┌──────┴──────┐ │││
│ │ Add & Norm │<┐ │││
│ └──────┬──────┘ │ │││
┌────────────────┐ │ ┌──────┴──────┐ │ │││
┌────────────────┐│ │ │ Feed │ │ │││
┌────────────────┐││ │ │ Forward │ │ │││
│││ │││ │ └─────────────┘ │ │││
┌──────││┼──────────┐ │││ │ ▲ │ │││
┌───────│┼──────────┐│ │││ │ ├────────┘ │││
┌────────┼──────────┐││ │││ │ ┌─────────────┐ │││
│ ┌──────┴──────┐ │││ │││ │ │ Add & Norm │<┐ │││
│ │ Add & Norm │<┐ │││ │││ │ └──────┬──────┘ │ │││
│ └──────┬──────┘ │ │││ │││ │ ┌──────┴──────┐ │ │││
│ ┌──────┴──────┐ │ │││ │││ │ │ Masked │ │ │││
│ │ Feed │ │ │││ │││ │ │ Multi-Head │ │ │││
│ │ Forward │ │ │││ │││ │ │ Attention │ │ │││
│ └─────────────┘ │ │││ ││└───│ └─────────────┘ │ │││
│ ▲ │ │││ │└────│ ▲ ▲ ▲ │ │││
│ ├────────┘ │││ └─────┼───┴────┘ ├───┘ │││
│ ┌─────────────┐ │││ │ ┌─────────────┐ │││
│ │ Add & Norm │<┐ │││ │ │ Add & Norm │<┐ │││
│ └──────┬──────┘ │ │││ │ └──────┬──────┘ │ │││
│ ┌──────┴──────┐ │ │││ │ ┌──────┴──────┐ │ │││
│ │ │ │ │││ │ │ Masked │ │ │││
│ │ Multi-Head │ │ │││ │ │ Multi-Head │ │ │││
│ │ Attention │ │ │││ │ │ Attention │ │ │││
│ └─────────────┘ │ │││ │ └─────────────┘ │ │││
│ ▲ ▲ ▲ │ │││ │ ▲ ▲ ▲ │ │││
│ └────┼────┘ │ ││┘ │ └────┼────┘ │ ││┘
│ ├────────┘ │┘ │ ├────────┘ │┘
└────────┼──────────┘ └────────┼──────────┘
▲ ▲
│ │
/‾\ /‾\
| + | Positonal - Encodings | + |
\_/ \_/
▲ ▲
│ │
┌───────────┐ ┌───────────┐
│ Input │ │ output │
│ Embedding │ │ Embedding │
└───────────┘ └───────────┘
▲ ▲
│ │
Inputs Outputs shifted right
LLM:
https://thqihve5.bearblog.dev/the-modern-library-of-babel/
• "Large Language Models"
— the context window is the largest input a model can take;
since they have no other "mental" storage, this is practically their memory span;
measured in tokens
Hyperparameter_optimization:
        • a hyperparameter is a configurable setting of the model that is
           not learned during training {architecture; activation function}
        • the problem with hyperparameter optimization in the field of AI is that
           we have no mathematical way of knowing how different hyperparameters will
           perform, except for eval-ing them of course, but that's expensively expensive
        • educated guessing while eval-ing a few different setups is the best we can do
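        — a minimal Python sketch of that educated guessing as a grid search
           (the hyperparameter values and the evaluate() stand-in are made up):
        {
          import itertools

          def evaluate(learning_rate, hidden_neurons):
              # stand-in for a (very expensive) train-and-eval run returning a validation loss
              return abs(learning_rate - 0.01) + abs(hidden_neurons - 64) / 100

          grid = {
              'learning_rate':  [0.001, 0.01, 0.1],
              'hidden_neurons': [16, 64, 256],
          }

          best = min(itertools.product(*grid.values()),
                     key=lambda setup: evaluate(*setup))
          print(best)   # (0.01, 64) -- the setup with the lowest validation loss
        }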
Coding_with_AI:
— vibe coding:
• the act of coding with an LLM and forgetting about the existence of the code
• meant to convey a sense of relief and programming without obstacles
• the term is often used by otherwise tech-illiterate people
• one-way-ticket to introduce security vulnerabilities into your code
which have been otherwise dead since the late 2000's
• the coining of the term marks the official beginning of the competency crisis
        • for high-purity, stable environments, LLMs excel at fetching documentation
• can be used as a linter
• can be used for spotting obvious bugs which the developer misses
due to hyper-fixation on other parts { calling "chroot" instead of "chdir" }
        • the smaller the language, the more confident you can be in
           the quality of the AI's suggestions
— incredible as a (very expensive) transpiler:
• you provide pseudo code
• you can allow the AI to implement smaller functions based on its signature
• works especially well for scripting
           • the code will come out as high quality as if it had been hand-written,
              but the annoyances of scripting are skipped
• all typos will be fixed
• all libraries are resolved
• all library function names are guessed based on the description
• all argument orders are corrected
Tips:
• do not ask the model to omit comments;
even tho its extensive usage of comments is annoying,
it helps with its accuracy
• indentation and proper usage of newlines is enough
for the model to judge the span of code blocks
{
complete the following pseudo code:
perl script
strict mode
function usage
print name "<file>"
die
if no arguments
print usage
end if
lines = open argv[1]
if error usage
while last line matches regex "#define .*"
delete last line
print lines
#!/usr/bin/perl
use strict;
use warnings;
sub usage {
print "$0 <file>\n";
die;
}
@ARGV == 1 or usage();
open my $fh, '<', $ARGV[0] or usage();
my @lines = <$fh>;
close $fh;
while (@lines && $lines[-1] =~ /#define .*/) {
pop @lines;
}
print @lines;
}