machine_learning
#define machine_learning: \
I------------------------------------------\
I------------------------------------------\
I \
I /$$$$$$ /$$$$$$ \
I /$$__ $$|_ $$_/ \
I | $$ \ $$ | $$ \
I | $$$$$$$$ | $$ \
I | $$__ $$ | $$ \
I | $$ | $$ | $$ \
I | $$ | $$ /$$$$$$ \
I |__/ |__/|______/ \
I------------------------------------------\
I------------------------------------------I
# accurate cartoon depiction
https://xkcd.com/1838/
# introductory resource
http://neuralnetworksanddeeplearning.com
# "Machine Learning in C" (from scratch)
https://www.youtube.com/watch?v=PGSba51aRYU&list=PLpM-Dvs8t0VZPZKggcql-MmjaBdZKeDMw
# Machine Learning from scratch in Python
https://github.com/agvxov/neural_network_from_scratch
# music theme for the chapter
https://www.youtube.com/watch?v=j8wHeVbPQI8
" \
  Breakthroughs that lead to the Singularity could be reached by anyone. \
It could happen literally anywhere. If a few geniuses set up shop \
in Theodore Kaczynski’s former log cabin in Montana, \
the Singularity could begin there. \
" - Michell Heisman, Suicide Note, 1708.
• "But bro, when will i ever use matrix operations and calculus?\
Those are such a waste of time!"
Here. At the same time, in fact. Enjoy hell, stalker child.
• think about it this way:
    we are trying to construct a function based on given inputs and outputs;
    most of the time these input-output pairs are partial too,
    meaning there are inputs for which we do not know the output;
    mathematically speaking, we have no bloody clue what we are doing;
    if such a function already exists, we do not know about it;
    for this reason, we wish to deploy some method that approximates
    our desired function as closely as possible;
    we construct a model based on our known input-output pairs:
    we start with a random function and a way to measure how
    well it performs compared to the desired function
    (by comparing its outputs with the desired outputs),
    then, using derivatives, we find in which direction to tweak the
    values of our approximation function to get closer to our
    desired function (sketched below)
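    — a minimal Python sketch of the above (plain Python; the data points and the form of the
       guessed function are my own illustrative choices):
    {
      # known input-output pairs; the "desired function" behind them is unknown to us
      pairs = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

      # start with a guessed function f(x) = w*x + b and bad initial values
      w, b = 0.5, 0.0
      learning_rate = 0.01

      for epoch in range(2000):
          # measure how badly we do (mean squared error) and accumulate
          # the derivatives of that error with respect to w and b
          dw = db = 0.0
          for x, y in pairs:
              error = (w * x + b) - y
              dw += 2 * error * x / len(pairs)
              db += 2 * error     / len(pairs)
          # tweak the parameters in the direction that lowers the error
          w -= learning_rate * dw
          b -= learning_rate * db

      print(w, b)   # converges towards 2 and 1
    }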
NEURONS: NEURONS:
• roughly imitates biological (human) neurons
Perceptron: Perceptron:
https://www.youtube.com/watch?v=4Gac5I64LM4
        • also referred to as a "single-layer neural network"
○ components
1. Inputs
2. Weights
3. Bias
4. Threshold
5. Output
• the original virtual neuron
        • each input is binary; only the weights are fractional
        • every input is multiplied by its corresponding weight;
           the products are then summed;
           the bias is added;
           this sum is judged against a threshold value
        • any minor change in the weights will most likely result in major changes in the output
        • one could set up the weights by hand or brute-force them, but both are tedious
input-1 \
‾‾\ __
input-2 ----| D>---> output
/ ‾‾
input-3 /‾‾
                   /  if Σⱼxⱼwⱼ ≤ threshold then 0
           output {
                   \  if Σⱼxⱼwⱼ > threshold then 1
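        — a minimal Python sketch of a single perceptron; the weights and bias are hand-picked
           illustrative values that make it compute logical AND:
        {
          def perceptron(inputs, weights, bias, threshold=0):
              # weighted sum of the binary inputs, plus the bias
              s = sum(x * w for x, w in zip(inputs, weights)) + bias
              return 0 if s <= threshold else 1

          AND = lambda a, b: perceptron([a, b], weights=[1, 1], bias=-1.5)

          for a in (0, 1):
              for b in (0, 1):
                  print(a, b, '->', AND(a, b))   # 1 only when both inputs are 1
        }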
Logical_neuron: Logical_neuron:
https://en.wikipedia.org/wiki/Activation_function
○ components
1. Inputs
2. Weights
3. Bias
4. Activation function
5. Deactivation function
6. Output
        • with most activation functions, the output is a fraction between 0 and 1
I₁ ____
\ * W₁
‾‾‾‾‾\__ .───┬───.
* W₂ \│ │ '─.
I₂ --------------│ ∑ │ f[] >----
__/│ │ .─'
* W₃/ '───┴───'
I₃ ____/‾‾‾‾‾
— the original activation function is the Sigmoid function:
                  1
             ─────────────
              1 + exp(-x)
• a minor change in a weight only results in a minor change in the output
• the deciding property of an activation function's fitness is its shape
{
1.0 ├─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
│ ___...|
│ _..--``
│ ,' │
│ ,‾
│ - │
│ .
│ - │
│ _'
│ _-' │
│___...--""
0.0 ┼─────────────────┼─────────────────┴
. -5 0 5
}
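        — a minimal Python sketch of a single sigmoid neuron (input and weight values are my own);
           note how a tiny weight change now only nudges the output:
        {
          import math

          def sigmoid(x):
              # squashes any real number into the (0, 1) range
              return 1 / (1 + math.exp(-x))

          def neuron(inputs, weights, bias):
              s = sum(x * w for x, w in zip(inputs, weights)) + bias
              return sigmoid(s)

          print(neuron([0.5, 0.8], [0.40, -0.20], 0.1))   # some fraction between 0 and 1
          print(neuron([0.5, 0.8], [0.41, -0.20], 0.1))   # barely differs from the above
        }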
NEURAL_NETWORKS: NEURAL_NETWORKS:
• "NN"
• this new and revolutionary technology that will give us AGI
within 2 more weeks without a paradigm shift is from 1957
○ components:
1. Neurons
2. Architecture
           3. Loss function
           4. Learning algorithm
○ typical visual representation of the neural network
Physical architecture of single perceptron:
__
input ---| D>---> output
‾‾
Virtualized architecture of single perceptron:
_ _ _
{ }--->{ }--->{ }
‾ ‾ ‾
┃Input┃Hidden┃Output┃
┃layer┃layers┃ ┃
Feedforward Network
━━━━━━━━━━━━━━━━━━━━━▶
Dataflow
Architecture: Architecture:
          • the method by which neurons are logically ordered;
             in practice this means a web formed by piping
             the outputs of neurons into the inputs of others
Layers: Layers:
          • a layer is a group of neurons which do not communicate with each other,
             but do share their input and output neurons
          • the input layer is a virtual layer that corresponds to the input
          • the output layer is a virtual layer that corresponds to the output values
of the last physical layer
• a hidden layer is a layer between the input and the output layers
{
}
          • a network where each layer's outputs are fed as the next layer's inputs,
             but not elsewhere and always in one direction,
             is called a feedforward network
          • a non-feedforward network, where feedback loops are implemented,
             is called a recurrent network
• recurrent networks are more similar to the human brain than feedforward networks
• feedforward networks are easier to work with, so they enjoy the privilege of being
more researched
Network_of_Perceptron_neurons: Network_of_Perceptron_neurons:
XOR_problem: XOR_problem:
• cause of the first AI winter
          • it is said that a perceptron is unable to learn to calculate
             the logical operation exclusive or (XOR)
          • Σⱼxⱼwⱼ is a linear equation, meaning that in the plane it defines a line
• only linearly separable problems are solvable
{
1 │ x x
│
│
│
│
0 │ x x
┼──────────────
0 1
1 │ x x
│ A
│--..
│ ``--..
│ B ``--
0 │ x x
┼──────────────
0 1
1 │ x x 1 │ x '. x 1 │ x '. x
│'. │ '. true │ '. false
│ '. true │ '. │ '.
│ '. │ false '. │ true '.
│false '. │ '. │ '.
0 │ x '. x 0 │ x x 0 │ x x
┼────────────── ┼────────────── ┼──────────────
0 1 0 1 0 1
1 │ x/ x /
│? / ??? /
│ / /
│/ /
│ ?? / ??
0 │ x / x
┼──────────────
0 1
▲
X X │ X
│ X
X A │A
A │ AA X
◀────────┼────────▶
A A│ A X
X │
X │ X
│X
▼
▲ π
│ X
│ A X
│ X
│ A X
│ A X
│ AAA X
┼─────────────▶ R
│ A X
│A X
│
│ A X
│ X
│ X
▼ -π
}
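        — XOR stops being a problem once a second layer is allowed; a minimal Python sketch
           composing hand-picked perceptrons (my own weights) into XOR = AND(OR, NAND):
        {
          def step(inputs, weights, bias):
              # single perceptron: weighted sum, then threshold at 0
              s = sum(x * w for x, w in zip(inputs, weights)) + bias
              return 0 if s <= 0 else 1

          def XOR(a, b):
              # hidden layer: two perceptrons, each linearly separable on its own
              h_or   = step([a, b], [ 1,  1], -0.5)   # OR
              h_nand = step([a, b], [-1, -1],  1.5)   # NAND
              # output layer: AND of the two hidden outputs
              return step([h_or, h_nand], [1, 1], -1.5)

          for a in (0, 1):
              for b in (0, 1):
                  print(a, b, '->', XOR(a, b))   # 0 1 1 0
        }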
Training: Training:
• "learning"/"fitting"
        • the process of reassigning weights with the intent of gaining better outputs
        • overfitting is the phenomenon when a model has adapted to the learning data
           so well that it is unable to perform well on other data
• the more complex the model, the more probable overfitting is
Supervision: Supervision:
Supervised_learning:
• learning data is labeled
Unsupervised_learning:
• learning data is not labeled
• the model forms its own concepts in the form of clusters
• requires significantly more data for effective training
• the resulting models tend to be more creative
{more reliable on data which was not in the learning set;
creates better AI art}
Learning_rate: Learning_rate:
        • once the direction to converge toward has been calculated, the value of the learning
           rate indicates the amount of change that should take effect
        • the learning rate doesn't actually "know" how much to change,
           it's a(n educated) guess
        • the learning rate could cause the model to continuously overshoot the optimal
           values or to converge way too slowly (demonstrated in the sketch below)
— typical value interpretations:
            ...  - 0.01 : small; slow but stable convergence
            0.01 - 0.1  : the usual starting range
            0.1  - ...  : large; risks overshooting the optimum
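        — a minimal Python sketch demonstrating the point on a toy loss (w - 3)²
           (the learning rates and step count are my own illustrative picks):
        {
          def descend(learning_rate, steps=20):
              w = 0.0
              for _ in range(steps):
                  gradient = 2 * (w - 3)           # derivative of the loss (w - 3)^2
                  w -= learning_rate * gradient    # update scaled by the learning rate
              return w

          print(descend(0.001))   # ~0.12  -> converges way too slowly
          print(descend(0.1))     # ~2.97  -> reasonable
          print(descend(1.1))     # ~-112  -> every step overshoots further; diverges
        }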
optimizer:
           • an object or function which is in charge of dynamically changing the learning rate
• consults the loss
— common optimizers:
• "Stochastic Gradient Descent
• "ADAptive Moment estimation"
• "Nonlinear ADAM"
Weight_updating: Weight_updating:
        • traditionally weights are updated once every epoch
        • when weights are updated after each data point, that is called online learning;
           it's often the simplest approach when an epoch's worth of data cannot
           fit into memory at once
Random:
• the brute forcing of weights
• can work ok-ish on very small networks
• basically useless, mostly for demonstration purposes
or to serve as a baseline
Finite_difference:
             Δf(x) = f(x + b) − f(x + a)
           • approximates the derivative: f'(x) ≈ (f(x + ε) − f(x)) / ε for a small ε
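           — a minimal Python sketch of finite-difference weight updating on a one-weight model
              (the toy loss, ε and learning rate are my own illustrative picks):
           {
             def loss(w):
                 # toy loss: how far f(x) = w*x is from the target output 6 at input x = 2
                 return ((w * 2) - 6) ** 2

             w, eps, learning_rate = 0.0, 1e-4, 0.05
             for _ in range(100):
                 # (loss(w + eps) - loss(w)) / eps approximates the derivative at w
                 gradient = (loss(w + eps) - loss(w)) / eps
                 w -= learning_rate * gradient

             print(w)   # close to 3, where the loss is 0
           }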
Backpropagation:
https://towardsdatascience.com/understanding-backpropagation-algorithm-7bb3aa2f95fd
https://neptune.ai/blog/backpropagation-algorithm-in-neural-networks-guide
https://pyimagesearch.com/2021/05/06/backpropagation-from-scratch-with-python/
           • learning algorithm based on gradient descent and utilizing
the Leibniz chain rule
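           — a minimal Python sketch of the chain rule on a single sigmoid neuron
              (inputs, target and learning rate are my own illustrative picks):
           {
             import math

             def sigmoid(z):
                 return 1 / (1 + math.exp(-z))

             x, target = [0.5, -0.3], 1.0
             w, b, learning_rate = [0.1, 0.2], 0.0, 0.5

             for _ in range(1000):
                 # forward pass
                 z = sum(xi * wi for xi, wi in zip(x, w)) + b
                 a = sigmoid(z)
                 # backward pass, chain rule: dloss/dw = dloss/da * da/dz * dz/dw
                 dloss_da = 2 * (a - target)
                 da_dz    = a * (1 - a)               # derivative of the sigmoid
                 delta    = dloss_da * da_dz
                 w = [wi - learning_rate * delta * xi for wi, xi in zip(w, x)]
                 b -= learning_rate * delta

             print(sigmoid(sum(xi * wi for xi, wi in zip(x, w)) + b))   # climbs towards 1.0
           }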
Reinforcement:
        • used when arriving at the right conclusion requires a number of steps
        • the desired result is either unknown or very hard to create a dataset from
        • an agent monitors the environment and guesses the next best action to take
        — a reward policy (reward function) is used to determine when the agent performed well:
           • akin to loss calculation, but no gradient descent is necessary
           • when the policy only rewards the end result,
              it's possible that the agent never stumbles upon it even by pure chance,
              making it unable to optimize its solution
           • when the policy rewards small actions, it's possible that it will overfit
              to partial solutions, coming up with the most retarded of practices
           • it's hard to come up with a good policy for complex problems
• models are NOT portable between environments
• more hyperparameter sensitive than supervised learning
{
observation ┏━━━━━━━┓ action
┌───────────▶┃ Agent ┃────────┐
│ ┗━━━━━━━┛ │
│ ▲ │
│ │reward │
│ │ │
│ ┏━━━━━━━━━━━━━┓ │
└─────────┃ Environment ┃◀────┘
┗━━━━━━━━━━━━━┛
}
Fine_tuning:
• "transfer learning"
• common technique
• an already trained model being adapted to a more specific task
• being given a pretrained model and fine-tuning it is significantly
cheaper and faster than training from scratch
• the fine-tuning can be done on proprietary or obscure data
• full fine-tuning is fine-tuning that uses an identical process
to the initial training
        • partial fine-tuning is fine-tuning where only a select subset
           of the weights is updated, the rest are kept intact;
           usually the outer layers are updated and the intuition
           of the deep layers is reused
• additive fine-tuning is fine-tuning where new parameters are inserted;
sometimes entire layers are added; this helps the model retain its intelligence
similar to full fine-tuning, but is significantly cheaper
• prompt tuning involves preprocessing the user prompt;
the preprocessing is usually done by another,
significantly faster model that appends keywords,
examples to the desired output format, tone or bias
• RAG involves vector searching a document based on the user input
and further prompt tuning with this additional context;
traditionally not considered fine-tuning
Dataset: Dataset:
        • the data available during development time to train/test on
— the data set is usually split:
• training data; fed to the machine while it learns
           • testing data; allocated for testing after learning is finished;
              useful for finding out how well the model does on data that
              it has never seen before but which is of the same quality as the training data
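        — a minimal Python sketch of the split; the 80/20 ratio is only a common convention:
        {
          import random

          dataset = list(range(100))            # stand-in for labeled data points
          random.shuffle(dataset)               # avoid ordering bias before splitting

          cut = int(len(dataset) * 0.8)
          training_data = dataset[:cut]         # fed to the model while it learns
          testing_data  = dataset[cut:]         # only touched after learning is finished

          print(len(training_data), len(testing_data))   # 80 20
        }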
Augmentation:
• the process of generating more training data from the initial training data
• used for avoiding overfitting
— usually done by applying basic transformations to the dataset
• rotation
• zoom
• flipping
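           — a minimal numpy sketch (assumes numpy; the "image" is a stand-in array)
              of squeezing extra samples out of one image:
           {
             import numpy as np

             image = np.arange(9).reshape(3, 3)    # stand-in for a tiny grayscale image

             augmented = [
                 image,
                 np.fliplr(image),                 # horizontal flip
                 np.flipud(image),                 # vertical flip
                 np.rot90(image),                  # 90 degree rotation
             ]

             print(len(augmented), "samples from 1 original")
           }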
PCA: PCA:
• "Pricipal Component Analisys"
• in datasets, often times the same variable is encoded multiple times
• finding and removing redundancy in data
• "reducing dimension while perserving the variance present"
• in the context of NNs, it referes to optimizing the input
for training times
{ downsizing images to the edge of recognizability;
removing noise and color from images;
          stripping one of the height-in-cm / height-in-inches columns of a horse dataset when both are available
}
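        — a minimal numpy sketch (assumes numpy; the heights are made-up values) of PCA via SVD:
           the cm and inch columns are perfectly correlated, so a single principal component
           preserves (practically) all the variance:
        {
          import numpy as np

          height_cm = np.array([150.0, 160.0, 170.0, 180.0])
          data = np.column_stack([height_cm, height_cm / 2.54])   # same variable twice (cm, inch)

          centered = data - data.mean(axis=0)
          # SVD yields the principal directions; the singular values show
          # how much variance each direction carries
          U, S, Vt = np.linalg.svd(centered, full_matrices=False)
          print(S)                      # the second value is ~0: one dimension is redundant

          reduced = centered @ Vt[0]    # project onto the first principal component
          print(reduced.shape)          # (4,)  -- 2 columns reduced to 1
        }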
Batching: Batching:
        • packing the dataset into smaller collections
        • each batch is used independently to adjust weights
        • smaller memory footprint
        • the training data can be arbitrarily large and still processable
        • more frequent weight adjustments (might have a positive effect on
           model performance)
        • less accurate gradient estimation per adjustment (sketched below)
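        — a minimal Python sketch of slicing a dataset into batches:
        {
          def batches(dataset, batch_size):
              # yield one batch at a time so the whole dataset never has to sit in memory at once
              for i in range(0, len(dataset), batch_size):
                  yield dataset[i:i + batch_size]

          data = list(range(10))
          for batch in batches(data, batch_size=4):
              print(batch)   # [0..3], [4..7], [8, 9] -- each used for one weight adjustment
        }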
Tokenization: Tokenization:
• encoding for NNs
        • neural networks can only understand arrays of numbers,
           yet to be useful they need to be fed something more complex
        • any data to be fed to a NN must be encoded
        • in the case of average NNs, this means the data points are rescaled
           to fit the bounds of the activation function and
           flattened so that every data point has its own input neuron,
           in a 1-dimensional manner
{
┌───────┐
│ # # │
│/''--__│
│_---'''│
└───────┘
${GRAYSCALE_VALUE}
────────────────────
255
${ASCII_VALUE}
──────────────────
127
' ' (#32); ' ' (#32); ' ' (#32); '#' (#35); ' ' (#32); '#' (#35); ' ' (#32);
'/' (#47); ''' (#39); ''' (#39); '-' (#45); '-' (#45); '_' (#95); '_' (#95);
'_' (#95); '-' (#45); '-' (#45); '-' (#45); ''' (#39); ''' (#39); ''' (#39);
' ' (#0.25); ' ' (#0.25); ' ' (#0.25); '#' (#0.28); ' ' (#0.25); '#' (#0.28); ' ' (#0.25);
'/' (#0.37); ''' (#0.31); ''' (#0.31); '-' (#0.35); '-' (#0.35); '_' (#0.75); '_' (#0.75);
'_' (#0.75); '-' (#0.35); '-' (#0.35); '-' (#0.35); ''' (#0.31); ''' (#0.31); ''' (#0.31);
' ' (#0.25); ' ' (#0.25); ' ' (#0.25); '#' (#0.28); ' ' (#0.25); '#' (#0.28); ' ' (#0.25); '/' (#0.37); ''' (#0.31); ''' (#0.31); '-' (#0.35); '-' (#0.35); '_' (#0.75); '_' (#0.75); '_' (#0.75); '-' (#0.35); '-' (#0.35); '-' (#0.35); ''' (#0.31); ''' (#0.31); ''' (#0.31);
[0.25, 0.25, 0.25, 0.28, 0.25, 0.28, 0.25, 0.37, 0.31, 0.31, 0.35, 0.35, 0.75, 0.75, 0.75, 0.35, 0.35, 0.35, 0.31, 0.31, 0.31]
}
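        — a minimal Python sketch reproducing the encoding above: each ASCII character becomes
           its code point scaled into [0, 1], and the rows are flattened into one 1-dimensional list:
        {
          # the three rows of the ASCII image above (as in the character lists)
          art = [
              "   # # ",
              "/''--__",
              "_---'''",
          ]

          # every character becomes ASCII_VALUE / 127, flattened row by row
          tokens = [round(ord(c) / 127, 2) for row in art for c in row]
          print(tokens)   # [0.25, 0.25, 0.25, 0.28, 0.25, 0.28, 0.25, 0.37, 0.31, ...]
        }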
Natural_language: Natural_language:
• when tokenizing natural language, per character encoding is usually not the best idea
• tokenizing by words / word segments can yield better results and require smaller networks
        { > # Word tokenization in Tensorflow
          > from tensorflow.keras.preprocessing.text import Tokenizer
          > examples = ['Heyo world!', 'Goodbye cruel world']
          > t = Tokenizer()
          > t.fit_on_texts(examples)   # builds the word index in place
          > print(t.word_index)
{'world': 1, 'heyo': 2, 'goodbye': 3, 'cruel': 4}
}
Autoencoder:
• encodes/decodes
        • translates data to a more efficient representation, then attempts to reconstruct it
Encoding Decoding
_ _
|i|' '|o|
|n| ' _ ' |u|
|p| '|#|' |t|
|u| .|#|. |p|
|t| . ‾ . |u|
| |. A .|t|
‾ | ‾
dense representation
input ~ output
        • it teaches the AI to do its own PCA (see AT ?!; "principal component analysis")
• can be used to remove noise {damaged images} from input
• often used as a component for larger systems (?!)
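        — a minimal Keras sketch (assumes tensorflow; the 784/32 layer sizes are my own picks)
           of an autoencoder squeezing its input through a dense representation:
        {
          import tensorflow as tf
          from tensorflow.keras import layers, Model

          inputs  = layers.Input(shape=(784,))                     # e.g. a flattened 28x28 image
          code    = layers.Dense(32, activation='relu')(inputs)    # encoder: dense representation
          outputs = layers.Dense(784, activation='sigmoid')(code)  # decoder: attempted reconstruction

          autoencoder = Model(inputs, outputs)
          autoencoder.compile(optimizer='adam', loss='mse')        # loss: how far output ~ input
          # autoencoder.fit(x, x, ...)    # note: the input doubles as its own target
          autoencoder.summary()
        }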
FCN:
• "Fully Connected neural Network"
Convolution:
"?!/Image recognition/kernel"
• kernel operation
• ${N} dimensional (usually 2)
        • a CNN or "Convolutional Neural Network" contains at least one convolutional layer
        • retains spatial information
        • generally good at computer vision tasks
        • smaller kernels generally perform better
        • applying a stride results in an output of a different size
○ hyperparameters
• kernel size
• strides (kernel shift amount) (>1 further reduces the output size)
• activation
• padding (dummy border for the input to modify {preserve} output size)
sizeof(output) := sizeof(input) - sizeof(kernel) + 1
ₘ₋₁ ₘ₋₁
y₍ᵢ,ₕ₎ := ∑ ∑ f₍ₖ,ₗ₎ * x₍ᵢ₊ₖ,ₕ₊ₗ₎
ᵏ⁼⁰ ˡ⁼⁰
+--+--+
| 1| 2|
┌──────────+--+--+────────┐
│ | 2| 1| │
│ +--+--+ │
│ │
│ Kernel │
│ │
               #=====#--+--+         #==#--+--+
               I 1| 2I 2| 1|         I13I11| 6|
               I--+--I--+--+         #==#--+--+
               I 3| 2I 1| 0|         |11|11|11|
               #=====#--+--+         +--+--+--+
               | 1| 2| 3| 4|         |12|11|16|
               +--+--+--+--+         +--+--+--+
               | 3| 1| 1| 3|
               +--+--+--+--+
                   Input                Output
— max pooling:
           • simply outputs the max value inside the kernel window
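        — a minimal numpy sketch (assumes numpy) reproducing the worked example above,
           plus a 2x2 max pool over its output:
        {
          import numpy as np

          def convolve2d(x, kernel):
              m = kernel.shape[0]
              out = np.zeros((x.shape[0] - m + 1, x.shape[1] - m + 1))
              for i in range(out.shape[0]):
                  for h in range(out.shape[1]):
                      # y[i,h] = sum over k,l of kernel[k,l] * x[i+k, h+l]
                      out[i, h] = np.sum(kernel * x[i:i+m, h:h+m])
              return out

          x = np.array([[1, 2, 2, 1],
                        [3, 2, 1, 0],
                        [1, 2, 3, 4],
                        [3, 1, 1, 3]])
          kernel = np.array([[1, 2],
                             [2, 1]])

          y = convolve2d(x, kernel)
          print(y)         # rows: 13 11 6 / 11 11 11 / 12 11 16

          # max pooling: output the maximum value inside each (here 2x2, stride 1) window
          pooled = np.array([[y[i:i+2, h:h+2].max() for h in range(y.shape[1] - 1)]
                             for i in range(y.shape[0] - 1)])
          print(pooled)    # rows: 13 11 / 12 16
        }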
RNN:
• "Recurrent Neural Network"
        • neurons are laid out in a self-feeding architecture
• the information flow is recursive
• traditionally used to solve sequence-to-sequence (seq2seq) problems {translation}
Transformers:
https://www.youtube.com/watch?v=iDulhoQ2pro
arXiv:1706.03762
• modified feedforward networks
        • have the advantages of RNNs
• unlike RNNs they can be easily parallelized on a large scale
Multihead_attention:
┌────┴────┐
│ Concat │
└─────────┘
▲
│
┌┼┐
│││
┌───────││┴──────────┐
┌────────│┴──────────┐│
┌─────────┴──────────┐││
│ Scaled Dot-Product ││┘
│ Attention │┘
└─┬───────┬────────┬─┘
┌───┘││ │││ ││└──┐
│┌───┘│ │││ │└──┐│
││┌───┘ │││ └──┐││
┌──││┴───┐ ┌──││┴───┐ ┌──││┴───┐
┌───│┴───┐│ ┌───│┴───┐│ ┌───│┴───┐│
┌────┴───┐│┘┌────┴───┐│┘┌────┴───┐│┘
│ Linear │┘ │ Linear │┘ │ Linear │┘
└────────┘ └────────┘ └────────┘
▲ ▲ ▲
│ │ │
Query Key Value
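        — a minimal numpy sketch (assumes numpy; the token counts and dimension are my own picks)
           of the Scaled Dot-Product Attention block in the figure, softmax(QKᵀ/√dₖ)V;
           each head of multi-head attention first pushes Q, K and V through its own Linear layer:
        {
          import numpy as np

          def softmax(x):
              e = np.exp(x - x.max(axis=-1, keepdims=True))   # numerically stable softmax
              return e / e.sum(axis=-1, keepdims=True)

          def scaled_dot_product_attention(Q, K, V):
              d_k = K.shape[-1]
              scores = Q @ K.T / np.sqrt(d_k)    # how much each query attends to each key
              return softmax(scores) @ V         # weighted mix of the values

          rng = np.random.default_rng(0)
          Q = rng.normal(size=(4, 8))            # 4 query tokens of dimension 8
          K = rng.normal(size=(6, 8))            # 6 key tokens
          V = rng.normal(size=(6, 8))            # one value per key

          print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
        }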
Architecture:
Output Probabilities
▲
│
┌─────────┐
│ Softmax │
└─────────┘
▲
│
┌────────┐
│ Linear │
└────────┘
▲
├─┐
├┐│
┌──────││┼──────────┐
┌───────│┼──────────┐│
┌────────┼──────────┐││
│ ┌──────┴──────┐ │││
│ │ Add & Norm │<┐ │││
│ └──────┬──────┘ │ │││
┌────────────────┐ │ ┌──────┴──────┐ │ │││
┌────────────────┐│ │ │ Feed │ │ │││
┌────────────────┐││ │ │ Forward │ │ │││
│││ │││ │ └─────────────┘ │ │││
┌──────││┼──────────┐ │││ │ ▲ │ │││
┌───────│┼──────────┐│ │││ │ ├────────┘ │││
┌────────┼──────────┐││ │││ │ ┌─────────────┐ │││
│ ┌──────┴──────┐ │││ │││ │ │ Add & Norm │<┐ │││
│ │ Add & Norm │<┐ │││ │││ │ └──────┬──────┘ │ │││
│ └──────┬──────┘ │ │││ │││ │ ┌──────┴──────┐ │ │││
│ ┌──────┴──────┐ │ │││ │││ │ │ Masked │ │ │││
│ │ Feed │ │ │││ │││ │ │ Multi-Head │ │ │││
│ │ Forward │ │ │││ │││ │ │ Attention │ │ │││
│ └─────────────┘ │ │││ ││└───│ └─────────────┘ │ │││
│ ▲ │ │││ │└────│ ▲ ▲ ▲ │ │││
│ ├────────┘ │││ └─────┼───┴────┘ ├───┘ │││
│ ┌─────────────┐ │││ │ ┌─────────────┐ │││
│ │ Add & Norm │<┐ │││ │ │ Add & Norm │<┐ │││
│ └──────┬──────┘ │ │││ │ └──────┬──────┘ │ │││
│ ┌──────┴──────┐ │ │││ │ ┌──────┴──────┐ │ │││
│ │ │ │ │││ │ │ Masked │ │ │││
│ │ Multi-Head │ │ │││ │ │ Multi-Head │ │ │││
│ │ Attention │ │ │││ │ │ Attention │ │ │││
│ └─────────────┘ │ │││ │ └─────────────┘ │ │││
│ ▲ ▲ ▲ │ │││ │ ▲ ▲ ▲ │ │││
│ └────┼────┘ │ ││┘ │ └────┼────┘ │ ││┘
│ ├────────┘ │┘ │ ├────────┘ │┘
└────────┼──────────┘ └────────┼──────────┘
▲ ▲
│ │
/‾\ /‾\
| + | Positonal - Encodings | + |
\_/ \_/
▲ ▲
│ │
┌───────────┐ ┌───────────┐
│ Input │ │ output │
│ Embedding │ │ Embedding │
└───────────┘ └───────────┘
▲ ▲
│ │
Inputs Outputs shifted right
LLM:
https://thqihve5.bearblog.dev/the-modern-library-of-babel/
• "Large Language Models"
— the context window is the largest input a model can take;
since they have no other "mental" storage, this is practically their memory span;
measured in tokens
Hyperparameter_optimization:
        • a hyperparameter is a configurable setting of the model that is
           not learned during training {architecture; activation function}
        • the problem with hyperparameter optimization in the field of AI is that
           we have no mathematical way of knowing how different hyperparameters will
           perform, except for eval-ing them of course, but that's expensively expensive
        • educated guessing while eval-ing a few different setups is the best we can do
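        — a minimal Python sketch of that educated guessing as a grid search
           (the hyperparameter values and the evaluate() stand-in are made up):
        {
          import itertools

          def evaluate(learning_rate, hidden_neurons):
              # stand-in for a (very expensive) train-and-eval run returning a validation loss
              return abs(learning_rate - 0.01) + abs(hidden_neurons - 64) / 100

          grid = {
              'learning_rate':  [0.001, 0.01, 0.1],
              'hidden_neurons': [16, 64, 256],
          }

          best = min(itertools.product(*grid.values()),
                     key=lambda setup: evaluate(*setup))
          print(best)   # (0.01, 64) -- the setup with the lowest validation loss
        }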
Coding_with_AI:
— vibe coding:
• the act of coding with an LLM and forgetting about the existence of the code
• meant to convey a sense of relief and programming without obstacles
• the term is often used by otherwise tech-illiterate people
• one-way-ticket to introduce security vulnerabilities into your code
which have been otherwise dead since the late 2000's
• the coining of the term marks the official beginning of the competency crisis
        • for high-purity, stable environments, LLMs excel at fetching documentation
• can be used as a linter
• can be used for spotting obvious bugs which the developer misses
due to hyper-fixation on other parts { calling "chroot" instead of "chdir" }
        • the smaller the language, the more confident you can be in
           the quality of the AI's suggestions
— incredible as a (very expensive) transpiler:
• you provide pseudo code
• you can allow the AI to implement smaller functions based on its signature
• works especially well for scripting
           • the code will come out as high quality as if it had been hand-written,
              but the annoyances of scripting are skipped
• all typos will be fixed
• all libraries are resolved
• all library function names are guessed based on the description
• all argument orders are corrected
Tips:
• do not ask the model to omit comments;
even tho its extensive usage of comments is annoying,
it helps with its accuracy
• indentation and proper usage of newlines is enough
for the model to judge the span of code blocks
{
complete the following pseudo code:
perl script
strict mode
function usage
print name "<file>"
die
if no arguments
print usage
end if
lines = open argv[1]
if error usage
while last line matches regex "#define .*"
delete last line
print lines
#!/usr/bin/perl
use strict;
use warnings;
sub usage {
print "$0 <file>\n";
die;
}
@ARGV == 1 or usage();
open my $fh, '<', $ARGV[0] or usage();
my @lines = <$fh>;
close $fh;
while (@lines && $lines[-1] =~ /#define .*/) {
pop @lines;
}
print @lines;
}