    Negative Logits
    " co"           -0.66
    " mor"          -0.61
    "ter"           -0.61
    " por"          -0.58
    " tre"          -0.57
    " ro"           -0.57
    "er"            -0.56
    " B"            -0.56
    " b"            -0.56
    " z"            -0.55

    Positive Logits
    " Monfieur"     1.29
    " Efq"          1.28
    " ainfi"        1.26
    " myſelf"       1.25
    "<bos>"         1.16
    " vectorielle"  1.12
    " pleaſure"     1.12
    " itſelf"       1.10
    " feroit"       1.10
    " themſelves"   1.10
    Activation Density: 0.226%

    No Known Activations