    Explanations

    expressions of simplicity versus complexity

    Negative Logits
    Token          Logit
    "rend"         -0.20
    "anik"         -0.15
    "ilm"          -0.14
    "uja"          -0.14
    "lef"          -0.14
    " Alphabet"    -0.14
    "oden"         -0.14
    "opot"         -0.13
    "Ĥæķ°"         -0.13
    "airo"         -0.13
    Positive Logits
    Token            Logit
    " simple"        0.33
    " Simple"        0.31
    "simple"         0.30
    "-simple"        0.28
    " simplicity"    0.28
    "ç®Ģåįķ"         0.27
    " simples"       0.27
    " SIMPLE"        0.27
    "Simple"         0.27
    "_simple"        0.23
    Act Density 0.155%

    No Known Activations