INDEX
    Explanations

    concepts related to societal roles and evaluations

    Negative Logits
    Token         Logit
    "instead"     -0.19
    "etc"         -0.19
    " instead"    -0.17
    "çŃī"         -0.17
    "Instead"     -0.16
    "undan"       -0.16
    "fak"         -0.16
    " Instead"    -0.15
    "kker"        -0.15
    "(or"         -0.15
    Positive Logits
    Token         Logit
    " AND"        0.48
    " lẫn"        0.45
    "AND"         0.34
    " että"       0.27
    " as"         0.27
    " nor"        0.26
    "_AND"        0.25
    " plus"       0.25
    " PLUS"       0.23
    " _"          0.23
    Activation Density: 0.117%

    No Known Activations