    Explanations

    words related to the concept of "safety" or "security."

    Negative Logits
    erah        -0.15
    slack       -0.15
    atings      -0.15
    aneously    -0.15
    REA         -0.15
    atee        -0.15
    spy         -0.15
     refl       -0.14
    askell      -0.14
    setattr     -0.14
    Positive Logits
    osten       0.28
    ott         0.27
    vil         0.27
    osp         0.26
    ottom       0.26
    vol         0.26
    otto        0.25
    periment    0.25
    ulla        0.25
    ulle        0.25
    Activation Density: 0.006%

    No Known Activations