INDEX
    Explanations

    instances where something is being done or needs to be done

    Negative Logits
    Token            Logit
    "ipel"           -0.70
    "aten"           -0.66
    " Arri"          -0.65
    " Torn"          -0.64
    " Sect"          -0.63
    "ioxide"         -0.61
    "illi"           -0.61
    " corridors"     -0.60
    " sshd"          -0.60
    " passages"      -0.60
    Positive Logits
    Token            Logit
    " wrong"         0.94
    " differently"   0.94
    " else"          0.92
    " proactive"     0.89
    " rash"          0.88
    "wrong"          0.85
    "else"           0.81
    " naughty"       0.81
    " drastic"       0.80
    " unethical"     0.80
    Activation Density: 0.052%

    No Known Activations