INDEX
    Explanations

    words related to morality and ethical behavior

    Negative Logits
    ierrez
    -0.65
    izoph
    -0.60
     coasts
    -0.60
    ritch
    -0.59
    rooms
    -0.57
     awhile
    -0.56
     4090
    -0.56
     cooldown
    -0.55
    redo
    -0.54
     Estimated
    -0.54
    Positive Logits
    iak
    0.94
    stru
    0.89
    line
    0.85
    bol
    0.85
    lines
    0.84
    ko
    0.83
    ais
    0.83
    ī
    0.80
    SHIP
    0.79
    opter
    0.78
    Act Density 0.029%

    No Known Activations