INDEX
    Explanations

    sentences that express harm or potential danger

    Head Attr Weights
    0:0.07
    1:0.03
    2:0.07
    3:0.03
    4:0.06
    5:0.04
    6:0.16
    7:0.05
    8:0.09
    9:0.28
    10:0.03
    11:0.04
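
    The head attribution weights listed above can be ranked to see which heads dominate. The following is a minimal sketch, not part of the original dashboard: the dictionary simply copies the twelve values shown, and the helper name `rank_heads` is hypothetical.

    ```python
    # Head attribution weights copied from the listing above (head index -> weight).
    head_weights = {
        0: 0.07, 1: 0.03, 2: 0.07, 3: 0.03, 4: 0.06, 5: 0.04,
        6: 0.16, 7: 0.05, 8: 0.09, 9: 0.28, 10: 0.03, 11: 0.04,
    }

    def rank_heads(weights):
        """Return (head, weight) pairs sorted by descending weight (hypothetical helper)."""
        return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)

    ranked = rank_heads(head_weights)
    top_head, top_weight = ranked[0]
    total = sum(head_weights.values())

    print(f"largest weight: head {top_head} ({top_weight:.2f})")
    print(f"sum of listed weights: {total:.2f}")
    ```

    With the values as listed, head 9 carries the largest weight and head 6 the second largest. The listed weights sum to 0.95 rather than 1.00, presumably because each value is rounded to two decimals.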
    Negative Logits
    Token         Logit
    " Antar"      -3.82
    " Laos"       -3.73
    " Ser"        -3.61
    "Magikarp"    -3.58
    " Planes"     -3.55
    " Mickey"     -3.54
    "Nar"         -3.48
    " nar"        -3.43
    " Marine"     -3.43
    " Stones"     -3.42
    Positive Logits
    Token         Logit
    " Rebecca"    8.65
    "becca"       6.04
    " Reb"        4.60
    "beck"        3.68
    "abe"         3.66
    "Reb"         3.66
    "bern"        3.53
    "resp"        3.52
    "eva"         3.45
    "Rev"         3.37
    Activation Density: 0.001%

    No Known Activations