Explanations
verbs or phrases indicating potential threats or risks
phrases that describe potential threats
Negative Logits
tery     -0.70
lex      -0.69
ocket    -0.68
Zup      -0.67
à¼       -0.62
ILCS     -0.61
@#&      -0.60
Nanto    -0.60
ergy     -0.60
Ago      -0.59
Positive Logits
idon     1.16
posed    0.89
atoon    0.83
poses    0.82
pose     0.81
hazards  0.76
dangers  0.72
Danger   0.72
hran     0.71
vulner   0.71
Activations Density: 0.019%
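
For readers unfamiliar with the density figure, the sketch below shows one common way such a number can be computed: the percentage of sampled tokens on which the feature fires at all. This is an assumption about the metric, not a statement of how this page derived its value. The function name activation_density, the threshold parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of activations strictly above `threshold`.

    Assumption: "density" means the share of tokens on which the
    feature is active; the page itself does not define the metric.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: the feature fires on 2 of 10,000 tokens.
sample = [0.0] * 9998 + [1.3, 0.4]
print(f"{activation_density(sample):.3f}%")
```

Under this reading, a density of 0.019% would mean the feature is active on roughly 19 of every 100,000 tokens, which fits a feature tied to a narrow set of risk and threat phrases.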