INDEX
Explanations
security and safety-related instructions or warnings
imperative statements or advice directing actions
Negative Logits
ELD        -0.74
atz        -0.64
Said       -0.62
ById       -0.62
lishes     -0.61
Winner     -0.58
crashed    -0.55
fred       -0.55
lime       -0.55
ammed      -0.55
Positive Logits
beware      1.40
consult     1.26
avoid       1.16
carefully   1.15
ALWAYS      1.13
acquaint    1.12
consider    1.09
heed        1.08
refrain     1.08
avoid       1.08
Activations Density 0.215%
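
The page does not say how the density figure is computed. A minimal sketch, assuming it is the fraction of sampled token positions on which the feature has a nonzero activation, might look like the following. The function name `activation_density`, the `threshold` parameter, and the toy data are all hypothetical and chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature fires at all; the dashboard itself does not define the term.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data (not from the dashboard): 20,000 token positions,
# with the feature firing on 43 of them.
toy = [0.0] * 20000
for i in range(43):
    toy[i * 400] = 1.0

density = activation_density(toy)
print(f"{density:.3%}")  # prints 0.215%
```

Under this reading, a density of 0.215% means the feature is active on roughly 2 of every 1,000 tokens, which fits a feature that responds to a narrow class of text such as warnings and imperative advice.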