INDEX
Explanations
mentions of physical damage or harm in written text
references to damage or harm
Negative Logits
zsche     -0.78
zee       -0.71
rams      -0.70
ramid     -0.66
atorial   -0.65
Bars      -0.64
liner     -0.63
Pitch     -0.62
liner     -0.60
gent      -0.60
Positive Logits
inflicted    1.14
damage       1.01
mitigation   0.97
wrought      0.95
damage       0.87
havoc        0.81
damaged      0.79
incurred     0.78
damages      0.76
horm         0.75
Activations Density 0.020%