Explanations
damage-related terms or phrases
references to physical damage or harm
Negative Logits
Token       Logit
zsche       -0.87
zee         -0.76
rams        -0.70
uesday      -0.66
zbek        -0.64
KER         -0.62
Reloaded    -0.62
liner       -0.62
anamo       -0.62
unin        -0.61
Positive Logits
Token       Logit
inflicted   1.13
mitigation  0.95
wrought     0.95
damage      0.94
damage      0.82
damaged     0.79
incurred    0.76
caused      0.76
sustained   0.76
havoc       0.75
Activations Density: 0.031%
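
As a rough illustration of how a density figure like the one above can be read, the sketch below treats activation density as the fraction of token positions on which the feature fires (activation above zero). That definition is an assumption, not something this page states, and the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical. The sample is made up so that it lands on the same percentage.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature is active; this page does not spell out its exact definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 100,000 token positions, 31 of them active.
sample = [0.0] * 100_000
for i in range(31):
    sample[i * 3_000] = 1.5  # arbitrary positive activation value

density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.031% means the feature is active on roughly 3 out of every 10,000 tokens, which fits a narrow feature tied to damage-related wording.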