Explanations
terms related to physical harm or destruction
references to physical or structural damage
Negative Logits
KER      -0.68
ulated   -0.67
arte     -0.67
rams     -0.65
zee      -0.64
Kong     -0.63
ulates   -0.62
ILLE     -0.61
ramid    -0.61
Helpful  -0.60
Positive Logits
damage      1.21
inflicted   1.05
damage      0.97
mitigation  0.89
damaged     0.86
damages     0.81
Damage      0.78
horm        0.77
Damage      0.77
undermin    0.77
Activations Density: 0.012%