INDEX
Explanations
words related to physical harm or damage
terms related to death and mortality
Negative Logits
XY              -0.74
FFFF            -0.73
HCR             -0.67
ERG             -0.66
advertisement   -0.65
herty           -0.63
AUD             -0.63
XXXX            -0.62
Potential       -0.62
Specific        -0.61
Positive Logits
mort            1.37
uary            1.02
ally            0.84
gue             0.81
veter           0.80
srfAttach       0.79
surpr           0.78
dism            0.77
osate           0.76
embr            0.76
Activations Density: 0.010%