INDEX
Explanations
phrases related to serious or harmful actions
terms related to severe harm or injury
Negative Logits
XL         -0.72
Elves      -0.72
fix        -0.71
den        -0.71
wallet     -0.68
girl       -0.67
Diver      -0.66
gamer      -0.66
cloth      -0.65
starter    -0.65
Positive Logits
ous        1.20
ously      1.19
ising      1.18
icates     1.09
ues        1.08
ized       1.08
izations   1.07
icable     1.06
izing      1.06
istic      1.05
Activations Density: 0.046%