INDEX
Explanations
phrases related to accusations or blame
phrases indicating accusations or claims related to individuals or entities
Negative Logits
Token        Logit
arton        -0.79
cies         -0.76
rones        -0.75
aic          -0.72
rieve        -0.71
wayne        -0.69
eport        -0.67
mentioned    -0.67
hen          -0.66
tti          -0.66
Positive Logits
Token         Logit
conspiring    1.37
violating     1.35
abusing       1.30
stealing      1.27
committing    1.27
favoring      1.25
waging        1.23
behaving      1.22
abandoning    1.22
raping        1.21
Activations Density: 0.060%