INDEX
Explanations
phrases related to accusations
repeated instances of accusations against entities or individuals
Negative Logits
ower         -0.76
partName     -0.75
alde         -0.70
aware        -0.66
owers        -0.65
reddits      -0.65
adjusts      -0.64
Mehran       -0.64
threshold    -0.63
fades        -0.63
Positive Logits
disl          0.75
brutality     0.74
murdering     0.74
gou           0.72
wrongdoing    0.72
treason       0.72
conspiring    0.70
committing    0.70
racism        0.68
witchcraft    0.68
Activations Density 0.052%