Explanations
accusatory statements or allegations
phrases indicating accusations
NEGATIVE LOGITS
Mehran      -0.77
Score       -0.74
ocket       -0.68
alde        -0.67
Tokens      -0.65
Zone        -0.64
dayName     -0.64
edin        -0.63
aths        -0.63
oor         -0.62
POSITIVE LOGITS
conspiring  1.10
violating   1.06
being       1.04
committing  0.97
having      0.96
hypocrisy   0.96
wrongdoing  0.95
abusing     0.94
misconduct  0.93
stealing    0.91
Activations Density: 0.053%