Explanations
phrases involving the word "of"
Negative Logits
Mehran     -0.77
Score      -0.76
alde       -0.72
Zone       -0.67
edin       -0.66
ocket      -0.65
reddits    -0.64
oor        -0.64
adjusts    -0.64
Ange       -0.64
Positive Logits
hypocrisy       1.10
conspiring      1.09
violating       1.07
being           1.06
misrepresent    1.01
neglect         0.95
committing      0.94
having          0.93
wrongdoing      0.93
misconduct      0.92
Activations Density: 0.029%