INDEX
Explanations
words related to political events and actions
Negative Logits
imperfect     -0.72
americ        -0.69
Reviewer      -0.68
eccentric     -0.68
felon         -0.67
neutrality    -0.67
annoyance     -0.66
surn          -0.66
spoilers      -0.66
impulse       -0.66
Positive Logits
ighed         1.39
uates         1.23
ating         1.15
uated         1.15
ved           1.15
istered       1.14
uating        1.14
ated          1.14
icating       1.14
ussed         1.14
Activations Density: 0.250%