Explanations
phrases emphasizing a strong opinion or negation
emphatic negations and strong disclaimers
Negative Logits
lahoma   -0.73
apters   -0.67
rity     -0.67
urer     -0.65
orio     -0.65
iary     -0.65
liest    -0.65
auga     -0.65
urers    -0.65
ariat    -0.65
Positive Logits
LY       1.57
ALLY     1.45
THING    1.45
ELY      1.42
ONE      1.41
HO       1.37
OSE      1.37
LESS     1.36
NESS     1.36
THERE    1.36
Activations Density: 0.158%