Explanations
phrases indicating strong opinions or beliefs
Negative Logits
Token       Logit
olis        -0.84
UES         -0.74
uctions     -0.74
oute        -0.73
iens        -0.72
ischer      -0.71
ULTS        -0.68
olen        -0.67
ahime       -0.67
ãĥĩãĤ£      -0.66
Positive Logits
Token       Logit
whatsoever  1.03
respecting  0.84
how         0.81
llor        0.78
whatever    0.78
whether     0.77
theless     0.74
¬¼          0.69
é¾įå¥ij士     0.68
ileged      0.68
Activations Density: 0.009%
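The page does not say how the density figure is computed. A minimal sketch, assuming it is the percentage of sampled tokens on which the feature's activation is above zero; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and chosen only for illustration:

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    feature fires at all. The dashboard may define it differently.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: the feature fires on 9 of 100,000 tokens.
sample = [0.0] * 99_991 + [1.5] * 9
print(f"{activation_density(sample):.3f}%")  # prints 0.009%
```

Under this reading, a density of 0.009% means the feature is active on roughly 9 tokens in every 100,000.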