INDEX
Explanations
words indicating positive outcomes or advantages
Negative Logits
Perkins   -0.15
khỏi      -0.15
adoras    -0.14
isas      -0.14
asaki     -0.14
ewise     -0.14
oppel     -0.14
zin       -0.14
esor      -0.14
styled    -0.14
Positive Logits
rees      0.16
748       0.16
re        0.15
olen      0.14
olland    0.14
Campos    0.14
weise     0.14
strain    0.14
dlg       0.14
chema     0.13
Activations Density: 0.012%