Explanations
phrases that indicate a negative outcome or denial
Negative Logits
ray       -0.16
rape      -0.16
lng       -0.16
erville   -0.15
pga       -0.15
imonial   -0.14
ness      -0.14
packing   -0.14
like      -0.14
rech      -0.14
Positive Logits
oks       0.35
sey       0.33
okie      0.33
xious     0.31
ok        0.30
veau      0.30
thin      0.28
thern     0.28
ther      0.28
seg       0.27
Activations Density: 0.039%