INDEX
Explanations
negative phrases or terms related to rejection or disapproval
Negative Logits
pga         -0.18
bee         -0.17
ko          -0.17
hev         -0.17
rott        -0.17
ropriate    -0.15
nya         -0.14
rape        -0.14
ritt        -0.14
ritten      -0.14
Positive Logits
sey         0.29
veau        0.28
okie        0.26
seg         0.25
ont         0.24
xious       0.23
odge        0.23
isi         0.23
holds       0.23
oks         0.22
Activations Density: 0.034%