INDEX
Explanations
phrases indicating positive attributes or affirmations
Negative Logits
oplan        -0.17
uggy         -0.15
sez          -0.14
pei          -0.14
anch         -0.14
subtitle     -0.14
anchor       -0.14
elop         -0.14
uye          -0.13
ropolitan    -0.13
Positive Logits
virtually    0.16
hol          0.16
LEGRO        0.15
skoro        0.15
Physical     0.15
almost       0.15
nearly       0.15
oric         0.15
esteem       0.15
ĵåIJį         0.15
Activations Density: 0.021%