INDEX
Explanations
nouns and phrases that denote positive attributes or endorsements
Negative Logits
proto         -0.16
Jeh           -0.15
Proto         -0.15
.gb           -0.14
Pike          -0.14
prot          -0.14
blink         -0.14
oucher        -0.14
peat          -0.14
enger         -0.14
Positive Logits
assen         0.14
dorf          0.14
lateral       0.14
abcdefgh      0.13
china         0.13
inar          0.13
اس            0.13
yst           0.13
heterosexual  0.13
↵↵            0.13
Activations Density: 0.075%