INDEX
Explanations
phrases relating to social norms and values
Negative Logits
OLT      -0.17
ops      -0.15
egra     -0.15
rows     -0.15
APH      -0.15
asti     -0.15
.rows    -0.14
enda     -0.14
orr      -0.14
forks    -0.14
Positive Logits
ochen    0.17
yük      0.15
iggs     0.15
adle     0.15
ç¼       0.14
Stuff    0.14
etc      0.14
quot     0.14
nas      0.14
Rails    0.14
Activations Density: 0.318%