Explanations
phrases that illustrate contrasts between positive and negative concepts
Negative Logits
Token         Logit
azu           -0.18
cab           -0.15
-describedby  -0.15
771           -0.14
ymbols        -0.14
PFN           -0.14
hoo           -0.13
xdb           -0.13
cab           -0.13
taxi          -0.13
Positive Logits
Token         Logit
ieux          0.16
'gc           0.16
æŃ            0.15
Cunning       0.14
©             0.13
WP            0.13
-sur          0.13
æĿī           0.13
odzi          0.13
itre          0.13
Activations Density: 0.074%