INDEX
Explanations
references to the concept of free speech
Negative Logits
Token       Logit
stro        -0.17
ivet        -0.16
stag        -0.15
errupted    -0.15
drv         -0.15
lights      -0.15
urally      -0.15
æĺĩ         -0.15
çͲ          -0.14
riad        -0.14
Positive Logits
Token       Logit
-wheel      0.27
fall        0.25
edom        0.25
speech      0.24
boot        0.24
floating    0.24
enterprise  0.23
-enter      0.22
-market     0.22
hold        0.22
Activations Density: 0.025%
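
For readers unfamiliar with the density figure, here is a minimal sketch of how such a number could be computed. It assumes that "activation density" means the fraction of token positions on which the feature's activation is above zero; the page itself does not state its definition. The function name activation_density, the threshold parameter, and the toy data are hypothetical and chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: density = (number of active tokens) / (total tokens).
    """
    if not activations:
        raise ValueError("activations must contain at least one value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: the feature fires on 1 of 4000 token positions.
toy_activations = [0.0] * 3999 + [1.3]
density = activation_density(toy_activations)
print(f"{density:.3%}")  # prints 0.025%
```

Under this assumed definition, a density of 0.025% would correspond to the feature firing on roughly one token in every 4000.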