Explanations
phrases discussing the concept of free speech and its implications
Negative Logits
hangi     -0.16
rid       -0.14
AtPath    -0.14
elig      -0.14
其中      -0.13
osto      -0.13
че        -0.13
ometr     -0.13
ewise     -0.13
ruz       -0.13
Positive Logits
basically    0.24
thus         0.24
therefore    0.24
essentially  0.23
donc         0.20
overall      0.19
因此         0.18
böylece      0.18
thus         0.17
Therefore    0.17
Activations Density: 0.321%