INDEX
Explanations
expressions of criticism towards societal norms and behaviors
Negative Logits
бы         -0.15
ycz        -0.15
منه        -0.15
uche       -0.15
ilar       -0.14
داد        -0.14
ongyang    -0.14
ooo        -0.14
canh       -0.14
μοί        -0.13
Positive Logits
ones       0.16
inson      0.15
olk        0.15
erus       0.14
eria       0.14
ンズ       0.14
TA         0.13
Lup        0.13
Lob        0.13
ones       0.13
Activations Density 0.519%