INDEX
Explanations
instances of questioning societal norms and behaviors
Negative Logits
agr         -0.15
athe        -0.15
iker        -0.15
ansa        -0.14
ÑĨÑİ        -0.14
aki         -0.14
sing        -0.14
phin        -0.14
Crunch      -0.13
aga         -0.13
Positive Logits
fart        0.16
dikke       0.16
Mob         0.15
Vance       0.15
oton        0.15
Wars        0.15
Liberation  0.14
beyond      0.14
yaw         0.14
-append     0.13
Activations Density: 0.000%