INDEX
Explanations
indicative statements criticizing societal structures or norms
Negative Logits
wing	-0.16
iddy	-0.15
labs	-0.15
ose	-0.14
رات	-0.14
ếp	-0.14
ign	-0.14
monton	-0.14
stru	-0.14
lak	-0.14
Positive Logits
instead	0.69
instead	0.63
Instead	0.61
Instead	0.59
вмест	0.47
Nope	0.32
inve	0.28
místo	0.27
แทน	0.25
sondern	0.23
Activations Density 0.207%