INDEX
Explanations
expressions related to intellectualism and critique of societal norms
Negative Logits
riot      -0.15
Guy       -0.14
ewolf     -0.14
milf      -0.14
811       -0.14
alth      -0.14
izard     -0.14
Cougar    -0.13
ordo      -0.13
ehir      -0.13
Positive Logits
types     0.26
-types    0.22
types     0.22
flakes    0.21
rub       0.21
ecc       0.21
provinc   0.21
Types     0.21
mis       0.20
dol       0.20
Activations Density 0.371%
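
A density figure like the one above is commonly read as the share of sampled token positions on which the feature's activation is nonzero. The sketch below assumes that reading; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical illustrations, not values or code from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the fraction of positions where the feature fires.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    # Count positions where the feature is considered active.
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: 1,000 token positions, 4 of which fire.
sample = [0.0] * 996 + [0.8, 1.3, 0.2, 2.1]
print(f"{activation_density(sample):.3f}%")  # prints 0.400%
```

Under this reading, a density of 0.371% would mean the feature fires on roughly 37 of every 10,000 sampled token positions.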