Explanations
expressions related to societal critiques and personal accountability
Negative Logits
agen      -0.17
urette    -0.15
Ĭ         -0.15
Yep       -0.15
.metro    -0.15
imore     -0.14
Yup       -0.14
Nope      -0.14
crew      -0.14
Yep       -0.14
Positive Logits
ALSO      0.17
also      0.17
also      0.17
Also      0.16
Also      0.15
yo        0.15
thems     0.15
fine      0.14
hi        0.14
Ñĩа       0.14
Activations Density 0.173%