Explanations
expressions of criticism and criticism-related vocabulary
Negative Logits
ги          -0.16
enha        -0.16
ki          -0.15
allet       -0.15
identity    -0.14
絡          -0.14
cki         -0.14
czy         -0.14
bore        -0.14
Madness     -0.14
Positive Logits
acos        0.17
oise        0.16
IPA         0.16
acas        0.15
bersome     0.14
asting      0.14
ingly       0.14
hur         0.14
ADVISED     0.13
Samar       0.13
Activations Density: 0.063%