INDEX
Explanations
references to social awkwardness or uncomfortable situations
Negative Logits
Token       Logit
ched        -0.16
hook        -0.16
éĽĦ         -0.16
çī          -0.16
ita         -0.15
allon       -0.15
leta        -0.15
Hook        -0.15
@student    -0.15
пеÑĩ        -0.15
Positive Logits
Token       Logit
launcher    0.15
dialog      0.14
оÑĢон       0.14
sil         0.14
dialogue    0.14
afil        0.14
itol        0.13
ais         0.13
situations  0.13
ãĥįãĥ«      0.13
Activations Density 0.030%
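
The density figure above is usually read as the fraction of token positions on which the feature fires at all. The sketch below illustrates that reading under stated assumptions: the function name `activation_density`, the zero threshold, and the sample data are hypothetical and not taken from this page or from any particular tool.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: a feature counts as "active" on a token when its activation
    is strictly greater than `threshold` (0.0 here).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, 3 of which activate the feature.
sample = [0.0] * 9997 + [0.8, 1.2, 0.4]
density = activation_density(sample)
print(f"{density:.3%}")  # prints "0.030%"
```

With these made-up numbers, 3 active positions out of 10,000 give a density of 0.030%, which is the same order as the value reported for this feature.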