INDEX
Explanations
Instances of conversational markers and expressions of enthusiasm
Negative Logits
Alongside    -0.80
:')          -0.79
Alongside    -0.77
AndEndTag    -0.75
klart        -0.72
;-;          -0.71
͡°            -0.69
subreddit    -0.69
screenshot   -0.68
tryna        -0.68
Positive Logits
BTW          0.89
BTW          0.88
muß          0.87
IMHO         0.76
müßte        0.73
OK           0.72
läßt         0.72
Надо         0.72
mußte        0.71
Thanx        0.69
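The logit lists show which output tokens this feature most suppresses (negative) and most promotes (positive). Dashboards of this kind commonly derive such lists with a logit-lens projection: multiply the feature's decoder direction by the model's unembedding matrix, then keep the k lowest and k highest scores. This page does not confirm that method, so treat it as an assumption. The sketch below uses plain Python lists. `top_logits`, `decoder_dir`, `unembed`, and the toy vocabulary are hypothetical names and do not come from any particular library.

```python
def top_logits(decoder_dir, unembed, vocab, k=10):
    """Return the k most negative and k most positive (token, score) pairs.

    Assumed inputs (hypothetical, for illustration only):
      decoder_dir: feature decoder direction, a list of d_model floats.
      unembed: unembedding matrix as d_model rows of vocab_size floats.
      vocab: token strings, one per unembedding column.
    """
    d_model = len(decoder_dir)
    vocab_size = len(vocab)
    # Logit-lens projection: score[t] = sum_j decoder_dir[j] * unembed[j][t]
    scores = [
        sum(decoder_dir[j] * unembed[j][t] for j in range(d_model))
        for t in range(vocab_size)
    ]
    # Sort token indices from lowest to highest score.
    ranked = sorted(range(vocab_size), key=lambda t: scores[t])
    negative = [(vocab[t], scores[t]) for t in ranked[:k]]
    # Highest scores first for the positive list.
    positive = [(vocab[t], scores[t]) for t in reversed(ranked[-k:])]
    return negative, positive


if __name__ == "__main__":
    # Toy 2-dimensional model with a 4-token vocabulary.
    vocab = ["a", "b", "c", "d"]
    decoder_dir = [1.0, 0.0]
    unembed = [[0.5, -0.2, 0.9, -0.7], [0.0, 0.0, 0.0, 0.0]]
    neg, pos = top_logits(decoder_dir, unembed, vocab, k=2)
    print(neg)  # [('d', -0.7), ('b', -0.2)]
    print(pos)  # [('c', 0.9), ('a', 0.5)]
```

Repeated entries such as `BTW` and `Alongside` most likely correspond to distinct vocabulary items that render identically, for example with and without a leading space. A projection like this one would score them separately.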
Activation density: 0.469%
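Activation density is usually read as the share of sampled token positions on which the feature fires. The sketch below makes two assumptions that this page does not state: "fires" means an activation strictly above zero, and the density is taken over a flat list of per-token activations. `activation_density` and its `threshold` parameter are hypothetical names.

```python
def activation_density(activations, threshold=0.0):
    """Percentage of token positions where the feature activation exceeds threshold.

    Assumes `activations` is a flat list of per-token activation values
    (hypothetical input format) and that "active" means > threshold.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


if __name__ == "__main__":
    # 469 active positions out of 100,000 gives a density of 0.469%.
    sample = [0.0] * 99_531 + [1.2] * 469
    print(f"{activation_density(sample):.3f}%")  # 0.469%
```

At 0.469%, the feature fires on roughly 1 token in 213. That rate fits a feature tied to specific conversational markers rather than to common function words.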