Explanations
phrases that indicate user engagement or interaction, particularly commenting
Negative Logits
irc       -0.15
reek      -0.15
lene      -0.15
aurus     -0.15
ayer      -0.15
iban      -0.14
inkel     -0.14
-League   -0.14
ercise    -0.14
elves     -0.14
Positive Logits
CriticalSection   0.20
hart              0.19
697               0.17
hou               0.17
inka              0.17
Leave             0.16
Leave             0.16
enan              0.16
ayette            0.16
ãĥ¬ãĥĥãĥĪ         0.16
Activations Density 0.020%