Explanations
phrases that indicate emphasis or strong assertions
Negative Logits
Tos       -0.17
ulp       -0.16
lew       -0.15
rets      -0.15
陈        -0.15
ider      -0.14
Ben       -0.14
Bart      -0.14
adin      -0.14
_inline   -0.14
Positive Logits
pf        0.20
iah       0.15
raj       0.15
Lâm       0.15
кут       0.15
Pf        0.14
vý        0.14
sville    0.14
алом      0.14
hammer    0.14
Activations Density: 0.025%