INDEX
Explanations
phrases that indicate comparisons or examples
Negative Logits
Token          Logit
ynn            -0.17
Boeh           -0.16
ught           -0.15
bsolute        -0.14
ubar           -0.14
ustr           -0.14
WithOptions    -0.14
icemail        -0.13
*)"            -0.13
atalog         -0.13
Positive Logits
Token          Logit
że             0.15
upp            0.13
Dow            0.12
patrick        0.12
lena           0.12
courthouse     0.12
icha           0.12
Sext           0.12
Terror         0.12
ãģ¾ãģĻ         0.12
Activations Density: 0.051%