INDEX
Explanations
phrases indicating outcomes or consequences
Negative Logits
lace        -0.17
Pur         -0.17
ighton      -0.17
er          -0.16
ист         -0.16
als         -0.15
thing       -0.15
ForResult   -0.14
іти         -0.14
Kết         -0.14
Positive Logits
antly       0.33
물을        0.23
물          0.19
ingly       0.19
ants        0.18
ados        0.18
oure        0.17
antz        0.17
ntag        0.17
ively       0.16
Activations Density 0.057%