INDEX
Explanations
phrases that express knowledge or awareness
Negative Logits
ero          -0.18
ids          -0.17
ucid         -0.15
ãĤīãģĦ       -0.15
ICLE         -0.15
aso          -0.14
andalone     -0.14
_kernel      -0.14
hav          -0.14
оÑģÑĤ         -0.13
Positive Logits
ledged       0.30
-how         0.28
led          0.28
æĻĵ          0.26
ledge        0.25
lege         0.23
ingly        0.23
LED          0.21
about        0.20
ledger       0.18
Activations Density: 0.116%