INDEX
Explanations
instances of specific actions or occurrences within the text
Negative Logits
Token     Logit
ulg       -0.20
ulis      -0.17
arios     -0.16
iores     -0.16
zew       -0.16
ecta      -0.16
assin     -0.15
charts    -0.15
ycin      -0.15
iciel     -0.15
Positive Logits
Token     Logit
백        0.16
Frank     0.16
circ      0.16
Norm      0.15
cline     0.15
Crescent  0.15
Opt       0.14
Cree      0.14
ila       0.14
rr        0.14
Activations Density: 0.008%