INDEX
Explanations
phrases that denote monitoring and observation
Negative Logits
ivol    -0.16
228     -0.15
666     -0.15
weise   -0.15
369     -0.14
PR      -0.14
706     -0.14
409     -0.14
ips     -0.14
own     -0.14
Positive Logits
eil     0.15
eut     0.15
osit    0.14
iel     0.14
fang    0.14
sight   0.14
erset   0.14
åde     0.14
enan    0.14
ffen    0.14
Activations Density 0.033%