INDEX
Explanations
instances of specific words related to measurement or statistics
Negative Logits
illage     -0.17
.nasa      -0.16
DN         -0.16
cles       -0.15
inde       -0.14
ida        -0.14
atively    -0.14
ative      -0.14
amente     -0.14
azu        -0.14
Positive Logits
phins      0.17
Dw         0.17
dw         0.17
fault      0.15
anzi       0.15
raid       0.14
_WAKE      0.14
sett       0.14
.dw        0.14
çĵ         0.14
Activations Density: 0.017%