Explanations
references to the number "one"
Negative Logits
Token     Logit
hue       -0.16
linger    -0.16
morgan    -0.15
yne       -0.15
licken    -0.14
odor      -0.14
RTL       -0.14
Lair      -0.14
wal       -0.14
bib       -0.14
Positive Logits
Token      Logit
orz        0.15
Punch      0.15
zim        0.15
punch      0.14
ikan       0.13
Criterion  0.13
Stokes     0.13
олож       0.13
lez        0.13
icons      0.13
Activations Density: 0.338%