Explanations
positive adjectives followed by nouns
Negative Logits
ini        0.46
Logos      0.43
ico        0.39
Locking    0.38
ida        0.37
Hyper      0.37
Hyper      0.37
ulagway    0.36
akkhati    0.36
ino        0.36
Positive Logits
puzzling       0.45
aead           0.40
domain         0.39
physic         0.38
даже           0.38
quantitative   0.38
veg            0.38
booze          0.38
ail            0.37
chieft         0.37
Activations Density 0.000%