INDEX
Explanations
words related to discovery and exploration
Negative Logits
Token     Logit
statt     -0.18
soever    -0.18
unn       -0.17
rott      -0.16
Aviv      -0.15
hee       -0.15
quired    -0.15
uracy     -0.15
/she      -0.15
igi       -0.15
Positive Logits
Token     Logit
ies       0.24
ry        0.23
IES       0.21
ability   0.20
verse     0.19
ogue      0.19
ries      0.18
ments     0.17
ment      0.17
ively     0.16
Activations Density: 0.023%