INDEX
Explanations
phrases indicating knowledge or understanding
Negative Logits
entes     -0.15
rift      -0.15
hood      -0.15
IFI       -0.15
ype       -0.15
quist     -0.14
umps      -0.14
elah      -0.14
ublic     -0.14
ikes      -0.14
Positive Logits
how       0.18
about     0.18
enough    0.17
what      0.15
nothing   0.15
loff      0.15
dist      0.14
.Unknown  0.14
biết      0.14
basic     0.14
Activations Density: 0.095%