Explanations
questions expressing curiosity or seeking information about the nature or type of something
phrases relating to types or categories
Negative Logits
eni      -0.77
orest    -0.76
ĸļ       -0.73
enes     -0.72
esty     -0.71
pak      -0.70
cius     -0.70
arest    -0.69
Pigs     -0.69
ences    -0.67
Positive Logits
thing          0.78
relationship   0.72
luck           0.72
manners        0.68
surprises      0.68
millenn        0.68
monster        0.67
deal           0.66
goodies        0.65
sleeper        0.64
Activations Density 0.043%