INDEX
Explanations
adjectives describing positive experiences
descriptions of pleasant or unpleasant experiences and feelings
Negative Logits
arius       -0.68
blindly     -0.64
aan         -0.63
ULE         -0.62
inition     -0.62
ithing      -0.61
limits      -0.61
mining      -0.61
govtrack    -0.60
rules       -0.60
Positive Logits
ries           1.09
lihood         0.98
pleasant       0.89
smelling       0.86
surprises      0.82
ness           0.81
ãĥ¼ãĥĨ         0.77
experiences    0.77
rious          0.76
unpleasant     0.76
Activations Density 0.031%