INDEX
Explanations
adjectives describing positive experiences or qualities
Negative Logits
endeavor         -0.19
Favorite         -0.17
savory           -0.17
neighborhood     -0.17
maneuvers        -0.17
favorite         -0.16
neighborhoods    -0.16
favors           -0.16
behavior         -0.16
swath            -0.16
Positive Logits
cracking         0.27
advert           0.24
programme        0.23
flavours         0.22
contrib          0.22
intree           0.22
further          0.21
proportion       0.21
emot             0.20
£                0.19
Activations Density: 0.373%