Explanations
references to social interactions and experiences
Negative Logits
ſelf       -0.75
houſe      -0.73
purpoſe    -0.73
obſ        -0.71
neceſſ     -0.69
feroit     -0.69
pouvoit    -0.69
becauſe    -0.68
Eſ         -0.68
neceff     -0.68
Positive Logits
nab        0.78
snag       0.74
tuck       0.73
sna        0.73
chow       0.71
popped     0.71
indulged   0.70
sneak      0.70
donned     0.70
grab       0.69
Activations Density: 0.422%