Explanations
references to friends and family in the text
references to friends and social connections
Negative Logits
yss         -0.69
qqa         -0.65
oted        -0.63
tarians     -0.60
chloride    -0.59
acco        -0.59
ocalypse    -0.59
Cout        -0.58
secution    -0.58
seizure     -0.58
Positive Logits
hips            1.07
lier            0.95
acquaintances   0.94
hip             0.94
friends         0.83
collaborators   0.81
Friends         0.79
colleagues      0.78
friends         0.78
acquaintance    0.78
Activations Density 0.050%