INDEX
Explanations
Words related to personal relationships and interactions, particularly romantic relationships, friendships, and societal views on relationships.
Neuron Alignment
Index    Value    % of L₁
1842     +0.11    0.3%
919      +0.08    0.2%
1978     +0.08    0.2%
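
The "% of L₁" column is not defined on this page. One plausible reading, stated here as an assumption, is that it gives each neuron's share of the L₁ norm of the feature's weight vector, i.e. |value| divided by the sum of absolute weights. A minimal sketch under that assumption follows; the function name `l1_share` and the toy weights are hypothetical and are not taken from the dashboard.

```python
def l1_share(weights, top_k=3):
    """Return (index, value, share_of_l1) for the top_k largest-magnitude weights.

    Assumption: "% of L1" means abs(weight) / sum of abs(all weights).
    """
    l1 = sum(abs(w) for w in weights)
    if l1 == 0:
        return []
    # Rank neuron indices by absolute weight, largest first.
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return [(i, weights[i], abs(weights[i]) / l1) for i in ranked[:top_k]]


# Hypothetical toy weight vector (not the feature's real weights).
toy = [0.5, -0.25, 0.125, 0.125]
rows = l1_share(toy, top_k=2)
for index, value, share in rows:
    print(f"{index}\t{value:+.2f}\t{share:.1%}")
```

Under this reading, small shares such as 0.2% to 0.3% would suggest that the feature's weight is spread across many neurons rather than concentrated in a few.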
Correlated Neurons
Index    P. Corr.    Cos Sim.
822      +0.11       0.02
151      +0.08       0.05
1441     +0.08       0.04
Negative Logits
Expt           -0.81
diagon         -0.76
Departement    -0.76
saha           -0.75
kule           -0.75
ù              -0.73
teras          -0.73
makro          -0.73
textil         -0.71
permu          -0.71
Positive Logits
flirting          0.69
flirt             0.65
courting          0.63
Attractive        0.63
infatu            0.62
romantic          0.60
suitors           0.60
attractiveness    0.60
dating            0.58
woo               0.58
Activations Density 0.616%
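
The page does not say how the density figure is computed. A common convention, assumed here, is the fraction of sampled tokens on which the feature's activation is nonzero. The sketch below illustrates that convention only; `activation_density`, its `threshold` parameter, and the toy activations are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold`.

    Assumption: density = (number of active tokens) / (total tokens sampled).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical per-token activations for one feature (not real data).
toy_acts = [0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.2, 0.0]
density = activation_density(toy_acts)
print(f"{density:.3%}")
```

On this reading, a density of 0.616% would mean the feature fires on roughly 6 of every 1,000 sampled tokens.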