INDEX
Explanations
references to friendship and social relationships
Negative Logits
Token         Logit
eria          -0.17
erse          -0.16
Himself       -0.15
herself       -0.15
urve          -0.15
イク          -0.15
himself       -0.15
alth          -0.14
themselves    -0.14
benh          -0.14
Positive Logits
Token         Logit
lier          0.40
liness        0.33
liest         0.32
/ac           0.32
lies          0.27
whom          0.26
/lo           0.25
circle        0.25
/f            0.25
ships         0.25
Activations Density 0.075%