Explanations
verbs related to social interactions between individuals
interactions or conflicts between parties
Negative Logits
aroo      -0.72
unction   -0.70
wcs       -0.66
lot       -0.66
roma      -0.66
adier     -0.64
videos    -0.63
learn     -0.62
marine    -0.59
levard    -0.58
Positive Logits
each         2.23
each         1.76
Each         1.46
apiece       1.44
Each         1.32
one          1.00
themselves   0.92
another      0.84
selves       0.80
respective   0.79
Activations Density 0.391%