Explanations
pronouns or possessive words indicating relationships between different entities
references to relationships and interactions between people
Negative Logits
aneously    -0.70
stals       -0.68
ctors       -0.67
unts        -0.63
stein       -0.62
monds       -0.60
aneous      -0.60
172         -0.58
mons        -0.57
ships       -0.57
Positive Logits
hip         0.97
heet        0.96
hare        0.93
etter       0.93
etting      0.91
mith        0.91
ilver       0.90
cape        0.81
peed        0.80
erver       0.79
Activations Density 0.294%