INDEX
Explanations
pronouns followed by actions
Negative Logits
exceeds       0.35
ensures       0.34
cục           0.34
promotes      0.33
受            0.32
*             0.32
payoff        0.31
encompasses   0.31
interacts     0.31
grooming      0.31
Positive Logits
hadn          0.60
knew          0.46
couldn        0.42
remembered    0.39
ida           0.39
wasn          0.39
sabía         0.39
soon          0.39
explained     0.37
laughed       0.37
Activations Density: 0.021%