INDEX
Explanations
responses to questions or statements
instances of dialogue, specifically replies or responses in conversations
Negative Logits
teenth      -0.75
mental      -0.72
fi          -0.69
ctors       -0.66
BALL        -0.66
icipated    -0.66
dar         -0.64
flame       -0.64
Trials      -0.63
cipled      -0.63
Positive Logits
thereto             0.98
angrily             0.85
favorably           0.85
sarcast             0.84
affirm              0.84
promptly            0.80
politely            0.79
reply               0.78
enthusiastically    0.77
later               0.75
Activations Density: 0.038%