Explanations
mentions of individuals and their actions or statements
Negative Logits
ãĤ¼          -0.65
Enhance      -0.65
Beaut        -0.64
Dise         -0.63
ishable      -0.63
Travels      -0.62
Masquerade   -0.61
Cutting      -0.61
Patch        -0.61
MAP          -0.61
Positive Logits
replied      1.53
answered     1.40
reply        1.38
replies      1.38
responded    1.34
answer       1.29
hesitated    1.25
answers      1.16
answ         1.12
Answer       1.11
Activations Density: 0.201%