INDEX
Explanations
pronouns followed by verbs indicating personal actions or opinions
phrases related to personal feelings and self-perception
Negative Logits
Need         -0.67
ij士         -0.67
OTOS         -0.66
Missing      -0.66
Bei          -0.65
CHAT         -0.64
ousands      -0.63
VW           -0.63
artifacts    -0.63
Bugs         -0.62
Positive Logits
phr          1.10
handled      1.08
treated      0.98
behaved      0.96
behave       0.95
interacts    0.95
dealt        0.93
interact     0.90
treat        0.90
pronounce    0.89
Activations Density: 0.186%