INDEX
Explanations
phrases related to strong emotions or reactions
emotional reactions and intense responses related to experiences
Negative Logits
mut          -0.66
repl         -0.65
merged       -0.64
zero         -0.63
married      -0.63
ipped        -0.61
substituted  -0.58
fried        -0.57
background   -0.57
ouched       -0.56
Positive Logits
bies           0.75
unnecessarily  0.72
territ         0.70
=-=-=-=-       0.68
GGGGGGGG       0.68
aughs          0.67
iday           0.66
Strange        0.64
ãĢĤ            0.63
ikes           0.63
Activations Density: 0.229%