Explanations
scientific findings or research results
phrases related to research findings and reported behaviors
Negative Logits
onica         -0.75
quote         -0.69
onto          -0.69
steel         -0.68
oland         -0.68
undo          -0.68
united        -0.68
prep          -0.67
raft          -0.67
icus          -0.66
Positive Logits
slight        1.30
reductions    1.26
negligible    1.26
significant   1.26
declines      1.25
decreases     1.21
fewer         1.21
minimal       1.19
substantial   1.18
decreased     1.18
Activations Density: 0.296%