INDEX
Explanations
phrases that compare and contrast positive and negative aspects
phrases indicating a contrast between positive and negative aspects
Negative Logits
Token      Logit
vernment   -0.82
sbm        -0.78
ICLE       -0.75
20439      -0.74
quit       -0.69
Intake     -0.69
ebin       -0.67
ciating    -0.67
sections   -0.66
igraph     -0.66
Positive Logits
Token      Logit
evil       1.01
brightest  0.99
evil       0.99
cheerful   0.96
shiny      0.96
noble      0.96
honorable  0.94
fluffy     0.94
virtuous   0.91
tidy       0.91
Activations Density: 0.110%