Explanations
phrases describing imaginative scenarios
hypothetical scenarios and thought experiments
Negative Logits
Token             Logit
Nonetheless       -0.77
Nevertheless      -0.75
Exit              -0.73
Nevertheless      -0.68
"},               -0.68
particularly      -0.67
Nonetheless       -0.66
ason              -0.65
unfocusedRange    -0.65
so                -0.63
Positive Logits
Token             Logit
Imagine           1.10
Imagine           1.09
scenario          1.07
hypot             0.98
imagine           0.98
hypothetical      0.90
scenarios         0.89
Suppose           0.88
dystopian         0.81
suddenly          0.77
Activations Density 0.396%