Explanations
phrases that indicate conditionality or absence
Negative Logits
Token    Logit
ãĥī      -0.77
ãĥĺ      -0.73
quer     -0.72
mon      -0.71
ery      -0.69
mers     -0.68
late     -0.67
rolled   -0.66
oka      -0.66
cow      -0.66
Positive Logits
Token          Logit
risking        0.96
knowing        0.91
encountering   0.88
sacrificing    0.86
mentioning     0.85
noticing       0.82
compromising   0.81
recourse       0.79
realizing      0.78
violating      0.76
Activations Density 0.020%