INDEX
Explanations
phrases that indicate preference or favor towards something
Negative Logits
awar    -0.83
block   -0.75
/-      -0.69
}}      -0.69
UGH     -0.65
break   -0.65
stan    -0.65
wow     -0.64
nor     -0.62
eno     -0.62
Positive Logits
simpler        0.87
simplified     0.82
embracing      0.77
softer         0.76
streamlined    0.74
something      0.71
sleek          0.70
trendy         0.70
concentrating  0.69
embrace        0.67
Activations Density 0.075%