INDEX
Explanations
phrases indicating quality or evaluation
Negative Logits
kasarigan      -0.82
bonté          -0.74
crutches       -0.71
fantasies      -0.71
]")]           -0.70
miracles       -0.70
synergies      -0.68
nightmares     -0.68
oracles        -0.67
skyscrapers    -0.67
Positive Logits
number         0.80
amount         0.77
few            0.73
lot            0.72
degree         0.72
portion        0.66
range          0.65
way            0.64
list           0.59
sense          0.59
Activations Density: 0.492%