Explanations
phrases that contrast opposing viewpoints or options
Negative Logits
dr                                -0.77
nz                                -0.75
zag                               -0.74
zos                               -0.71
rawdownloadcloneembedreportprint  -0.71
dra                               -0.70
sed                               -0.69
renches                           -0.69
sung                              -0.68
ciating                           -0.68
Positive Logits
lobe            0.68
reconcil        0.65
naïve           0.65
intellectually  0.62
anarch          0.60
coasts          0.60
behalf          0.59
depict          0.59
foremost        0.59
representing    0.58
Activations Density: 0.030%