Explanations
words related to challenging conventional wisdom or uncharted territories
Negative Logits
gra    -0.68
lette    -0.67
tti    -0.66
ãĥ©ãĥ³    -0.62
elson    -0.62
sov    -0.61
erness    -0.61
conn    -0.60
terday    -0.60
ppo    -0.59
Positive Logits
ĸļ    1.12
ĥ    1.10
ģ    1.05
Ģ    1.02
arted    0.91
Ĵ    0.90
ĺ    0.85
ĸ    0.85
ĵ    0.84
anging    0.83
Activations Density: 0.035%