Explanations
Phrases that express guiding principles or mottos.
Negative Logits
Token      Logit
iei        -0.16
oje        -0.15
Mention    -0.15
إد         -0.15
λικ        -0.15
abar       -0.15
mention    -0.14
stery      -0.14
äch        -0.14
ráf        -0.14
Positive Logits
Token      Logit
motto      0.55
slogan     0.53
theme      0.47
theme      0.40
Theme      0.38
slogans    0.37
logan      0.35
Theme      0.35
themes     0.34
THEME      0.32
Activation density: 0.264%
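A minimal sketch of how an activation-density figure like the one above could be computed. It assumes density means the fraction of token positions where the feature's activation is strictly greater than zero. The function name `activation_density`, the `threshold` parameter, and the 264-in-100,000 example counts are hypothetical illustrations, not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions where the activation exceeds `threshold`.

    Assumption (hypothetical): a position counts as "active" when its
    activation is strictly greater than the threshold (0.0 by default).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    # Count positions where the feature fires above the threshold.
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical example: 264 active positions out of 100,000 tokens.
acts = [1.0] * 264 + [0.0] * (100_000 - 264)
print(f"{activation_density(acts):.3%}")  # prints 0.264%
```

A strict `>` comparison treats exact zeros (as produced by a ReLU-style activation) as inactive; a different threshold would change the reported density.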