INDEX
Explanations
cult or cult-like discussions
Negative Logits
ле      1.10
ни      1.01
ла      1.01
сах     0.96
il      0.95
лама    0.91
ра      0.87
лили    0.86
Сасик   0.84
ිය      0.83
Positive Logits
t       1.24
0       1.15
Cult    1.09
(token not rendered)    0.98
\       0.95
1       0.94
Cult    0.93
]       0.93
↵       0.91
on      0.91
Activations Density 0.002%