INDEX
Explanations
terms related to personal opinions and influences
Negative Logits
pleaſure        -0.89
ſelves          -0.84
propOrder       -0.82
kasarigan       -0.79
houſe           -0.78
ſei             -0.78
Houſe           -0.77
Personendaten   -0.77
ſind            -0.76
ſou             -0.75
Positive Logits
[               0.30
top             0.30
myself          0.29
我把            0.28
.               0.28
xấu             0.27
"[              0.27
'               0.26
[]              0.25
saw             0.25
Activations Density 0.323%