INDEX
Explanations
specific nouns and quantities
Negative Logits
ע    0.53
特に    0.52
ることが    0.50
וד    0.49
იკ    0.49
בא    0.46
Caedwalla    0.46
명    0.45
ด่า    0.45
особливо    0.44
Positive Logits
comforts    0.48
engagements    0.43
consoles    0.42
riev    0.42
teammates    0.41
me    0.40
workouts    0.40
({    0.40
autocratic    0.40
re    0.40
Activations Density 0.005%