INDEX
Explanations
abstract concepts following "of"
Negative Logits
ທ່ານ          0.54
ح             0.54
У             0.54
the           0.53
Произ         0.52
S             0.51
échantillons  0.51
Обра          0.49
Obviously     0.49
they          0.47
Positive Logits
sorts         1.01
course        0.81
interest      0.74
colonialism   0.65
normalcy      0.62
betrayal      0.62
disbelief     0.62
contention    0.61
interplay     0.59
wrongdoing    0.59
Activations Density: 0.807%