Explanations
references to weakness or fragility in various contexts
Negative Logits
Token     Logit
rike      -0.17
hone      -0.16
통신      -0.16
ingham    -0.16
êt        -0.15
ング      -0.15
rika      -0.15
asca      -0.15
asu       -0.15
Ậ         -0.15
Positive Logits
Token     Logit
弱        0.27
weak      0.25
Weak      0.24
weak      0.24
Weak      0.24
слаб      0.23
weakest   0.22
weaker    0.21
-strong   0.18
ly        0.18
Activations Density 0.029%