INDEX
Explanations
expressing curiosity to learn
Negative Logits
kwamba     0.49
being      0.43
embracing  0.42
admitting  0.42
being      0.42
bahwa      0.41
accepting  0.40
BEING      0.40
absorbing  0.40
cope       0.39
Positive Logits
ity         0.54
curry       0.52
ITY         0.51
জানতে       0.48
george      0.47
Curry       0.47
Understand  0.44
/*          0.43
了解        0.43
cur         0.42
Activations Density: 0.007%