The Z80 is not really a problem: although it runs at a relatively high frequency (for its day), it needs a lot of clock cycles per instruction, so it doesn't execute all that many instructions per second. Modern Java VMs are quite efficient and modern phones have pretty fast CPUs, so even though you have to run a Z80 interpreter that is itself JIT-compiled, the result will be faster than a real Z80.
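To put rough numbers on that, here is a minimal sketch of a cycle-counting interpreter loop. It assumes the MSX clock of 3.579545 MHz and a 60 Hz frame rate, ignores the extra wait state the MSX adds, and handles only four opcodes without flags. The class name TinyZ80 and its fields are made up for illustration and are not taken from any existing emulator.

```java
// Sketch of a cycle-counting Z80 interpreter loop (tiny opcode subset, no flags).
final class TinyZ80 {
    static final int CLOCK_HZ = 3579545;                 // MSX Z80 clock in Hz
    static final int T_STATES_PER_FRAME = CLOCK_HZ / 60; // budget for one 60 Hz frame

    final byte[] mem = new byte[65536]; // flat 64 KB address space, no slots or mappers
    int pc;                             // program counter
    int a;                              // accumulator
    long instructions;                  // number of instructions executed so far

    // Runs until at least `budget` T-states are consumed; returns the T-states used.
    int run(int budget) {
        int used = 0;
        while (used < budget) {
            int op = mem[pc] & 0xFF;
            pc = (pc + 1) & 0xFFFF;
            switch (op) {
                case 0x00: // NOP, 4 T-states
                    used += 4;
                    break;
                case 0x3C: // INC A, 4 T-states (flag updates omitted in this sketch)
                    a = (a + 1) & 0xFF;
                    used += 4;
                    break;
                case 0x3E: // LD A,n, 7 T-states
                    a = mem[pc] & 0xFF;
                    pc = (pc + 1) & 0xFFFF;
                    used += 7;
                    break;
                case 0xC3: { // JP nn, 10 T-states
                    int lo = mem[pc] & 0xFF;
                    int hi = mem[(pc + 1) & 0xFFFF] & 0xFF;
                    pc = (hi << 8) | lo;
                    used += 10;
                    break;
                }
                default:
                    throw new IllegalStateException("opcode not covered by this sketch: " + op);
            }
            instructions++;
        }
        return used;
    }
}
```

With these assumptions the budget is a bit under 60,000 T-states per frame, and even if every instruction took the minimum of 4 T-states the interpreter would need to handle fewer than 900,000 instructions per second.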
Mapping SCREEN2 characters to textures should render quite fast on phones that support 3D rendering (JSR 184). And on some phones that rendering is hardware accelerated, so it can happen in parallel with the rest of the emulation.
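A rough sketch of the lookup this would need, in plain Java with no JSR 184 calls. It assumes the usual SCREEN2 layout (32x24 cells of 8x8 pixels, with each third of the screen using its own bank of 256 patterns) and a hypothetical 256x256 texture atlas holding all 768 patterns, 32 per row. The class and method names are made up for illustration, and the per-line colours from the colour table would still have to be baked into the texels.

```java
// Sketch: map a SCREEN2 cell to a pattern number and to a position in a texture atlas.
final class Screen2Tiles {
    static final int COLS = 32; // character cells per row
    static final int ROWS = 24; // character rows
    static final int TILE = 8;  // cell size in pixels

    // Pattern number 0..767 for a cell. Rows 0-7, 8-15 and 16-23 each
    // select their own bank of 256 patterns.
    static int patternIndex(byte[] nameTable, int col, int row) {
        int name = nameTable[row * COLS + col] & 0xFF;
        return (row / 8) * 256 + name;
    }

    // Pixel position of a pattern in the assumed atlas layout (32 patterns per atlas row).
    static int atlasX(int pattern) {
        return (pattern % 32) * TILE;
    }

    static int atlasY(int pattern) {
        return (pattern / 32) * TILE;
    }
}
```

Only patterns that the emulated program rewrites would need their part of the atlas updated. Changes to the name table just select different texture coordinates.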
I'm not sure about audio though: is there an efficient way to stream realtime synthesized sounds? The usual way to play sounds in J2ME is javax.microedition.media.Player, but this is not designed for streaming. Does anyone know a workaround or a different approach? In the worst case, you'd have to play without audio, which is probably nicer to the people around you anyway.
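One thing that might be worth trying, completely untested: synthesize short chunks of sound as in-memory WAV files and hand each one to a new Player. The sketch below only builds such a chunk (8-bit unsigned mono PCM, a plain square wave standing in for a PSG channel) and uses nothing outside core Java. The class name WavChunk is made up. On a phone, the resulting array could presumably be passed to Manager.createPlayer(new ByteArrayInputStream(wav), "audio/x-wav"), assuming the device supports that content type.

```java
// Sketch: build a small WAV file in memory containing a square wave.
final class WavChunk {
    // Store a 32-bit value in little-endian byte order.
    static void le32(byte[] b, int off, int v) {
        b[off] = (byte) v;
        b[off + 1] = (byte) (v >> 8);
        b[off + 2] = (byte) (v >> 16);
        b[off + 3] = (byte) (v >> 24);
    }

    // Store a 16-bit value in little-endian byte order.
    static void le16(byte[] b, int off, int v) {
        b[off] = (byte) v;
        b[off + 1] = (byte) (v >> 8);
    }

    // Store an ASCII tag such as "RIFF".
    static void ascii(byte[] b, int off, String s) {
        for (int i = 0; i < s.length(); i++) {
            b[off + i] = (byte) s.charAt(i);
        }
    }

    // Returns a complete WAV file: 44-byte header plus numSamples bytes of PCM data.
    static byte[] squareWave(int sampleRate, int freqHz, int numSamples) {
        byte[] wav = new byte[44 + numSamples];
        ascii(wav, 0, "RIFF");
        le32(wav, 4, 36 + numSamples); // size of everything after this field
        ascii(wav, 8, "WAVE");
        ascii(wav, 12, "fmt ");
        le32(wav, 16, 16);             // size of the fmt chunk
        le16(wav, 20, 1);              // format 1 = uncompressed PCM
        le16(wav, 22, 1);              // mono
        le32(wav, 24, sampleRate);
        le32(wav, 28, sampleRate);     // bytes per second (1 byte per sample)
        le16(wav, 32, 1);              // block align
        le16(wav, 34, 8);              // bits per sample
        ascii(wav, 36, "data");
        le32(wav, 40, numSamples);
        int halfPeriod = Math.max(1, sampleRate / (2 * freqHz));
        for (int i = 0; i < numSamples; i++) {
            boolean high = (i / halfPeriod) % 2 == 0;
            wav[44 + i] = high ? (byte) 0xC0 : (byte) 0x40; // 8-bit WAV samples are unsigned
        }
        return wav;
    }
}
```

I would expect audible gaps between chunks and noticeable latency from creating and starting a Player each time, so this may well turn out to be unusable in practice.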