With some crafty string manipulation you might be able to do this in Liquid as well, though there are a lot of caveats to that approach.
I am indeed referring to HTML figures where you have an image and a figcaption. I've never heard of emojis being called figures.
There is also the issue of determining which images and text should go inside a figure and which should just be an image with a paragraph after it. I'm not sure how I would do that without implementing some sort of identifier that the user could put in to mark the start and end of a figure, which wouldn't be much better than just teaching them HTML.
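
To make the marker idea concrete, here is a rough Python sketch of what such a preprocessing step could look like. The `{figure}` and `{endfigure}` markers and the `wrap_figures` function are names I made up for illustration; nothing in Jekyll or Liquid defines them, and a real version would need to handle many more edge cases.

```python
import html
import re

# Hypothetical start/end markers, invented for this sketch.
FIGURE_BLOCK = re.compile(r"\{figure\}\s*(.*?)\s*\{endfigure\}", re.DOTALL)
# A basic Markdown image: ![alt text](source)
MARKDOWN_IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def wrap_figures(text: str) -> str:
    """Turn each marked block into <figure> with an <img> and a <figcaption>."""

    def to_figure(match: re.Match) -> str:
        body = match.group(1)
        image = MARKDOWN_IMAGE.search(body)
        if image is None:
            # No image inside the markers: leave the block untouched.
            return match.group(0)
        alt, src = image.group(1), image.group(2)
        # Whatever text remains inside the markers becomes the caption.
        caption = (body[:image.start()] + body[image.end():]).strip()
        return (
            "<figure>\n"
            f'  <img src="{html.escape(src, quote=True)}" '
            f'alt="{html.escape(alt, quote=True)}">\n'
            f"  <figcaption>{html.escape(caption)}</figcaption>\n"
            "</figure>"
        )

    return FIGURE_BLOCK.sub(to_figure, text)


sample = (
    "{figure}\n"
    "![A cat](cat.png)\n"
    "My cat.\n"
    "{endfigure}\n"
    "\n"
    "![A dog](dog.png)\n"
    "\n"
    "Just a paragraph."
)
print(wrap_figures(sample))
```

Only the block between the markers becomes a figure; the second image and its paragraph pass through unchanged. The user still has to learn and type the markers correctly, which is the same burden as learning the two HTML tags.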