Babu, A., Wang, C., Tjandra, A., Lakhotia, K., Xu, Q., Goyal, N., Singh, K., von Platen, P., Saraf, Y., Pino, J., Baevski, A., Conneau, A., & Auli, M. (2022). XLS-R: Self-supervised cross-lingual speech representation learning at scale. In Proceedings of Interspeech 2022 (pp. 2278–2282). ISCA. https://doi.org/10.21437/Interspeech.2022-143
Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020) (pp. 12449–12460). Curran Associates.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., & Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020) (pp. 8440–8451). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.747
Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86–92. https://doi.org/10.1145/3458723
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations (ICLR 2022).
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Le Scao, T., Lavril, T., Wang, T., Lacroix, T., & El Sayed, W. (2023). Mistral 7B [Preprint]. arXiv:2310.06825.
Kreutzer, J., Caswell, I., Wang, L., Wahab, A., van Esch, D., Ulzii-Orshikh, N., Tapo, A., Subramani, N., Sokolov, A., Sikasote, C., Setyawan, M., Sarin, S., Samb, S., Sagot, B., Rivera, C., Rios, A., Papadimitriou, I., Osei, S., Ortiz Suarez, P., Orife, I., … Adeyemi, M. (2022). Quality at a glance: An audit of web-crawled multilingual datasets. Transactions of the Association for Computational Linguistics, 10, 50–72. https://doi.org/10.1162/tacl_a_00447
Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., & Gebru, T. (2019). Model cards for model reporting. In Proceedings of the 2019 Conference on Fairness, Accountability, and Transparency (FAccT 2019) (pp. 220–229). ACM. https://doi.org/10.1145/3287560.3287596
Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023).
ASR
Bérard, A., Besacier, L., Pellegrini, T., & Schwab, D. (2021). Cross-lingual transfer for ASR in low-resource languages [Preprint]. arXiv:2005.04290.
IARPA Babel Program. (2016). Low resource speech recognition (Research Program overview). Retrieved from https://www.iarpa.gov/research-programs/babel
Kong, J., Kim, J., & Bae, J. (2020). HiFi-GAN: Generative adversarial networks for efficient and high-fidelity speech synthesis. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
Mariani, J., Negri, M., Turchi, M., & Rotolo, M. (2022). Italian dialect ASR using wav2vec 2.0 [Preprint]. arXiv:2205.02732.
Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). LibriSpeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2015) (pp. 5206–5210). IEEE. https://doi.org/10.1109/ICASSP.2015.7178964
Park, D. S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E. D., & Le, Q. V. (2019). SpecAugment: A simple data augmentation method for automatic speech recognition. In Proceedings of Interspeech 2019 (pp. 2613–2617). ISCA. https://doi.org/10.21437/Interspeech.2019-2680
Pratap, V., Tjandra, A., Shi, B., Tomasello, P., Babu, A., Kundu, S., Elkahky, A., Ni, Z., Vyas, A., Fazel-Zarandi, M., Baevski, A., Adi, Y., Zhang, X., Hsu, W.-N., Conneau, A., & Auli, M. (2024). Scaling speech technology to 1,000+ languages. Journal of Machine Learning Research, 25(97), 1–52.
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. In Proceedings of the 40th International Conference on Machine Learning (ICML 2023) (Vol. 202, pp. 28492–28518). PMLR.
Slam, W., Li, Y., & Urouvas, N. (2023). Frontier research on low-resource speech recognition technology. Sensors, 23(22), 9096. https://doi.org/10.3390/s23229096
Wang, Y., & Cao, Y. (2022). Phonetic lexicon design for under-resourced languages: A case study on Tu. In Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022) (pp. 8562–8566). IEEE. https://doi.org/10.1109/ICASSP43922.2022.9746580
Yksel, H., Krüger, A., & Kirchhoff, K. (2023). NoRefER: A referenceless quality metric for automatic speech recognition [Preprint]. arXiv:2304.00612.
TTS
Casanova, E., Weber, J., Shulby, C. D., Candido, A. V., Gölge, E., & Ponti, M. A. (2022). YourTTS: Towards zero-shot multi-speaker TTS and voice conversion for everyone. In Proceedings of the 39th International Conference on Machine Learning (ICML 2022) (Vol. 162, pp. 2709–2720). PMLR. Retrieved from https://proceedings.mlr.press/v162/casanova22a.html
Łańcucki, A. (2021). FastPitch: Parallel text-to-speech with pitch prediction. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021) (pp. 6588–6592). IEEE. https://doi.org/10.1109/ICASSP39728.2021.9413381
Evaluation
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002) (pp. 311–318). ACL. https://doi.org/10.3115/1073083.1073135
Popović, M. (2015). chrF: Character n-gram F-score for automatic MT evaluation. In Proceedings of the 10th Workshop on Statistical Machine Translation (WMT 2015) (pp. 392–395).
Rei, R., Farinha, A. C., Lavie, A., & Specia, L. (2020). COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020) (pp. 2685–2702). ACL. https://doi.org/10.18653/v1/2020.emnlp-main.213
Sadat, F., Kazemi, F., & Farzindar, A. (2014). Automatic identification of Arabic dialects in social media. In Proceedings of the 1st Workshop on Arabic Natural Language Processing (VarDial 2014) (pp. 43–53).
Scherrer, Y., & Ljubešić, N. (2020). Discriminating between similar languages in Swiss German texts. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2020) (pp. 155–163). ACL. https://doi.org/10.18653/v1/2020.vardial-1.17
Yoshimura, T., Stølsmark, H., Saino, K., Wang, X., Kubin, G., & Yamagishi, J. (2023). Rethinking mean opinion scores in speech quality assessment. In Proceedings of Interspeech 2023 (pp. 2068–2072). ISCA.
HITL / Active Learning
Nguyen, A., Wallace, E., Iyyer, M., & Neubig, G. (2022). Active learning for low-resource neural machine translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022) (pp. 6413–6428). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.444