The Ibn Battouta ATC (Air Traffic Control) communication corpus is an English corpus characterized by strongly Moroccan-accented speech. It is still under construction and focuses on Tangier airport. It synchronizes the voice recordings with ADS-B data (flight data broadcast by aircraft) and METAR reports (airport weather reports), which serve as trusted structured data whose content is frequently spoken during the communications.
The aim of this work is to update the acoustic model and train an ASR (Automatic Speech Recognition) engine such as CMU Sphinx or Kaldi, in order to improve ATC capabilities, facilitate decision making and ensure security. So far, we have recorded more than five hours of audio, with silences removed, of real-life communication between pilots and controllers during takeoffs, approaches and landings at Tangier airport and Gibraltar airport, and en route between Moroccan and Spanish airspace in both directions.
The corpus also includes 10 log files of ADS-B flight data and METAR reports. All audio files have been fully transcribed manually by airline pilots and controllers working at Tangier airport, with time marks indicating the start of each transmission and its duration.
To the best of our knowledge, it is the first corpus that aligns speech data with structured data, allowing for richer ASR modeling. In this paper, we describe the geographic characteristics of Tangier airport and the techniques used to ensure the quality of the recordings, the data collection and the transcription. We also describe the metadata annotation and the advantages of ATC phraseology as a reduced, controlled language, combined with synchronized flight data and weather reports as trusted structured data.
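The alignment between time-marked transmissions and the structured data could be sketched as follows. This is a minimal illustration and not the procedure used to build the corpus. The `Utterance` class, the functions `latest_at_or_before` and `align`, and the `(timestamp, payload)` record layout are all hypothetical, and the sketch assumes that audio, ADS-B and METAR timestamps share a common clock.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Utterance:
    """One transcribed transmission (hypothetical layout)."""
    start: float      # start of transmission, in seconds on the shared clock
    duration: float   # length of the transmission, in seconds
    text: str         # manual transcription


def latest_at_or_before(timestamps, records, t):
    """Return the record with the latest timestamp <= t, or None if there is none."""
    i = bisect_right(timestamps, t)
    return records[i - 1] if i > 0 else None


def align(utterances, adsb, metar):
    """Attach the most recent ADS-B and METAR record to each utterance.

    `adsb` and `metar` are lists of (timestamp, payload) tuples,
    assumed to be sorted by timestamp.
    """
    adsb_ts = [t for t, _ in adsb]
    metar_ts = [t for t, _ in metar]
    aligned = []
    for u in utterances:
        aligned.append({
            "text": u.text,
            "start": u.start,
            "end": u.start + u.duration,
            # Structured data known at the moment the transmission begins.
            "adsb": latest_at_or_before(adsb_ts, adsb, u.start),
            "metar": latest_at_or_before(metar_ts, metar, u.start),
        })
    return aligned
```

Each utterance is paired with the last record observed before the transmission started, so the structured data reflects what the speaker could have known at that moment. A nearest-in-time or interpolated lookup would be an equally plausible choice for ADS-B positions.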