# LPCNet

Low complexity implementation of the WaveRNN-based LPCNet algorithm, as described in:

- J.-M. Valin, J. Skoglund, [LPCNet: Improving Neural Speech Synthesis Through Linear Prediction](https://jmvalin.ca/papers/lpcnet_icassp2019.pdf), *Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, arXiv:1810.11846, 2019.
- J.-M. Valin, U. Isik, P. Smaragdis, A. Krishnaswamy, [Neural Speech Synthesis on a Shoestring: Improving the Efficiency of LPCNet](https://jmvalin.ca/papers/improved_lpcnet.pdf), *Proc. ICASSP*, arXiv:2106.04129, 2022.
- K. Subramani, J.-M. Valin, U. Isik, P. Smaragdis, A. Krishnaswamy, [End-to-end LPCNet: A Neural Vocoder With Fully-Differentiable LPC Estimation](https://jmvalin.ca/papers/lpcnet_end2end.pdf), *Proc. INTERSPEECH*, arXiv:2202.11301, 2022.

For coding/PLC applications of LPCNet, see:

- J.-M. Valin, J. Skoglund, [A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet](https://jmvalin.ca/papers/lpcnet_codec.pdf), *Proc. INTERSPEECH*, arXiv:1903.12087, 2019.
- J. Skoglund, J.-M. Valin, [Improving Opus Low Bit Rate Quality with Neural Speech Synthesis](https://jmvalin.ca/papers/opusnet.pdf), *Proc. INTERSPEECH*, arXiv:1905.04628, 2020.
- J.-M. Valin, A. Mustafa, C. Montgomery, T.B. Terriberry, M. Klingbeil, P. Smaragdis, A. Krishnaswamy, [Real-Time Packet Loss Concealment With Mixed Generative and Predictive Model](https://jmvalin.ca/papers/lpcnet_plc.pdf), *Proc. INTERSPEECH*, arXiv:2205.05785, 2022.
- J.-M. Valin, J. Büthe, A. Mustafa, [Low-Bitrate Redundancy Coding of Speech Using a Rate-Distortion-Optimized Variational Autoencoder](https://jmvalin.ca/papers/valin_dred.pdf), *Proc. ICASSP*, arXiv:2212.04453, 2023. ([blog post](https://www.amazon.science/blog/neural-encoding-enables-more-efficient-recovery-of-lost-audio-packets))

# Introduction

Work-in-progress software for researching low CPU complexity algorithms for speech synthesis and compression by applying linear prediction techniques to WaveRNN. High quality speech can be synthesised on regular CPUs (around 3 GFLOP) with SIMD support (SSE2, SSSE3, AVX, AVX2/FMA, and NEON are currently supported). The code also supports very low bitrate compression at 1.6 kb/s.

The BSD-licensed software is written in C and Python/Keras. For training, a GTX 1080 Ti or better is recommended.

This software is an open source starting point for LPCNet/WaveRNN-based speech synthesis and coding.

# Using the existing software

You can build the code using:

```
./autogen.sh
./configure
make
```
Note that the autogen.sh script is used when building from Git and will automatically download the latest model
(models are too large to put in Git). By default, LPCNet will attempt to use 8-bit dot product instructions on AVX\*/Neon to
speed up inference. To disable that (e.g. to avoid quantization effects when retraining), add --disable-dot-product to the
configure script. LPCNet does not yet have a complete implementation for some of the integer operations on the ARMv7
architecture, so for now you will also need --disable-dot-product to successfully compile on 32-bit ARM.

It is highly recommended to set the CFLAGS environment variable to enable AVX or NEON *prior* to running configure; otherwise
no vectorization will take place and the code will be very slow. On a recent x86 CPU, something like
```
export CFLAGS='-Ofast -g -march=native'
```
should work. On ARM, you can enable Neon with:
```
export CFLAGS='-Ofast -g -mfpu=neon'
```
While not strictly required, the -Ofast flag will help with auto-vectorization, especially for dot products that
cannot be optimized without -ffast-math (which -Ofast enables). Additionally, -falign-loops=32 has been shown to
help on x86.

You can test the capabilities of LPCNet using the lpcnet\_demo application. To encode a file:
```
./lpcnet_demo -encode input.pcm compressed.bin
```
where input.pcm is a 16-bit (machine endian) PCM file sampled at 16 kHz. The raw compressed data (no header)
is written to compressed.bin and consists of 8 bytes per 40-ms packet.

To decode:
```
./lpcnet_demo -decode compressed.bin output.pcm
```
where output.pcm is also 16-bit, 16 kHz PCM.

Alternatively, you can run the uncompressed analysis/synthesis using -features
instead of -encode and -synthesis instead of -decode.
The same functionality is available in the form of a library. See include/lpcnet.h for the API.
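As a rough sketch of what using the library looks like, the program below runs a file through an encode/decode round trip, one packet at a time. The names used here (LPCNetEncState, lpcnet\_encode(), LPCNET\_PACKET\_SAMPLES, LPCNET\_COMPRESSED\_SIZE, and so on) are assumed from include/lpcnet.h; check the header you actually build against, since the API can differ between versions.

```
/* Minimal round-trip sketch with the LPCNet library API.
   Assumes lpcnet_encode()/lpcnet_decode() consume/produce one
   40-ms packet: LPCNET_PACKET_SAMPLES 16-bit samples in,
   LPCNET_COMPRESSED_SIZE bytes out (see include/lpcnet.h). */
#include <stdio.h>
#include "lpcnet.h"

int main(void) {
    LPCNetEncState *enc = lpcnet_encoder_create();
    LPCNetDecState *dec = lpcnet_decoder_create();
    short pcm_in[LPCNET_PACKET_SAMPLES];    /* 16-bit, 16 kHz input */
    short pcm_out[LPCNET_PACKET_SAMPLES];   /* decoded output */
    unsigned char packet[LPCNET_COMPRESSED_SIZE];

    FILE *fin = fopen("input.pcm", "rb");
    FILE *fout = fopen("output.pcm", "wb");
    if (fin == NULL || fout == NULL) return 1;
    /* Encode and immediately decode, one 40-ms packet at a time. */
    while (fread(pcm_in, sizeof(short), LPCNET_PACKET_SAMPLES, fin)
           == LPCNET_PACKET_SAMPLES) {
        lpcnet_encode(enc, pcm_in, packet);
        lpcnet_decode(dec, packet, pcm_out);
        fwrite(pcm_out, sizeof(short), LPCNET_PACKET_SAMPLES, fout);
    }
    fclose(fin);
    fclose(fout);
    lpcnet_encoder_destroy(enc);
    lpcnet_decoder_destroy(dec);
    return 0;
}
```
A program like this can be compiled against the library produced by the build above; the exact include and link flags depend on where you installed it.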
To try packet loss concealment (PLC), you first need a PLC model, which you can get with:
```
./download_model.sh plc-3b1eab4
```
or (for the PLC challenge submission):
```
./download_model.sh plc_challenge
```
PLC can be tested with:
```
./lpcnet_demo -plc_file noncausal_dc error_pattern.txt input.pcm output.pcm
```
where error_pattern.txt is a text file with one entry per 20-ms packet, with 1 meaning "packet lost" and 0 meaning "packet not lost".
noncausal_dc is the non-causal variant (5-ms look-ahead) with special handling for DC offsets. It's also possible to use "noncausal", "causal",
or "causal_dc".
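If you don't have a real loss trace, a small helper can generate a synthetic one. The sketch below is a hypothetical generator (the packet count, fixed seed, and 10% loss rate are arbitrary choices, and one-entry-per-line is an assumed layout consistent with the per-packet format described above).

```
/* Hypothetical helper: write a synthetic error pattern for testing PLC.
   One entry per 20-ms packet; 1 = packet lost, 0 = packet received. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    FILE *f = fopen("error_pattern.txt", "w");
    int num_packets = 1500;   /* 30 s of audio at 20 ms per packet */
    if (f == NULL) return 1;
    srand(42);                /* fixed seed for a reproducible pattern */
    for (int i = 0; i < num_packets; i++) {
        fprintf(f, "%d\n", rand() % 10 == 0);  /* ~10% random loss */
    }
    fclose(f);
    return 0;
}
```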
# Training a new model

This codebase is also meant for research, and it is possible to train new models. These are the steps to do that:

1. Set up a Keras system with GPU.

1. Generate training data:
   ```
   ./dump_data -train input.s16 features.f32 data.s16
   ```
   where the first file contains 16 kHz 16-bit raw PCM audio (no header) and the other files are output files. This program makes several passes over the data with different filters to generate a large amount of training data.

1. Now that you have your files, train with:
   ```
   python3 training_tf2/train_lpcnet.py features.f32 data.s16 model_name
   ```
   and it will generate an h5 file for each iteration, with model\_name as prefix. If it stops with a
   "Failed to allocate RNN reserve space" message, try specifying a smaller --batch-size for train\_lpcnet.py.

1. You can synthesise speech with Python and your GPU card (very slow):
   ```
   ./dump_data -test test_input.s16 test_features.f32
   ./training_tf2/test_lpcnet.py lpcnet_model_name.h5 test_features.f32 test.s16
   ```

1. Or with C on a CPU (C inference is much faster). First extract the model files nnet\_data.h and nnet\_data.c:
   ```
   ./training_tf2/dump_lpcnet.py lpcnet_model_name.h5
   ```
   and move the generated nnet\_data.\* files to the src/ directory.
   Then you just need to rebuild the software and use lpcnet\_demo as explained above.

# Speech Material for Training

Suitable training material can be obtained from [Open Speech and Language Resources](https://www.openslr.org/). See the datasets.txt file for details on suitable training data.

# Reading Further

1. [LPCNet: DSP-Boosted Neural Speech Synthesis](https://people.xiph.org/~jm/demo/lpcnet/)
1. [A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet](https://people.xiph.org/~jm/demo/lpcnet_codec/)
1. Sample model files (check compatibility): https://media.xiph.org/lpcnet/data/