AudioSimilarity uses Mel-Frequency Cepstral Coefficients (MFCC).
The processing steps are:
- Load the music samples (WAV format) into an array
- Split the signal into 2-second windows overlapping each other by 1 second
- Apply a Hann window
- Compute the FFT (Fast Fourier Transform) to get the frequency spectrum
- Filter with 58 triangular filters spaced along the Mel scale
- Compute the distance between all tracks using the resulting 57 coefficients (the first one is not representative, so it is dropped)
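The steps above can be sketched in Python with NumPy/SciPy. This is a minimal illustration of the pipeline, not the project's actual code: all function and parameter names (`mfcc`, `mel_filterbank`, `win_s`, `hop_s`, etc.) are my own choices.

```python
import numpy as np
from scipy.fft import rfft, dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters whose centers are evenly spaced on the Mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for b in range(left, center):      # rising edge of the triangle
            fb[i, b] = (b - left) / max(center - left, 1)
        for b in range(center, right):     # falling edge of the triangle
            fb[i, b] = (right - b) / max(right - center, 1)
    return fb

def mfcc(signal, sample_rate, win_s=2.0, hop_s=1.0, n_filters=58):
    win = int(win_s * sample_rate)          # 2-second window
    hop = int(hop_s * sample_rate)          # 1-second overlap between windows
    hann = np.hanning(win)
    fb = mel_filterbank(n_filters, win, sample_rate)
    coeffs = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win] * hann        # Hann window
        power = np.abs(rfft(frame)) ** 2                # power spectrum via FFT
        energies = np.log(fb @ power + 1e-10)           # log Mel-band energies
        c = dct(energies, norm='ortho')[1:]             # DCT, drop 1st coefficient
        coeffs.append(c)
    return np.array(coeffs)    # shape: (n_windows, n_filters - 1) = (n_windows, 57)
```

Each row of the result is one 57-coefficient vector per 2-second window; the averaging and distance steps then operate on these rows.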
I'm aware that for this first version some results are great and others less so. If you have any tips or ideas, please share them: there isn't much information available on this topic, and it's difficult to know whether we're going in the right direction.
The script does find tracks with minor differences, such as the same song without one instrument, or a shorter edit. But sometimes it changes genre radically, matching a calm track with a much more rhythmic one.
This problem might be due to one of the following:
- Is the window duration (2 seconds, with a 1-second overlap) too short or too long?
- Does averaging the coefficients over all windows lose too much information (method: AudioSimilarity.AudioSimilarity.mfcc())? Would it be better to analyse at 25%, 50%, and 75% of the track and compare 3×57 coefficients?
- What should we do about silence in a track? Currently the script skips silence at the beginning of the track, but only pure silence (sample value 0). Should we add a threshold?
- Are 57 coefficients enough? Should there be more, or fewer?
- Should the filters favor low-pitched or high-pitched sounds?
- Should we use more coefficients on the Mel scale than on a linear scale, to emphasise low frequencies?
- Is there a better way to calculate the distance between two tracks?
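On the averaging and distance questions, here is a small sketch of two alternatives worth comparing. This is illustrative code under my own naming, not the project's implementation: `track_signature` collapses the per-window coefficients by averaging (which discards temporal structure and may explain some bad matches), and cosine distance is one alternative to Euclidean distance.

```python
import numpy as np

def track_signature(mfcc_windows):
    # Collapse (n_windows, 57) into a single 57-d vector by averaging.
    # Temporal information is lost here, which may be a source of error.
    return mfcc_windows.mean(axis=0)

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

def cosine_distance(a, b):
    # 0 when the vectors point the same way, regardless of their magnitude.
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Cosine distance ignores the overall magnitude of the coefficient vectors, so two tracks with the same spectral shape but different levels would still match; whether that helps or hurts here is an empirical question.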
I did so much searching on the Internet that I can't remember all my sources, but I'll try to gather them here: