I am currently working on reproducing the results of this paper. The paper is about applying CNNs to speech recognition, where the CNN is used for feature extraction, so the input features need to be represented in a way that lets the network detect them easily.

On page 3, Section III, Subsection A: Organization of the Input Data to the CNN

This section states how the paper represents the input data. The authors use so-called MFSC features, which are MFCC features without the DCT applied. On the next page it is stated that they use the static, delta and delta-delta features, place them next to each other for each frame as [static, delta, delta-delta], and create a spectrogram from this.
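To make the [static, delta, delta-delta] layout concrete, here is a minimal sketch of how I understand the stacking. The regression-based delta formula, the window `N=2`, the edge padding and the names `delta` and `stack_static_delta` are my own assumptions (the passage does not say how the deltas are computed), and the ramp input is placeholder data rather than real MFSC features.

```python
import numpy as np

def delta(features, N=2):
    """Regression-based delta along the time axis (assumed formula).

    features: array of shape (num_frames, num_bands).
    """
    num_frames = features.shape[0]
    denom = 2 * sum(n * n for n in range(1, N + 1))
    # Repeat the first/last frame so edge frames have neighbours.
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    out = np.zeros_like(features, dtype=float)
    for n in range(1, N + 1):
        ahead = padded[N + n : N + n + num_frames]    # c[t + n]
        behind = padded[N - n : N - n + num_frames]   # c[t - n]
        out += n * (ahead - behind)
    return out / denom

def stack_static_delta(static, N=2):
    """Place static, delta and delta-delta side by side for each frame."""
    d = delta(static, N)
    dd = delta(d, N)
    return np.concatenate([static, d, dd], axis=1)

# Placeholder input: 100 frames, 40 filter banks, values rising linearly in time.
num_frames, num_bands = 100, 40
static = np.tile(np.arange(num_frames, dtype=float)[:, None], (1, num_bands))

stacked = stack_static_delta(static)
print(stacked.shape)  # (100, 120)
```

Each row of `stacked` is one frame: columns 0-39 are static, 40-79 delta, 80-119 delta-delta. The delta blocks are differences, so their values sit on a different scale from the static block, which may matter when all three are drawn with a single color scale.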

There exist several different alternatives to organizing these MFSC features into maps for the CNN. First, as shown in Fig. 1(b), they can be arranged as three 2-D feature maps, each of which represents MFSC features (static, delta and delta-delta) distributed along both frequency (using the frequency band index) and time (using the frame number within each context window). In this case, a two-dimensional convolution is performed (explained below) to normalize both frequency and temporal variations simultaneously. Alternatively, we may only consider normalizing frequency variations. In this case, the same MFSC features are organized as a number of one-dimensional (1-D) feature maps (along the frequency band index), as shown in Fig. 1(c). For example, if the context window contains 15 frames and 40 filter banks are used for each frame, we will construct 45 (i.e., 15 times 3) 1-D feature maps, with each map having 40 dimensions, as shown in Fig. 1(c). As a result, a one-dimensional convolution will be applied along the frequency axis. In this paper, we will only focus on this latter arrangement found in Fig. 1(c), a one-dimensional convolution along frequency
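The two arrangements in the quoted passage can be sketched as array shapes. This is only my reading of it: the function names, the edge padding at utterance boundaries, the ordering of the 45 maps (all static frames first, then delta, then delta-delta) and the random placeholder arrays are assumptions, not something the paper specifies in the text quoted above.

```python
import numpy as np

def context_window(feats, center, width=15):
    """Return `width` frames centred on `center` from a (frames, bands) array."""
    half = width // 2
    # Repeat edge frames so windows near the start/end stay full width.
    padded = np.pad(feats, ((half, half), (0, 0)), mode="edge")
    return padded[center : center + width]

def organize(static, d, dd, center, width=15):
    """Build both layouts for one context window."""
    # Shape (3, width, bands): one window per feature type.
    windows = np.stack([context_window(x, center, width) for x in (static, d, dd)])
    # Fig. 1(b) style: three 2-D maps, frequency band x frame.
    maps_2d = windows.transpose(0, 2, 1)
    # Fig. 1(c) style: 3 * width 1-D maps, each along the frequency band index.
    maps_1d = windows.reshape(3 * width, -1)
    return maps_2d, maps_1d

# Placeholder arrays standing in for real static/delta/delta-delta MFSC features.
rng = np.random.default_rng(0)
static = rng.random((100, 40))
d = rng.random((100, 40))
dd = rng.random((100, 40))

center = 50
maps_2d, maps_1d = organize(static, d, dd, center)
print(maps_2d.shape)  # (3, 40, 15)
print(maps_1d.shape)  # (45, 40)
```

With 15 frames and 40 filter banks this gives the 45 maps of 40 dimensions mentioned in the passage, and a 1-D convolution would then run along the second axis of `maps_1d`.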

I tried doing this, but I get a pretty weird spectrogram.

Here is only the static:

[image: spectrogram of the static features]

And here is [static, delta, delta-delta]:

[image: spectrogram of the static, delta and delta-delta features]

I would expect the two to differ somewhat, but not this much. It looks as if the data is placed incorrectly. Or is it supposed to look like this?
