Dr. Hamidreza Baradaran Kashani Home Page

Assistant Professor at Artificial Intelligence Department, Faculty of Computer Engineering, University of Isfahan, Isfahan, Iran


Work Experience

  • Manager of Speech Processing Group
    Research Center for Development of Advanced Technologies (RCDAT), Tehran, Iran (2018-2019)
  • Machine Learning and Speech Processing Engineer
    Research Center for Development of Advanced Technologies (RCDAT), Tehran, Iran (2012-2019)

Educational History

  • PhD in Communication Engineering
    Department of Electrical Engineering, Amirkabir University of Technology, Tehran, Iran (2010-2017)
  • MSc in Communication Engineering
    Department of Electrical Engineering, Ferdowsi University of Mashhad, Mashhad, Iran (2007-2010)
  • BSc in Robotic Engineering
    Shahrood University of Technology, Shahrood, Iran (2003-2007)

Main Research Interests


  • Multimodal Deep Learning
  • Graph Neural Networks
  • Speaker Recognition
  • Voice Conversion
  • Speech Enhancement
  • Speech Emotion Recognition
  • Text Summarization/Sentiment Analysis

Undergraduate

  • Fundamentals of Robotics
    Fall 2019-Fall 2020
  • Signals and Systems
    Spring 2020-Spring 2021
  • Introduction to Natural Language Processing
    Spring 2021
  • Robotics Laboratory
    Spring 2020


Graduate

  • Deep Neural Networks
    Spring 2023
  • Artificial Neural Networks
    Fall 2022
  • Natural Language Processing
    Spring 2020 – Spring 2021 – Spring 2022
  • Digital Speech Processing
    Fall 2020 – Fall 2021 – Fall 2022

Selected Publications

  • Baradaran Kashani, H. & Jazmi, S., "End-to-end deep speaker embedding learning using multi-scale attentional fusion and graph neural networks," Expert Systems with Applications, 2023, 222, 119833.

    Title: End-to-end deep speaker embedding learning using multi-scale attentional fusion and graph neural networks (2023)

    As an attractive research area in biometric authentication, the Text-Independent Speaker Verification (TI-SV) problem aims to determine whether two given unconstrained utterances come from the same speaker. As state-of-the-art solutions, end-to-end approaches using deep neural networks seek to learn a highly discriminative speaker embedding space. In this paper, we propose a novel end-to-end approach for speaker embedding learning by focusing on two crucial factors: the speaker embedder architecture and the objective function. The proposed module in the speaker embedder is composed of an Efficient Multi-resolution feature Representation (EMR) block followed by a Multi-scale Channel Attention Fusion (MCAF) block. The EMR effectively addresses the issue of the fixed-resolution convolutional kernels commonly used in most embedder architectures. Moreover, the MCAF significantly improves on the simple summation-based feature fusion used in residual embedder networks. Regarding the objective function, we guide the speaker embedding space towards learning embedding-to-embedding relations, in addition to the embedding-to-training-class relations employed by most previous methods. To this end, we propose to employ a dynamic graph attention network on top of the proposed embedder to learn all informative relations between embeddings, and then train both the embedder and the graph-based network in an end-to-end manner. We conduct various experiments on a large-scale benchmark dataset, VoxCeleb1&2. The effectiveness of all proposed components is verified through an ablation study. We show superior or competitive performance of the proposed approach compared to seven well-known embedding architectures and 32 SV systems, in terms of two evaluation metrics, EER and minDCF, as well as the number of embedder parameters. …
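
As an illustrative sketch of the attentional-fusion idea, the toy function below replaces a plain summation of two feature branches with a channel-wise sigmoid gate. `attention_fuse` and its fixed gate weights are hypothetical stand-ins for the learned MCAF block, not the paper's implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attention_fuse(local_feat, global_feat, gate_weights):
    """Channel-wise gated fusion: out_c = g_c * local_c + (1 - g_c) * global_c,
    where g_c = sigmoid(gate_weights[c]). In MCAF the gate is learned from
    channel statistics; here it is a fixed toy parameter."""
    assert len(local_feat) == len(global_feat) == len(gate_weights)
    fused = []
    for l, g, w in zip(local_feat, global_feat, gate_weights):
        gate = sigmoid(w)
        fused.append(gate * l + (1.0 - gate) * g)
    return fused

# Channel 0 favours the local branch, channel 1 the global branch.
out = attention_fuse([1.0, 1.0], [0.0, 0.0], [4.0, -4.0])
```

The gate lets each channel interpolate between the two branches, which is the behaviour a plain sum cannot express.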

  • H. B. Kashani, S. Reza and I. S. Rezaei, "On Metric-based Deep Embedding Learning for Text-Independent Speaker Verification," 2020 6th Iranian Conference on Signal Processing and Intelligent Systems (ICSPIS), 2020, pp. 1-7, doi: 10.1109/ICSPIS51611.2020.9349565.

    Title: On Metric-based Deep Embedding Learning for Text-Independent Speaker Verification (2020)

    As a state-of-the-art solution for speaker verification problems, deep neural networks have been successfully employed for extracting speaker embeddings, which represent speaker-informative features. Objective functions, as the supervisors for the learning of discriminative embeddings, play a crucial role for this purpose. In this paper, motivated by the success of metric learning approaches, we investigate four newly proposed metrics in the literature, specifically for the speaker verification problem. For deeper comparisons, we consider these metrics from both main groups of metric-based objectives, i.e. instance-based and proxy-based ones. Considering embeddings as instances, the first group exploits instance-to-instance relations, while the latter associates the instances with proxies as representatives of training samples. Evaluations in terms of Equal Error Rate (EER) are conducted in two conventional …
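
The EER used in these evaluations can be computed from target and impostor score lists with a simple threshold sweep; this is a minimal sketch of the standard definition (interpolation between thresholds is omitted), not the paper's evaluation code.

```python
def compute_eer(target_scores, impostor_scores):
    """Equal Error Rate: sweep thresholds and return the operating point
    where the false-rejection rate (targets scored below the threshold)
    is closest to the false-acceptance rate (impostors scored at or above it)."""
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best = None
    for t in thresholds:
        frr = sum(1 for s in target_scores if s < t) / len(target_scores)
        far = sum(1 for s in impostor_scores if s >= t) / len(impostor_scores)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2.0)
    return best[1]

# Perfectly separable trial scores give an EER of zero.
eer = compute_eer([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])
```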

  • Title: Speech Enhancement via Deep Spectrum Image Translation Network (2019)

    The quality and intelligibility of speech signals are degraded under additive background noise, which is a critical problem for hearing aid and cochlear implant users. Motivated to address this problem, we propose a novel speech enhancement approach using a deep spectrum image translation network. To this end, we suggest a new architecture, called VGG19-UNet, where a deep fully convolutional network known as VGG19 is embedded in the encoder part of an image-to-image translation network, i.e. U-Net. Moreover, we propose a perceptually-modified version of the spectrum image that is represented in the Mel frequency and power-law non-linearity amplitude domains, good approximations of the human auditory perception model. By conducting experiments on a real challenge in speech enhancement, i.e. unseen noise environments, we show that the proposed approach outperforms other enhancement …
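
The perceptual representation described above combines a Mel frequency warping with power-law amplitude compression; a minimal sketch of both operations follows. The HTK-style mel formula is standard, but the compression exponent 0.3 is an assumed value, not necessarily the paper's.

```python
import math

def hz_to_mel(f_hz):
    """HTK-style frequency-to-mel mapping (approximately linear below 1 kHz,
    logarithmic above)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def power_law_compress(power, exponent=0.3):
    """Power-law amplitude non-linearity; an exponent near 0.3 roughly
    mimics perceived loudness growth (assumed value)."""
    return power ** exponent

m = hz_to_mel(700.0)            # 2595 * log10(2), about 781 mel
c = power_law_compress(100.0)   # 100 ** 0.3, about 3.98
```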

  • Title: Image to Image Translation based on Convolutional Neural Network Approach for Speech Declipping (2019)

    Clipping, a common nonlinear distortion, often occurs due to the limited dynamic range of audio recorders. It degrades speech quality and intelligibility and adversely affects the performance of speech and speaker recognition. In this paper, we focus on the enhancement of clipped speech by using a fully convolutional neural network known as U-Net. Motivated by the idea of image-to-image translation, we propose a declipping approach, namely the U-Net declipper, in which the magnitude spectrum images of clipped signals are translated to the corresponding images of clean ones. The experimental results show that the proposed approach outperforms other declipping methods in terms of both quality and intelligibility measures, especially in severe clipping cases. Moreover, the superior performance of the U-Net declipper over well-known declipping methods is verified under additive Gaussian noise conditions.
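
Hard clipping itself is easy to state precisely; the toy functions below simulate the distortion that a declipper is trained to invert and measure its severity. They illustrate the problem setup only, not the U-Net declipper.

```python
def hard_clip(signal, limit):
    """Hard clipping: samples beyond +/-limit saturate at +/-limit."""
    return [max(-limit, min(limit, s)) for s in signal]

def clipped_ratio(signal, limit):
    """Fraction of samples sitting at the clipping level (a rough
    severity measure)."""
    return sum(1 for s in signal if abs(s) >= limit) / len(signal)

x = [0.1, 0.9, -1.5, 2.0, -0.2]
y = hard_clip(x, 1.0)             # [0.1, 0.9, -1.0, 1.0, -0.2]
severity = clipped_ratio(y, 1.0)  # 2 of 5 samples clipped -> 0.4
```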

  • Title: Sequential use of spectral models to reduce deletion and insertion errors in vowel detection (2018)

    From both the speech production and speech perception perspectives, vowels, as syllable nuclei, can be considered the most significant speech events. Detection of vowel events in a speech signal is usually performed by a two-step procedure. First, a temporal objective contour (TOC), a time-varying measure of vowel similarity, is generated from the speech signal. Second, vowel landmarks, the locations of vowel events, are extracted by locating prominent peaks of the TOC. In this paper, by employing some spectral models in a sequential manner, we propose a new framework that directly addresses three possible errors in the vowel detection problem, namely vowel deletion, consonant insertion, and vowel insertion. The proposed framework consists of three main steps. In the first step, two solutions are proposed to substantially reduce the initial vowel deletion error. The first solution is to use the …
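
The second step of the two-step procedure, peak picking on the TOC, can be sketched as follows. The threshold and minimum-distance rule here are illustrative assumptions, not the paper's landmark criteria.

```python
def pick_vowel_landmarks(toc, threshold, min_distance):
    """Pick prominent local maxima of a temporal objective contour (TOC):
    a frame is a landmark if it exceeds `threshold`, is a local maximum,
    and lies at least `min_distance` frames after the previous landmark."""
    landmarks = []
    for i in range(1, len(toc) - 1):
        if toc[i] >= threshold and toc[i] > toc[i - 1] and toc[i] >= toc[i + 1]:
            if not landmarks or i - landmarks[-1] >= min_distance:
                landmarks.append(i)
    return landmarks

# Two prominent humps in a toy contour yield two vowel landmarks.
toc = [0.0, 0.2, 0.8, 0.3, 0.1, 0.05, 0.7, 0.9, 0.4, 0.0]
peaks = pick_vowel_landmarks(toc, threshold=0.5, min_distance=3)  # [2, 7]
```

Raising the threshold trades consonant-insertion errors against vowel-deletion errors, which is exactly the tension the paper's sequential framework targets.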

  • Title: Vowel detection using a perceptually-enhanced spectrum matching conditioned to phonetic context and speaker identity (2017)

    Vowel detection methods usually adopt a two-stage procedure for detecting vowel landmarks. First, a temporal objective contour (TOC), as a time-varying measure of vowel-likeness, is generated from the speech signal. Then, vowel landmarks are extracted by determining outstanding peaks of the TOC. By focusing on the TOC generation stage, this paper presents a new model based on some proposed components called matched filters (MFs). Extraction of the MFs and design of the MF-based model constitute our two main contributions. Motivated by the human auditory system, the MFs are extracted by applying a series of perceptually-based processing operations to the speech spectra of the voiced frames. Accordingly, any factor leading to the variation of the speech spectra will change the extracted MFs, too. So, it is necessary to condition the filters to the factors affecting their characteristics. Based on this fact …
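
A matched-filter score for one voiced frame might be sketched as its best normalized match against a bank of spectral templates; `frame_vowel_score` and the toy filters below are illustrative, not the paper's perceptually-based MFs or their conditioning on phonetic context and speaker identity.

```python
import math

def cosine_similarity(u, v):
    """Normalized correlation between two spectral vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def frame_vowel_score(frame_spectrum, matched_filters):
    """Score a frame by its best match against a bank of vowel matched
    filters; stacking per-frame scores over time yields a TOC."""
    return max(cosine_similarity(frame_spectrum, mf) for mf in matched_filters)

filters = [[1.0, 0.2, 0.1], [0.1, 1.0, 0.3]]      # toy spectral templates
score = frame_vowel_score([0.9, 0.25, 0.1], filters)  # close match to filter 0
```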

  • One problem in background estimation is inherent change in the background, such as waving tree branches, water surfaces, camera shake, and the presence of moving objects in every image. In this paper, a new method for background estimation is proposed based on function approximation in the kernel domain. For this purpose, a Weighted Kernel-based Learning Algorithm (WKLA) is designed. The WKLA is a weighted variant of the kernel least mean square algorithm with the ability to approximate functions in the presence of noise. The proposed background estimation method thus includes two stages: first, a novel algorithm for outlier detection, namely the Fuzzy Outlier Detector (FOD), is applied. Then, the obtained results are fed to the WKLA. The proposed approach can handle scenes containing moving backgrounds, gradual illumination changes, camera vibrations, and non-empty backgrounds. The qualitative results and …
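
To illustrate the flavor of fuzzy outlier weighting, the sketch below assigns each pixel sample a membership weight that decays with its distance from the temporal median, so likely-foreground samples contribute little to the subsequent fit. The membership function is an assumed choice, not the paper's FOD rule.

```python
def fuzzy_outlier_weights(values, scale):
    """Weight each sample in (0, 1] by closeness to the sample median:
    w = 1 / (1 + (d / scale)^2). Distant samples (likely foreground
    outliers) receive near-zero weights."""
    s = sorted(values)
    med = s[len(s) // 2]
    return [1.0 / (1.0 + ((v - med) / scale) ** 2) for v in values]

# A pixel's intensity history with one foreground occlusion (200).
w = fuzzy_outlier_weights([10, 11, 9, 200, 10], scale=10.0)
```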

  • Title: Kernel least mean square features for HMM-based signal recognition (2010)

    In this paper, an attempt is made to propose a new feature extraction method capable of capturing nonlinearities in signals. For this purpose, the Kernel Least Mean Square (KLMS) method is used to extract features from the signal, and, in order to evaluate it, a Hidden Markov Model (HMM) is used to model the extracted feature sequence and to distinguish it from other models. Gaussian mixture models are used for the HMM observations. By introducing noise to the signal, results showed that the recognition rate at the matched noise level is good but can degrade at other SNR values. The method is also compared with Linear Predictive Coding (LPC). Results showed that at low noise levels the proposed feature extraction performs better, but at high noise levels LPC performs better.

  • Title: A novel approach in video scene background estimation (2010)

    This paper presents a novel method for background estimation in a video sequence from the function estimation point of view. The proposed algorithm, called Kernel-based Background Learning (KBL), is designed based on a kernel machine combined with learning schemes. In order to estimate the background using the KBL algorithm, we first interpret foreground samples as outliers relative to the background ones and thus propose an Outlier Separator (OS). Then, the results of the OS algorithm are employed in the KBL method in order to train and estimate the background at each pixel. Experimental results show the high accuracy and effectiveness of the proposed method in background estimation and foreground detection for scenes including moving backgrounds, camera shake, and non-empty backgrounds.
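
For context, the sketch below shows the classic per-pixel temporal-median baseline for background estimation; the paper's KBL method replaces this simple statistic with a kernel-machine regression trained on OS-filtered samples.

```python
def median_background(frames):
    """Per-pixel temporal median over a stack of grayscale frames,
    each frame a flat list of pixel intensities."""
    n_pixels = len(frames[0])
    bg = []
    for p in range(n_pixels):
        values = sorted(f[p] for f in frames)
        bg.append(values[len(values) // 2])
    return bg

# Five frames of a 3-pixel scene; a bright foreground object (200)
# crosses pixel 1 in two frames but does not corrupt the median.
frames = [
    [10, 10, 30],
    [10, 200, 30],
    [10, 200, 30],
    [10, 10, 30],
    [10, 10, 30],
]
bg = median_background(frames)  # [10, 10, 30]
```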


PhD Students

  • Asghar Torki
  • Mina Khaksar
  • Aref Hammadi
  • Seyyed Mohammad Hatefi

  As Advisor
  • Ehsan Eslami


MSc Students

  • Ali Delfardi
  • Reza Shiri
  • Amir Reza Seddighin
  • Siavosh Jazmi
  • Mehdi Darooni
  • Mehrnoosh Alipour
  • Hajar Mazaheri

  As Advisor
  • Sara Azima
  • Elham Ghafori
  • Amir Ardalan Nikandish

Contact Me


Room No. 308, Artificial Intelligence Department, University of Isfahan, Isfahan, Iran