There are many problems related to the management of musical data that have not yet been solved. These are now being extensively considered in the field of music information retrieval (MIR). Topics within the scope of this discussion include the problems of automatically classifying musical instrument sounds and musical phrases/styles, music representation and indexing, estimating the similarity of music using both perceptual and musicological criteria, recognizing music from audio or semantic descriptions, building musical databases, evaluating MIR systems, intellectual property rights, user interfaces, issues related to musical styles and genres, language modeling for music, user needs and expectations, auditory scene analysis, gesture control over musical works, and others. Some of the topics contained within the notion of MIR are covered by the MPEG-7 standard, which defines descriptions of multimedia content in order to support better interpretation of information. It should be stressed that solving these problems requires human assistance and management.
The aim of this project is to demonstrate an application of the MIR domain, namely automatic recognition of musical instruments. The goal is to build a system that is able to identify the instrument(s) present in a mixture of sounds, specifically an amplitude-panned stereo or multi-channel mix. The ideal system should be independent of the music material, the instruments, and the panning information. The system consists of three major steps, namely pre-processing, feature extraction, and classification.
Pre-processing includes separating individual instruments or enhancing certain instrument(s). The ultimate goal of this step is to retain enough of the proper information for a specific instrument while reducing as much noise and interference from other sound-emitting objects as possible. Note that one criterion that in many cases should not be used to evaluate machine listening systems, such as the instrument classification system discussed in this report, is the ability to reproduce the sounds they hear. Recognition often requires much less information than reproduction, and although high-fidelity reproduction may be a useful feature, it is in no way a requirement for good recognition performance (Martin (1999)).
The next step is feature extraction. Various features have been used in different instrument classification systems, such as onset, harmonic relationships, MFCC, wavelets, and others. A review of various features is given in Section Historical Review. Because this system is designed to handle a mixture of sounds, time-domain features are probably not a good choice, as they are difficult to separate and are easily distorted by the pre-processing. Hence this project uses frequency-domain features, namely harmonics and Mel-Frequency Cepstrum Coefficients (MFCC).
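As a rough illustration of the MFCC computation, a minimal sketch for one analysis frame is given below. This is a simplified numpy version; the frame length, the 26 mel filters, and the 13 coefficients are illustrative choices, not necessarily the parameters used in this project.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(frame, sr, n_filters=26, n_coeffs=13):
    # Power spectrum of the windowed frame
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    # Log mel filterbank energies
    energies = mel_filterbank(n_filters, len(frame), sr) @ spectrum
    log_e = np.log(energies + 1e-10)
    # DCT-II to decorrelate the log energies and keep the first coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))
    return basis @ log_e
```

In practice the frame-level coefficients would be computed over overlapping frames of the (separated) instrument signal and fed to the classifier.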
After the features are properly extracted, the system then tries to classify them using a previously trained classifier. Various classification techniques appear in different instrument classification systems, such as ANNs, Bayesian classifiers, and SVMs. A review of various classifiers is given in Section Historical Review. The classifiers used in this project are the Gaussian Mixture Model and 1-Nearest Neighbor.
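The 1-Nearest Neighbor rule simply assigns the label of the closest training feature vector to the query. A minimal sketch, with made-up feature vectors and labels standing in for extracted instrument features:

```python
import numpy as np

def nearest_neighbor(query, train_X, train_y):
    # 1-NN: return the label of the Euclidean-closest training vector
    dists = np.linalg.norm(train_X - query, axis=1)
    return train_y[int(np.argmin(dists))]

# Illustrative 2-D "feature vectors"; real features would be e.g. MFCCs
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [4.8, 5.1]])
train_y = ["piano", "piano", "violin", "violin"]
print(nearest_neighbor(np.array([4.9, 5.0]), train_X, train_y))  # → violin
```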
It is important to keep in mind the semantic differences among the words classification, identification, and recognition. Recognition describes a process of gathering information and making inferences, while classification involves the assignment of a category label. Identification is used to describe recognition tasks in which the "allowed" category choices are not pre-specified (Martin (1999)). However, in this report, these three terms refer to essentially the same thing.
The perception of timbre by humans has been widely studied over the past few decades, and interest in automatic music processing, especially in information extraction from audio files, has grown significantly in recent years. Knowing which instruments play in a musical recording can be useful for a number of applications, some of which are listed below.
1. Music annotation. Over the last two decades, digital music has proliferated. Today, we have internet search engines that can identify text documents matching a user's query, but multimedia documents are opaque to search engines. Today's systems have no way of discovering whether a spoken phrase in a recording matches a query and retrieving the relevant document. Recently, efforts have begun that will result in standardized descriptors, or meta-data formats, for multimedia data (MPEG Requirements Group, 1999). However, for most of the descriptors we would like to use in queries, such as "find the cadenzas of all the Mozart concertos in the database, and sort them by instrument", we have no tools that can extract the relevant information automatically. The producer of the data must add the meta-data by hand. An automatic musical instrument recognition system could be used at the point of production, where sounds are often isolated on separate channels of a multi-track recording system. Meta-data could be added before the sounds are mixed together and preserved throughout the production process. Better yet, recordings could be distributed in structured formats that preserve the isolation of individual sounds until the time of playback, and then techniques like those described here could be applied by the end-user.
2. Music transcription. The process of listening to a piece of music and reconstructing the notated score is known as transcription. More generally, transcription is the process of determining which musical notes were played when (and by what instrument) in a musical recording or performance. In the general case of music played by multiple instruments (or a single polyphonic instrument such as a guitar or piano), the task is one of polyphonic pitch tracking. This is extraordinarily difficult: humans require extensive training in order to transcribe music reliably. However, because transcription is an important tool for music theorists, music psychologists, and musicologists (not to mention music lovers who want to figure out what their favorite artists are playing in rapid passages), it would be wonderful to have tools that could aid the transcription process, or automate it entirely. State-of-the-art polyphonic pitch tracking research demonstrates that the task is made simpler if good, explicit models of the sound sources (the musical instruments) are available (Kashino and Murase (1998b)). By integrating instrument (and voice) recognition with a transcription engine, the end result can be improved dramatically.
3. Structured-audio encoding. As noted above, structured-media formats make automatic multimedia annotation easier. In addition, they give the end user more control over media playback. For example, an audio enthusiast could take better advantage of a seven-speaker playback setup if the audio material was not pre-mixed for stereo playback. Movie soundtracks could include speech tracks in multiple languages, enabling distributors to provide only one version of a movie for international presentation. Amateur musicians could mute a particular part of a recording and play along. Although structured formats provide immense advantages over their non-structured counterparts (such as the current generation of compact discs and videotapes), we currently have no way of automatically adding structure to an unstructured recording. In the future, by combining robust tools from instrument (and voice) recognition, CASA, music transcription, and speech recognition, it may be possible to build fully or partly automated tools for unstructured-to-structured encoding.
4. Synthetic listeners and performers. Endowing computer systems with the ability to recognize sounds and understand the information they convey would enable a host of exciting applications. We could build virtual music instructors (with unending patience!), virtual orchestras to conduct, and virtual performers to jam with.
1.2 State of the Art
Previous work in automatic identification of musical instruments has mainly focused on monophonic recordings (Herrera et al. (2000), Martin (1999), Kostek (2004)). While good results have been achieved for both isolated tones and recordings from commercially available compact discs, these studies assume that only one instrument is present at any moment in time.
Only very few researchers have attempted instrument recognition in polyphonic music, for example Jincahitra (2004) and Eggink and Brown (2004). Polyphonic classification can be categorized as classification with or without separation. Classification systems without separation normally target certain types of music material or a restricted number of simultaneous notes; for example, the system by Eggink and Brown (2004) identifies a solo instrument even in the presence of an accompanying keyboard instrument or orchestra, while the systems proposed by Eggink and Brown (2003), Kashino and Murase (1999), and Kinoshita et al. (1999) were tested only with, typically, 2 or 3 simultaneous notes and relied on identifying the fundamental frequency of every tone. Jincahitra (2004) uses the idea of ICA-BSS to extract features such as MFCC and LPC from statistically independent components in a sound mixture without actually separating the signals.
1.3 Polyphonic Instrument Identiﬁcation with Separation
The goal of this project is to build a system that is able to identify the instrument(s) present in a mixture of sounds, specifically an amplitude-panned stereo or multi-channel mix. The system consists of three major steps, namely pre-processing, feature extraction, and classification. Pre-processing includes separating individual instruments or enhancing certain instruments. The ultimate goal of this step is to retain enough of the proper information for a specific instrument while reducing as much noise and interference from other sound-emitting objects as possible. This project uses a frequency-domain unmixing technique proposed by Avendano and Jot (2004) that attempts to unmix the instruments panned to different angles in an amplitude-panned stereo mix. The technique first calculates a panning index for each point on the Short-Time Fourier Transform (STFT) plane, then groups points with the same panning index and treats each group of points as a separated instrument.
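The panning-index idea can be sketched as follows. This is a simplified version in the spirit of the Avendano and Jot inter-channel similarity measure, not their exact formulation; the window, hop size, and the precise form of the index are illustrative assumptions.

```python
import numpy as np

def stft(x, n_fft=1024, hop=512):
    # Minimal STFT: Hann-windowed frames, one rfft per frame (illustrative)
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames]).T  # freq x time

def panning_index(left, right, n_fft=1024, hop=512):
    # Cross-channel similarity per time-frequency bin:
    # 1 where left and right magnitudes match (center), 0 where only one
    # channel carries energy (hard-panned).
    L, R = stft(left, n_fft, hop), stft(right, n_fft, hop)
    num = 2.0 * np.abs(L * np.conj(R))
    den = np.abs(L) ** 2 + np.abs(R) ** 2 + 1e-12
    psi = num / den
    # Sign from the dominant channel gives a signed index in [-1, 1]:
    # +1 hard left, 0 center, -1 hard right (sign convention is arbitrary)
    sign = np.where(np.abs(L) ** 2 >= np.abs(R) ** 2, 1.0, -1.0)
    return sign * (1.0 - psi)
```

Grouping the time-frequency bins whose index falls near a common value, and inverting the STFT of each group, would then yield the "separated" instrument signals.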
The next step is feature extraction. Because this system is designed to handle a mixture of sounds, time-domain features are probably not a good choice, as they are difficult to separate and are easily distorted by the pre-processing. Hence this project uses frequency-domain features, namely harmonics and Mel-Frequency Cepstrum Coefficients (MFCC).
After the features are properly extracted, the system then tries to classify them using the previously trained classifiers. The classifiers used in this project are the Gaussian Mixture Model and 1-Nearest Neighbor.
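To indicate how the GMM-based classifier operates, the sketch below fits one Gaussian per instrument class and assigns a query to the class with the highest log-likelihood. For brevity it uses a single full-covariance Gaussian (a degenerate one-component GMM) and randomly generated stand-in features; the class labels and data are purely illustrative.

```python
import numpy as np

def fit_gaussian(X):
    # Fit one full-covariance Gaussian (one-component "GMM") per class
    mu = X.mean(axis=0)
    cov = np.cov(X.T) + 1e-6 * np.eye(X.shape[1])  # regularize
    return mu, cov

def log_likelihood(x, mu, cov):
    d = len(mu)
    diff = x - mu
    return -0.5 * (d * np.log(2 * np.pi)
                   + np.log(np.linalg.det(cov))
                   + diff @ np.linalg.inv(cov) @ diff)

def classify(x, models):
    # Choose the class whose Gaussian assigns the highest log-likelihood
    return max(models, key=lambda label: log_likelihood(x, *models[label]))

# Stand-in 3-D "feature vectors" for two hypothetical instrument classes
rng = np.random.default_rng(0)
models = {
    "violin": fit_gaussian(rng.normal(0.0, 1.0, (200, 3))),
    "flute": fit_gaussian(rng.normal(4.0, 1.0, (200, 3))),
}
print(classify(np.array([4.1, 3.9, 4.2]), models))  # → flute
```

A full GMM would mix several such Gaussians per class with EM-estimated weights, which captures multimodal feature distributions at the cost of a more involved training step.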
The rest of this report is organized as follows. The next section gives a historical review. The following section discusses the pre-processing, or unmixing, based on the algorithm by Avendano and Jot (2004). The features to be used and the extraction methods are then given in the section Feature Extraction, followed by Classification, which describes the classification techniques used in this project. System describes the system architecture and the various issues and solutions encountered during system integration. Finally, the conclusion and future work are discussed in the last two sections.