Software Development Blog

Exploration of Big Data, Machine Learning, Natural Language Processing, and other fun projects!

Convert Podcast Audio to Text with Timestamps with C# and Azure Speech API

In this article, we'll show how to get text and timestamps for an audio podcast using C# and the Azure Speech API.

(If you are interested in implementing this in a large scale and want development help, please contact me with details. I may be interested in helping you execute your plan.)

 

First, we'll reference two previous blog articles as building blocks for the final project:

1) Use this article to set up an Azure Speech API instance and obtain your private key:

Convert Speech Data in Audio Files to Text with C# and Azure Speech API

2) Use this article to set up a function that splits your podcast audio into sub-10-minute chunks that the Azure Speech API can consume. Most podcasts are 1 to 2 hours in length, and the Azure Speech API cannot consume that directly.

Using NAudio and C# to split Mp3 Audio Files

 

Next, we'll introduce the Microsoft.Bing.Speech NuGet package. Install it to your Visual Studio C# project.

 

Then, we'll create a class to hold our output text and the timestamp that corresponds to each result from the Azure Speech API.

public class RecognitionPhraseResult
{
	public Confidence Confidence { get; set; }
	public string DisplayText { get; set; }
	public string LexicalForm { get; set; }
	public uint MediaDuration { get; set; }
	public ulong MediaTime { get; set; }
}

List<RecognitionPhraseResult> finalResults = new List<RecognitionPhraseResult>();

Then, we'll populate our final results collection in the OnRecognitionResult function, which is called each time the Speech API yields a chunk of text.

public Task OnRecognitionResult(RecognitionResult args)
{
	var response = args;
	Console.WriteLine();

	Console.WriteLine("--- Phrase result received by OnRecognitionResult ---");

	// Print the recognition status.
	Console.WriteLine("***** Phrase Recognition Status = [{0}] ***", response.RecognitionStatus);
	if (response.Phrases != null)
	{
		if (response.Phrases.Count > 0)
		{
			RecognitionPhraseResult resultItem = new RecognitionPhraseResult();
			resultItem.Confidence = response.Phrases[0].Confidence;
			resultItem.DisplayText = response.Phrases[0].DisplayText;
			resultItem.LexicalForm = response.Phrases[0].LexicalForm;
			resultItem.MediaDuration = response.Phrases[0].MediaDuration;
			// Offset this split's MediaTime by the split's position in the original audio
			// (milliseconds converted to 100-nanosecond units).
			resultItem.MediaTime = response.Phrases[0].MediaTime + CurrentAudioSplitTimeOffsetTotalMillisecond * MsPerHundNanoSec;

			finalResults.Add(resultItem);
		}

		foreach (var result in response.Phrases)
		{
			// Print the recognition phrase display text, duration, and media time.
			Console.WriteLine("{0} (Confidence:{1})", result.DisplayText, result.Confidence);
			Console.WriteLine(result.MediaDuration);
			Console.WriteLine(result.MediaTime);
		}
	}

	Console.WriteLine();
	return CompletedTask;
}
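The handler above relies on a few members defined elsewhere on the class. As a sketch of my assumptions (the member names come from this article's snippets; the exact declarations are mine): MediaTime is reported in 100-nanosecond units, so converting a millisecond offset requires a factor of 10,000.

```csharp
// Task instance returned by OnRecognitionResult; the Bing Speech samples use a similar member.
private static readonly Task CompletedTask = Task.FromResult(true);

// Millisecond offset of the current audio split within the original podcast.
// Set before each split is sent to the Speech API.
private ulong CurrentAudioSplitTimeOffsetTotalMillisecond = 0;

// MediaTime is in 100-nanosecond units: 1 ms = 10,000 of them.
private const ulong MsPerHundNanoSec = 10000;
```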

 

Next, we'll use the Microsoft.Bing.Speech NuGet package to set up a speech client that we call to get the text output. This code also shows the call that splits the large podcast MP3 audio into smaller sections. The foreach loops over the audio splits and passes each one to the Speech API.

Uri serviceUrl = new Uri(@"wss://speech.platform.bing.com/api/service/recognition/continuous");
var preferences = new Preferences("en-US", serviceUrl, new CognitiveServicesAuthorizationProvider("YOUR_AZURE_KEY_HERE"));

// Create a speech client
using (var speechClient = new SpeechClient(preferences))
{
	//speechClient.SubscribeToPartialResult(this.OnPartialResult);
	speechClient.SubscribeToRecognitionResult(this.OnRecognitionResult);


	List<AudioSplitOutput> audioSplits = SplitMp3File("Split00001", audioFile, @"C:\temp");

	foreach(var audioSplit in audioSplits)
	{
		CurrentAudioSplitTimeOffsetTotalMillisecond = audioSplit.AudioTimeOffsetTotalMilliseconds;

		using (var reader = new Mp3FileReader(audioSplit.FileName))
		{
			using (Stream audioOutStream = new MemoryStream())
			{
				using (var writer = new WaveFileWriter(audioOutStream, new WaveFormat()))
				{
					var buf = new byte[4096];
					for (;;)
					{
						var cnt = reader.Read(buf, 0, buf.Length);
						if (cnt == 0) break;
						writer.Write(buf, 0, cnt);
					}

					audioOutStream.Seek(0, SeekOrigin.Begin);

					var deviceMetadata = new DeviceMetadata(DeviceType.Near, DeviceFamily.Desktop, NetworkType.Ethernet, OsName.Windows, "1607", "Dell", "T3600");
					var applicationMetadata = new ApplicationMetadata("SampleApp", "1.0.0");
					var requestMetadata = new RequestMetadata(Guid.NewGuid(), deviceMetadata, applicationMetadata, "SampleAppService");

					await speechClient.RecognizeAsync(new SpeechInput(audioOutStream, requestMetadata), this.cts.Token).ConfigureAwait(false);
				}

			}
		}
	}
}

 

The Azure Speech API only accepts WAV audio format, so each of the sub-10-minute MP3 splits is converted to WAV format before the data is passed to the Azure Speech API.

 

Lastly, your final results collection of RecognitionPhraseResult objects can be stored or presented as your needs dictate. My test application converted the list to JSON and wrote it to a file for some basic persistence. The code for that is as follows:

var finalResultStr = JsonConvert.SerializeObject(finalResults);

using (System.IO.StreamWriter outputFile = new System.IO.StreamWriter(@"finalResult.txt"))
{
	outputFile.Write(finalResultStr);
}
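When you load those results back, note that MediaTime and MediaDuration are in 100-nanosecond units, which is the same resolution as .NET TimeSpan ticks. A minimal sketch of turning a stored MediaTime into a readable timestamp (the sample value is my own):

```csharp
using System;

class TimestampDemo
{
	static void Main()
	{
		// MediaTime is in 100-ns units, the same resolution as TimeSpan ticks.
		ulong mediaTime = 1230000000; // 123 seconds into the audio
		TimeSpan position = TimeSpan.FromTicks((long)mediaTime);
		Console.WriteLine(position.ToString(@"hh\:mm\:ss")); // prints 00:02:03
	}
}
```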

 

An example of my final output for a sample podcast can be found at StyleMyImage.com/CodingBlocks

 

Using NAudio and C# to split Mp3 Audio Files

I need to split some large podcast audio files into smaller 10-minute chunks in order to use Azure's Speech-to-Text API. We'll use NAudio and Visual Studio C# to accomplish this.

 

First, install the NuGet packages for NAudio and NAudio.Lame to your Visual Studio project.

Next, we'll create a class to hold the filenames and media offset of our split files.

public class AudioSplitOutput
{
	public string FileName;
	public ulong AudioTimeOffsetTotalMilliseconds = 0;
}

 

Then, we'll create a function that takes in a source MP3 audio file, a destination path, and a base name for the splits, and returns a collection of the AudioSplitOutput objects that it creates.

private static List<AudioSplitOutput> SplitMp3File(string baseNameForSplits, string sourceFileName, string destinationPath)
{
	List<AudioSplitOutput> outputAudioSplitList = new List<AudioSplitOutput>();

	int splitLength = 480; // seconds

	int secsOffset = 0;
	int splitIndex = 0;

	using (var reader = new Mp3FileReader(sourceFileName))
	{

		FileStream writer = null;
		Action createWriter = new Action(() => {
			string newBaseNameForSplit = baseNameForSplits + "-" + (++splitIndex).ToString();
			string newFileName = Path.Combine(destinationPath, newBaseNameForSplit + ".mp3");
			if(File.Exists(newFileName))
			{
				File.Delete(newFileName);
			}
			writer = File.Create(newFileName);

			AudioSplitOutput audioSplitOutput = new AudioSplitOutput();
			audioSplitOutput.FileName = newFileName;
			audioSplitOutput.AudioTimeOffsetTotalMilliseconds = (ulong)reader.CurrentTime.TotalMilliseconds;
			outputAudioSplitList.Add(audioSplitOutput);
		});

		Mp3Frame frame;
		while ((frame = reader.ReadNextFrame()) != null)
		{
			if (writer == null) createWriter();

			if ((int)reader.CurrentTime.TotalSeconds - secsOffset >= splitLength)
			{
				// time for a new file
				writer.Dispose();
				createWriter();
				secsOffset = (int)reader.CurrentTime.TotalSeconds;
			}

			writer.Write(frame.RawData, 0, frame.RawData.Length);
		}
		if (writer != null) writer.Dispose();
	}
	return outputAudioSplitList;
}

 

Notice how reader.CurrentTime is checked after reading each MP3 frame. When the current time exceeds the split threshold, a new MP3 split file is created via the createWriter action.

 

You could also split based on MP3 file size. In that case, you'd accumulate the frame.RawData.Length value into a variable and check that instead of reader.CurrentTime. When your accumulated written MP3 data exceeds your threshold, split in the same way as above (with a call to the createWriter action).
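A rough sketch of that size-based variant, assuming the same reader, writer, and createWriter as above (bytesWritten and maxSplitBytes are hypothetical locals I'm introducing):

```csharp
long bytesWritten = 0;
long maxSplitBytes = 5000000; // hypothetical ~5 MB threshold

Mp3Frame frame;
while ((frame = reader.ReadNextFrame()) != null)
{
	if (writer == null) createWriter();

	if (bytesWritten >= maxSplitBytes)
	{
		// time for a new file
		writer.Dispose();
		createWriter();
		bytesWritten = 0;
	}

	writer.Write(frame.RawData, 0, frame.RawData.Length);
	bytesWritten += frame.RawData.Length;
}
```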

Convert Speech Data in Audio Files to Text with C# and Azure Speech API

I've been listening to many podcasts recently and they have hundreds of hours of content all stored in audio files as speech. How could we begin to index or search them?


A solution to searching podcast audio is to convert the audio files to text with associated time stamps. We'll index the text and then be able to search the content and retrieve the text and audio slices we are looking for.


First, create a Visual Studio project (a console command-line app is fine), and install the ProjectOxford.SpeechRecognition NuGet package.



Next, spin up a Bing Speech API service in Azure. You can search for it with the "speech" keyword.



After your Speech to Text service spins up, you will want to get the access keys for it. On the dashboard for your instance, click the "Show access keys ..." link. You'll want to copy the access key value into your app.config or the code that you use at run time.
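For example, the key could live in app.config and be read with ConfigurationManager; the "SpeechApiKey" setting name is my own choice, and this requires a project reference to System.Configuration:

```csharp
// app.config:
//   <appSettings>
//     <add key="SpeechApiKey" value="YOUR_AZURE_KEY_HERE" />
//   </appSettings>
using System.Configuration;

static class SpeechConfig
{
	// Read the Azure Speech API key from app.config at run time.
	public static string GetKey()
	{
		return ConfigurationManager.AppSettings["SpeechApiKey"];
	}
}
```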



Here's a code example that creates a service handle from its factory and sends audio data to the service from your source audio file. Be sure to change the audio file name and your service API key in the code example below.

string key = "YOUR_KEY_GOES_HERE";
Console.WriteLine("Key provided is: {0}", key);
Console.Write("Please provide file: ");
string file = @"C:\MyFavoritePodCast.mp3";
Console.WriteLine("File provided is: {0}", file);
var defaultLocale = "en-US";
var mode = SpeechRecognitionMode.LongDictation;
using (var dataClient = SpeechRecognitionServiceFactory.CreateDataClient(mode, defaultLocale, key))
{
	// Event handlers for speech recognition results
	dataClient.OnResponseReceived += OnDataDictationResponseReceivedHandler;
	dataClient.OnPartialResponseReceived += OnPartialResponseReceivedHandler;
	dataClient.OnConversationError += OnConversationErrorHandler;

	using (var fileStream = new FileStream(file, FileMode.Open, FileAccess.Read))
	{
		Console.Write("Processing File");

		var bytesRead = 0;
		var buffer = new byte[1024];
		try
		{
			do
			{
				// Get more audio data to send into the byte buffer.
				bytesRead = fileStream.Read(buffer, 0, buffer.Length);
				// Send the audio data to the service.
				dataClient.SendAudio(buffer, bytesRead);
			}
			while (bytesRead > 0);
		}
		finally
		{
			// We are done sending audio. Final recognition results will arrive in the OnResponseReceived event call.
			dataClient.EndAudio();
		}

		Console.WriteLine();
		Console.Write("Waiting for response");
		// Big sleep to ensure async requests complete.
		Thread.Sleep(25000000);
	}
}


In the Speech to Text code example above we set up some event handler functions, so we need to implement them. Here is an example of how they would look, and how you could print out the text received as output from the Speech to Text service.


private static void OnDataDictationResponseReceivedHandler(object sender, SpeechResponseEventArgs e)
{
	Console.WriteLine();
	Console.WriteLine("--- OnDataDictationResponseReceivedHandler ---");
	switch (e.PhraseResponse.RecognitionStatus)
	{
		case RecognitionStatus.EndOfDictation:
		case RecognitionStatus.DictationEndSilenceTimeout:
			Console.WriteLine("Completed");
			break;
	}
	WriteResponseResult(e);
}

private static void WriteResponseResult(SpeechResponseEventArgs e)
{
	Console.WriteLine();
	if (e.PhraseResponse.Results.Length == 0)
	{
		Console.WriteLine("No phrase response is available.");
	}
	else
	{
		Console.WriteLine("########## Final n-BEST Results ##############");
		for (int i = 0; i < e.PhraseResponse.Results.Length; i++)
		{
			Console.WriteLine(
				"[{0}] Confidence={1}, Text=\"{2}\"",
				i,
				e.PhraseResponse.Results[i].Confidence,
				e.PhraseResponse.Results[i].DisplayText);
		}
		Console.WriteLine();
	}
	done = true;
}

private static void OnIntentHandler(object sender, SpeechIntentEventArgs e)
{
	Console.WriteLine();
	Console.WriteLine("--- Intent received by OnIntentHandler() ---");
	Console.WriteLine("{0}", e.Payload);
	Console.WriteLine();
}

private static void OnPartialResponseReceivedHandler(object sender, PartialSpeechResponseEventArgs e)
{
	Console.WriteLine();
	Console.WriteLine("--- Partial result received by OnPartialResponseReceivedHandler() ---");
	Console.WriteLine("{0}", e.PartialResult);
	Console.WriteLine();
	done = true;
}

private static void OnConversationErrorHandler(object sender, SpeechErrorEventArgs e)
{
	Console.WriteLine();
	Console.WriteLine("--- Error received by OnConversationErrorHandler() ---");
	Console.WriteLine("Error code: {0}", e.SpeechErrorCode.ToString());
	Console.WriteLine("Error text: {0}", e.SpeechErrorText);
	Console.WriteLine();
	done = true;
}


You will certainly want to persist the text output from the Speech to Text service and check for errors, but this gives a great starting example of how to set up the handler functions and process their arguments.


Kaggle: Zillow's Zestimate competition: C# Vowpal Wabbit training and prediction - Part3

Parts 1 and 2 of my Kaggle: Zillow's Zestimate competition blog have been about reading the source data files and preprocessing the records. This is Part 3, which focuses on training and prediction using Vowpal Wabbit.


Be sure to install the Vowpal Wabbit NuGet package to your Visual Studio C# project.


Let's create a VWRecord class that is essentially the same as our Parcel class from Part 1 of this article series, but with the fields annotated with Vowpal Wabbit feature namespace markup.

    public class VWRecord
    {
        [Feature(FeatureGroup = 'a')]
        public float airconditioningtypeid { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float architecturalstyletypeid { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float basementsqft { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float bathroomcnt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float bedroomcnt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float buildingclasstypeid { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float buildingqualitytypeid { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float calculatedbathnbr { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float decktypeid { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float finishedfloor1squarefeet { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float calculatedfinishedsquarefeet { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float finishedsquarefeet12 { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float finishedsquarefeet13 { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float finishedsquarefeet15 { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float finishedsquarefeet50 { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float finishedsquarefeet6 { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float fips { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float fireplacecnt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float fullbathcnt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float garagecarcnt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float garagetotalsqft { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float hashottuborspa { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float heatingorsystemtypeid { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float latitude { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float longitude { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float lotsizesquarefeet { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float poolcnt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float poolsizesum { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float pooltypeid10 { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float pooltypeid2 { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float pooltypeid7 { get; set; }
        [Feature(FeatureGroup = 'a')]
        public string propertycountylandusecode { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float propertylandusetypeid { get; set; }
        [Feature(FeatureGroup = 'a')]
        public string propertyzoningdesc { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float rawcensustractandblock { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float regionidcity { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float regionidcounty { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float regionidneighborhood { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float regionidzip { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float roomcnt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float storytypeid { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float threequarterbathnbr { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float typeconstructiontypeid { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float unitcnt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float yardbuildingsqft17 { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float yardbuildingsqft26 { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float yearbuilt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float numberofstories { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float fireplaceflag { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float structuretaxvaluedollarcnt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float taxvaluedollarcnt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float assessmentyear { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float landtaxvaluedollarcnt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float taxamount { get; set; }
        [Feature(FeatureGroup = 'a')]
        public string taxdelinquencyflag { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float taxdelinquencyyear { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float censustractandblock { get; set; }
    }


Then, we'll create a wrapper that we use to call the Vowpal Wabbit API and hold the VW engine instance handle. You can extend the Init function to try various loss functions, learning parameters, and other VW tuning approaches. A good first loss function for the Zillow Zestimate Kaggle competition is quantile.

    public class VWWrapper
    {
        VW.VowpalWabbit<VWRecord> vw = null;

        public void Init()
        {
            string vwArgs = string.Join(" "
                , "-f vw.model"
                //, "--loss_function=squared"
                , "--loss_function=quantile"
                //, "--loss_function=hinge"
                //, "--loss_function=logistic"
                , "--progress 10000"
                //, "--learning_rate " + learningRate
                //, "--power_t " + powerRates
                //, "--l2 " + l2Value
                //, "--binary"
                , "-b 27"
                );

            vw = new VW.VowpalWabbit<VWRecord>(new VowpalWabbitSettings
            {
                EnableStringExampleGeneration = true,
                Verbose = true,
                Arguments = vwArgs
            });
        }
        public VowpalWabbitPerformanceStatistics GetStats()
        {
            return vw.Native.PerformanceStatistics;
        }
        public VWRecord GetVwRecord(Parcel parcel)
        {
            VWRecord vwRecord = new VWRecord();
            vwRecord.airconditioningtypeid = parcel.airconditioningtypeid;
            vwRecord.architecturalstyletypeid = parcel.architecturalstyletypeid;
            vwRecord.basementsqft = parcel.basementsqft;
            vwRecord.bathroomcnt = parcel.bathroomcnt;
            vwRecord.bedroomcnt = parcel.bedroomcnt;
            vwRecord.buildingclasstypeid = parcel.buildingclasstypeid;
            vwRecord.buildingqualitytypeid = parcel.buildingqualitytypeid;
            vwRecord.calculatedbathnbr = parcel.calculatedbathnbr;
            vwRecord.decktypeid = parcel.decktypeid;
            vwRecord.calculatedfinishedsquarefeet = parcel.calculatedfinishedsquarefeet;
            vwRecord.finishedsquarefeet12 = parcel.finishedsquarefeet12;
            vwRecord.finishedsquarefeet13 = parcel.finishedsquarefeet13;
            vwRecord.finishedsquarefeet15 = parcel.finishedsquarefeet15;
            vwRecord.finishedsquarefeet50 = parcel.finishedsquarefeet50;
            vwRecord.finishedsquarefeet6 = parcel.finishedsquarefeet6;
            vwRecord.fips = parcel.fips;
            vwRecord.fireplacecnt = parcel.fireplacecnt;
            vwRecord.fullbathcnt = parcel.fullbathcnt;
            vwRecord.garagecarcnt = parcel.garagecarcnt;
            vwRecord.garagetotalsqft = parcel.garagetotalsqft;
            vwRecord.hashottuborspa = parcel.hashottuborspa;
            vwRecord.heatingorsystemtypeid = parcel.heatingorsystemtypeid;
            vwRecord.latitude = parcel.latitude;
            vwRecord.longitude = parcel.longitude;
            vwRecord.lotsizesquarefeet = parcel.lotsizesquarefeet;
            vwRecord.poolcnt = parcel.poolcnt;
            vwRecord.poolsizesum = parcel.poolsizesum;
            vwRecord.pooltypeid10 = parcel.pooltypeid10;
            vwRecord.pooltypeid2 = parcel.pooltypeid2;
            vwRecord.pooltypeid7 = parcel.pooltypeid7;
            vwRecord.propertycountylandusecode = parcel.propertycountylandusecode;
            vwRecord.propertylandusetypeid = parcel.propertylandusetypeid;
            vwRecord.propertyzoningdesc = parcel.propertyzoningdesc;
            vwRecord.rawcensustractandblock = parcel.rawcensustractandblock;
            vwRecord.regionidcity = parcel.regionidcity;
            vwRecord.regionidcounty = parcel.regionidcounty;
            vwRecord.regionidneighborhood = parcel.regionidneighborhood;
            vwRecord.regionidzip = parcel.regionidzip;
            vwRecord.roomcnt = parcel.roomcnt;
            vwRecord.storytypeid = parcel.storytypeid;
            vwRecord.threequarterbathnbr = parcel.threequarterbathnbr;
            vwRecord.typeconstructiontypeid = parcel.typeconstructiontypeid;
            vwRecord.unitcnt = parcel.unitcnt;
            vwRecord.yardbuildingsqft17 = parcel.yardbuildingsqft17;
            vwRecord.yardbuildingsqft26 = parcel.yardbuildingsqft26;
            vwRecord.yearbuilt = parcel.yearbuilt;
            vwRecord.numberofstories = parcel.numberofstories;
            vwRecord.fireplaceflag = parcel.fireplaceflag;
            vwRecord.structuretaxvaluedollarcnt = parcel.structuretaxvaluedollarcnt;
            vwRecord.taxvaluedollarcnt = parcel.taxvaluedollarcnt;
            vwRecord.assessmentyear = parcel.assessmentyear;
            vwRecord.landtaxvaluedollarcnt = parcel.landtaxvaluedollarcnt;
            vwRecord.taxamount = parcel.taxamount;
            vwRecord.taxdelinquencyflag = parcel.taxdelinquencyflag;
            vwRecord.taxdelinquencyyear = parcel.taxdelinquencyyear;
            vwRecord.censustractandblock = parcel.censustractandblock;
            return vwRecord;
        }
        public void Train(Parcel parcel, float label)
        {
            VWRecord vwRecord = GetVwRecord(parcel);
            SimpleLabel simpleLabel = new SimpleLabel() { Label = label };
            // Comment this in if you want to see the VW serialized input records:
            //var str = vw.Serializer.Create(vw.Native).SerializeToString(vwRecord, simpleLabel);
            //Console.WriteLine(str);
            vw.Learn(vwRecord, simpleLabel);
        }

        public float Predict(Parcel parcel)
        {
            VWRecord vwRecord = GetVwRecord(parcel);
            return vw.Predict(vwRecord, VowpalWabbitPredictionType.Scalar);
        }

        public void SaveModel()
        {
            vw.Native.SaveModel();
        }
    }

Notice the GetVwRecord function that maps the Parcel class to the VWRecord class. The VWRecord class (and its annotations) is needed to call Predict and Learn on the VW engine instance.


Please refer back to Part 2 of this article series and insert a vwWrapper.Train call in the transactionList training loop like this:

Parcel parcel = null;
if (parcelMap.TryGetValue(transactionTrain.parcelid, out parcel))
{
	vwWrapper.Train(parcel, transactionTrain.logerror);
}
else
{
	Console.WriteLine("ERROR: TRAIN: Failed to find parcelMap item for parcel id: " + transactionTrain.parcelid);
}


You'll also want to insert a vwWrapper.Predict call in the predictionList loop like this:

Parcel parcel = null;
if (parcelMap.TryGetValue(prediction.parcelid, out parcel))
{
	float predictedValue = vwWrapper.Predict(parcel);
	prediction.LogErr201610 = predictedValue;
	prediction.LogErr201611 = predictedValue;
	prediction.LogErr201612 = predictedValue;
	prediction.LogErr201710 = predictedValue;
	prediction.LogErr201711 = predictedValue;
	prediction.LogErr201712 = predictedValue;
}
else
{
	Console.WriteLine("ERROR: TEST: Failed to find parcelMap item for parcel id: " + prediction.parcelid);
}


At this point, you should be able to use the code from Parts 1, 2, and 3 of this blog article series to fully train and predict a submission for the $1.2M Zillow Zestimate Kaggle competition using Vowpal Wabbit and C#.


The total run time of this solution is under 10 minutes on a very modest Windows laptop.

Kaggle: Zillow's Zestimate competition: C# Classes for reading source data files - Part2

In my previous Kaggle: Zillow's Zestimate competition article (Part 1), we loaded the Parcel data source file into a Dictionary map.


Now we will process the Transaction and Prediction data files. We'll start by making classes to hold the rows in Transaction and Prediction data files.

    public class Prediction
    {
        // ParcelId,201610,201611,201612,201710,201711,201712
        public int parcelid;
        public float LogErr201610;
        public float LogErr201611;
        public float LogErr201612;
        public float LogErr201710;
        public float LogErr201711;
        public float LogErr201712;
    }
    public class Transaction
    {
        // parcelid,logerror,transactiondate
        public int parcelid;
        public float logerror;
        public DateTime transactiondate;
    }


Initially, I tried to put these two files in a Dictionary map, but there were duplicate parcel ID values in both the training and prediction data sets. We'll be iterating through them, and we don't really need a parcel ID lookup for them (yet), so we'll load these two data sources into simple Lists.

string train_2016 = @"C:\kaggle\zillow\train_2016.csv";
string sample_submission = @"C:\kaggle\zillow\sample_submission.csv";
List<Prediction> predictionList = dataSource.GetPredictionList(sample_submission);
List<Transaction> transactionList = dataSource.GetTransactionList(train_2016);


In GetPredictionList and GetTransactionList, open a StreamReader on these two files and loop through the lines like this:

List<Transaction> output = new List<Transaction>();
while (!fileReader.EndOfStream)
{
	try
	{
		// Process a row
		string line = fileReader.ReadLine();
		string[] fields = line.Split(',');
		Transaction row = new Transaction();
		int.TryParse(fields[0], out row.parcelid);
		float.TryParse(fields[1], out row.logerror);
		DateTime.TryParse(fields[2], out row.transactiondate);
		output.Add(row);
	}
	catch (Exception ex)
	{
		Console.WriteLine("ERROR: Failed to parse row: " + ex.Message);
	}
}
return output;


Parsing the sample_submission.csv file is not very interesting, since the example log error values are all 0. We still need to process this file to get the list of prediction parcel IDs: the test set parcel IDs are not provided anywhere else, so we obtain them from the sample_submission.csv file.

                    Prediction row = new Prediction();
                    int.TryParse(fields[0], out row.parcelid);
                    float.TryParse(fields[1], out row.LogErr201610);
                    float.TryParse(fields[2], out row.LogErr201611);
                    float.TryParse(fields[3], out row.LogErr201612);
                    float.TryParse(fields[4], out row.LogErr201710);
                    float.TryParse(fields[5], out row.LogErr201711);
                    float.TryParse(fields[6], out row.LogErr201712);


Next, we'll iterate through our training set and pick out the parcel properties associated with each training transaction:

foreach (var transactionTrain in transactionList)
{
	Parcel parcel = null;
	if (parcelMap.TryGetValue(transactionTrain.parcelid, out parcel))
	{
		// Train record here
	}
	else
	{
		Console.WriteLine("ERROR: TRAIN: Failed to find parcelMap item for parcel id: " + transactionTrain.parcelid);
	}
}


Lastly, we'll loop through the prediction list and pick out the associated parcel properties. We'll call a (not yet implemented) prediction function and assign that prediction value to all of the log error fields.

foreach (var prediction in predictionList)
{
	Parcel parcel = null;
	if (parcelMap.TryGetValue(prediction.parcelid, out parcel))
	{
		float predictedValue = vwWrapper.Predict(parcel);
		prediction.LogErr201610 = predictedValue;
		prediction.LogErr201611 = predictedValue;
		prediction.LogErr201612 = predictedValue;
		prediction.LogErr201710 = predictedValue;
		prediction.LogErr201711 = predictedValue;
		prediction.LogErr201712 = predictedValue;
	}
	else
	{
		Console.WriteLine("ERROR: TEST: Failed to find parcelMap item for parcel id: " + prediction.parcelid);
	}
}

Clearly, we'll want to enhance this to produce distinct prediction values for each prediction date field. Still, this gives us a nice initial top-level framework to load our data, iterate the training data, iterate the test data, and produce a prediction result. This prediction result will make a valid submission for the Kaggle competition.


Finally, we'll create a new output file containing our submission data, which we can upload directly to the Kaggle competition submission form:

// Now, write out our predictionList to a prediction output file
string predictionFileName = @"C:\kaggle\zillow\OutputPrediction.txt";
using (System.IO.StreamWriter file = new System.IO.StreamWriter(predictionFileName))
{
	file.WriteLine("ParcelId,201610,201611,201612,201710,201711,201712");
	StringBuilder sbOrderLine = new StringBuilder();
	foreach (var prediction in predictionList)
	{
		sbOrderLine.Clear();
		sbOrderLine.Append(prediction.parcelid);
		sbOrderLine.Append(",");
		sbOrderLine.Append(prediction.LogErr201610);
		sbOrderLine.Append(",");
		sbOrderLine.Append(prediction.LogErr201611);
		sbOrderLine.Append(",");
		sbOrderLine.Append(prediction.LogErr201612);
		sbOrderLine.Append(",");
		sbOrderLine.Append(prediction.LogErr201710);
		sbOrderLine.Append(",");
		sbOrderLine.Append(prediction.LogErr201711);
		sbOrderLine.Append(",");
		sbOrderLine.Append(prediction.LogErr201712);
		file.WriteLine(sbOrderLine.ToString());
	}
}


More to come with feature modeling, Vowpal Wabbit implementation, and non-trivial prediction results.