I was interested in generating some dummy data for load testing in MS Azure and came up with a pretty nifty way to generate lots and lots of data using U-SQL. The tip is simply to create a small custom U-SQL extractor and use it to "extract" from a dummy file.
First I created a dummy file… literally. In my input folder on my local machine I just created a blank, 0-byte file, purely to stop the custom extractor complaining that I'm not actually going to use an input file.
The custom extractor is written in C# and is built around a simple for loop:

for (Int64 i = _startVal; i < _endVal; i += _step)

On each iteration this loop generates a single-column row using the output.Set method. The full code for the code-behind file is:
using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

namespace CustomExtractor
{
    [SqlUserDefinedExtractor(AtomicFileProcessing = false)]
    public class GenerateSeries : IExtractor
    {
        private readonly Int64 _startVal;
        private readonly Int64 _endVal;
        private readonly Int64 _step;

        public GenerateSeries(Int64 startVal, Int64 endVal, Int64 step)
        {
            this._startVal = startVal;
            this._endVal = endVal;
            this._step = step;
        }

        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
        {
            // The input file is never read; each iteration emits one row
            // with the counter value in the first (and only) column.
            for (Int64 i = _startVal; i < _endVal; i += _step)
            {
                output.Set<Int64>(0, i);
                yield return output.AsReadOnly();
            }
            yield break;
        }
    }
}
Once you have this in place you can call it from your U-SQL script:
DECLARE @in string = "dummyfile.csv";
DECLARE @out string = "result.csv";

@input =
    EXTRACT mycol Int64
    FROM @in
    USING new CustomExtractor.GenerateSeries(1, 1000000000, 1);

@t =
    SELECT mycol,
           mycol + 2 AS AnotherColumn
    FROM @input;

OUTPUT @t
TO @out
USING Outputters.Csv(outputHeader : true);
This script calls the CustomExtractor.GenerateSeries extractor and passes three arguments, which in turn become the start, end and step values used in the C# for loop. So these can be customised pretty easily.
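For example (my own illustrative variation, not from the original script), generating the even numbers below 20 million is just a matter of swapping the constructor arguments:

@input =
    EXTRACT mycol Int64
    FROM @in
    USING new CustomExtractor.GenerateSeries(0, 20000000, 2);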
The @t select statement allows you to inject additional columns. This is where you could generate columns for random dates, products, quantities etc. on a pretty major scale if you wanted.
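As a rough sketch (these column expressions are my own illustration, not part of the original script), U-SQL column expressions are ordinary C# expressions, so you can derive extra columns straight from mycol:

@t =
    SELECT mycol,
           new DateTime(2017, 1, 1).AddDays(mycol % 365) AS OrderDate,   // spread rows across a year of dates
           "Product-" + (mycol % 50).ToString() AS ProductName,          // 50 distinct product names
           (int)(mycol % 100) + 1 AS Quantity                            // quantities from 1 to 100
    FROM @input;

These values are deterministic rather than truly random; for genuinely random data you would add a helper method to the code-behind file and call it from the SELECT.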
I first ran this locally on my machine and filled up my hard drive pretty quickly, so I switched to my Azure Data Lake Store where space is no issue. With 2 vertices the query took 20 seconds to prepare, sat for 7 seconds in the queue, but ran for 16 minutes.
The final result was an easy-to-customise file of 1 billion rows, weighing in at about 20GB in my Azure Data Lake Store account.