How to generate 1 billion rows using U-SQL


I was interested in generating some dummy data for load testing in MS Azure and came up with a pretty nifty way to generate lots and lots of data using U-SQL.  The trick is simply to create a small U-SQL custom extractor and use it to "extract" from a dummy file.

First I created a dummy file… literally.  In my input folder on my local machine I created a blank, 0-byte file, just to stop the custom extractor complaining that I’m not actually going to use an input file.
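On a Unix-like shell the placeholder can be created with `touch` (the file name here matches the `@in` variable used in the U-SQL script later; Windows equivalents are noted in the comments):

```shell
# Create an empty, 0-byte placeholder file for the extractor to "read".
# On Windows: "type NUL > dummyfile.csv" at a command prompt,
# or "New-Item dummyfile.csv -ItemType File" in PowerShell.
touch dummyfile.csv
```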

The custom extractor uses a C# for loop as follows:

for (Int64 i = _startVal; i <= _endVal; i += _step)

and this loop simply generates a single-column row per value using the output.Set method.  The full code for the code-behind file is:

using Microsoft.Analytics.Interfaces;
using System;
using System.Collections.Generic;

namespace CustomExtractor
{
    // Custom extractor that ignores its input file and instead emits
    // a numeric series as rows, one Int64 value per row.
    [SqlUserDefinedExtractor(AtomicFileProcessing = false)]
    public class GenerateSeries : IExtractor
    {
        private readonly Int64 _startVal;
        private readonly Int64 _endVal;
        private readonly Int64 _step;

        public GenerateSeries(Int64 startVal, Int64 endVal, Int64 step)
        {
            this._startVal = startVal;
            this._endVal = endVal;
            this._step = step;
        }

        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
        {
            // Inclusive upper bound, so GenerateSeries(1, 1000000000, 1)
            // yields exactly one billion rows.
            for (Int64 i = _startVal; i <= _endVal; i += _step)
            {
                output.Set<Int64>(0, i);
                yield return output.AsReadOnly();
            }
        }
    }
}

Once you have this in place you can call it from your U-SQL script:

 

DECLARE @in string = "dummyfile.csv";
DECLARE @out string = "result.csv";

@input =
    EXTRACT mycol Int64
    FROM @in
    USING new CustomExtractor.GenerateSeries(1, 1000000000, 1);

@t =
    SELECT mycol,
           mycol + 2 AS AnotherColumn
    FROM @input;

OUTPUT @t
TO @out
USING Outputters.Csv(outputHeader : true);

This script calls the CustomExtractor.GenerateSeries extractor and passes three arguments, which in turn become the three arguments used in the C# for loop, so the series can be customised pretty easily.
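For example (illustrative values, not from the original script), generating every even number from 0 to 1,000 only needs different arguments:

```sql
@input =
    EXTRACT mycol Int64
    FROM @in
    USING new CustomExtractor.GenerateSeries(0, 1000, 2);  // start, end, step
```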

The @t SELECT statement allows you to inject additional columns.  This could be where you generate columns for random dates, products, quantities etc. on a pretty major scale if you wanted.
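As a sketch of that idea (the column names and expressions below are illustrative, not from the original script), deterministic arithmetic on mycol is a cheap way to fake dates, products and quantities, since U-SQL expressions accept C# directly:

```sql
@t =
    SELECT mycol,
           mycol % 100 AS Quantity,                                    // cycles through 0-99
           new DateTime(2017, 1, 1).AddDays(mycol % 365) AS OrderDate, // one year of dates
           "Product " + (mycol % 50).ToString() AS ProductName         // 50 distinct products
    FROM @input;
```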

I first ran this locally on my machine and filled up my hard drive pretty quickly, so I switched to my Azure Data Lake Store, where space is no issue.  With 2 vertices the query took 20 seconds to prepare and sat 7 seconds in the queue, but ran for 16 minutes.

The final result was a pretty easy-to-customise file with 1 billion rows that was about 20GB in my Azure Data Lake Store account.

Philip Seamark
Phil is a Microsoft Data Platform MVP and an experienced database and business intelligence (BI) professional with deep knowledge of the Microsoft BI stack, along with extensive knowledge of data warehouse (DW) methodologies and enterprise data modelling. He has 25+ years of experience in this field and is an active member of the Power BI community.
