How to generate 1 billion rows using U-SQL


I wanted to generate some dummy data for load testing in MS Azure and came up with a pretty nifty way to generate lots and lots of data using U-SQL. The trick is simply to create a small U-SQL custom extractor and use it to extract from a dummy file.

First I created a dummy file… literally. In the input folder on my local machine I created a blank, 0-byte file, just to stop the custom extractor complaining that there is no input file, even though I'm not actually going to read anything from it.

The custom extractor is written in C# and lives in the script's code-behind file. At its heart is a for loop that simply generates a single-column row on each iteration using the output.Set function.
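A minimal sketch of what such a code-behind extractor could look like is below. It assumes the three constructor arguments are a start value, an end value and a step (those names are my own illustration), and that the EXTRACT schema declares a single column of type long.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Analytics.Interfaces;

namespace CustomExtractor
{
    // Process the (empty) dummy file as one unit rather than splitting it.
    [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
    public class GenerateSeries : IExtractor
    {
        // Assumed meaning of the three arguments: start, end (inclusive), step.
        private readonly long _start;
        private readonly long _end;
        private readonly long _step;

        public GenerateSeries(long start, long end, long step)
        {
            // A non-positive step would make the loop below run forever.
            if (step <= 0)
            {
                throw new ArgumentOutOfRangeException("step", "step must be positive");
            }
            _start = start;
            _end = end;
            _step = step;
        }

        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
        {
            // The input stream is never read; the dummy file only gives EXTRACT something to point at.
            for (long i = _start; i <= _end; i += _step)
            {
                // Column 0 must be declared as long in the EXTRACT schema.
                output.Set<long>(0, i);
                yield return output.AsReadOnly();
            }
        }
    }
}
```

Under those assumptions, the script would reference it with something like `USING new CustomExtractor.GenerateSeries(1, 1000000000, 1)` to produce one billion rows.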

Once you have this in place you can call it from your U-SQL script. The script calls the CustomExtractor.GenerateSeries function and passes three arguments, which in turn become the three arguments used in the C# for loop, so these can be customised pretty easily.

The @t SELECT statement lets you inject additional columns. This is where you could generate columns for random dates, products, quantities etc. on a pretty major scale if you wanted.
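One way to fill those extra columns is a small static helper in the same code-behind file, sketched below. Everything in it is hypothetical: the RowFaker name, the product list, the base date and the value ranges are only there for illustration. It derives each value from the row id rather than from System.Random, so that rerunning the job gives the same data.

```csharp
using System;

namespace CustomExtractor
{
    // Hypothetical helper: derives repeatable, random-looking values from the row id.
    public static class RowFaker
    {
        private static readonly string[] Products = { "Widget", "Gadget", "Gizmo", "Doohickey" };
        private static readonly DateTime BaseDate = new DateTime(2015, 1, 1);

        // Cheap integer scramble so neighbouring ids do not produce neighbouring values.
        private static ulong Mix(long id)
        {
            unchecked
            {
                ulong x = (ulong)id;
                x ^= x >> 33;
                x *= 0xff51afd7ed558ccdUL;
                x ^= x >> 33;
                return x;
            }
        }

        // A date within 730 days of the base date.
        public static DateTime OrderDate(long id)
        {
            return BaseDate.AddDays((double)(Mix(id) % 730UL));
        }

        // One of the illustrative product names.
        public static string Product(long id)
        {
            return Products[(int)((Mix(id) >> 8) % (ulong)Products.Length)];
        }

        // A quantity between 1 and 100 inclusive.
        public static int Quantity(long id)
        {
            return (int)((Mix(id) >> 16) % 100UL) + 1;
        }
    }
}
```

In the script, the @t SELECT could then call these next to the generated number, for example `CustomExtractor.RowFaker.Quantity(id) AS Quantity`, assuming the extracted column is named id.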

I first ran this locally on my machine and filled up my hard drive pretty quickly, so I switched to my Azure Data Lake Store, where space is no issue. With 2 vertices the query took 20 seconds to prep, sat in the queue for 7 seconds, and then ran for 16 minutes.

The final result was a pretty easy-to-customise file with 1 billion rows that was about 20GB in my Azure Data Lake Store account.
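As a rough sanity check on those numbers, the sketch below works out the implied row size and throughput. It assumes "about 20GB" means 20 × 10^9 bytes and counts only the 16 minutes of run time. That comes to roughly 20 bytes per row, a little over a million rows per second, and about 21 MB written per second.

```csharp
using System;

public static class RunStats
{
    // Figures quoted above; treating 20GB as 20 * 10^9 bytes is an assumption.
    public const double Rows = 1e9;
    public const double Bytes = 20e9;
    public const double RunSeconds = 16 * 60; // run time only, excluding prep and queue

    public static double BytesPerRow() { return Bytes / Rows; }
    public static double RowsPerSecond() { return Rows / RunSeconds; }
    public static double MegabytesPerSecond() { return Bytes / RunSeconds / 1e6; }
}
```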

Philip Seamark
Consultant at RADACAD
Phil is a Microsoft Data Platform MVP and an experienced database and business intelligence (BI) professional with a deep knowledge of the Microsoft BI stack, along with extensive knowledge of data warehouse (DW) methodologies and enterprise data modelling. He has 25+ years of experience in this field and is an active member of the Power BI community.
