Using * in OPENROWSET to use Data Lake metadata

Use the values of wildcards in your SQL queryโ€”Very powerful way to convert your file path or filename metadata into actual data!

Letโ€™s say you stored your files like this:
yellow_tripdata_<year>-<month>.csv

For example, yellow_tripdata_ ๐Ÿฎ๐Ÿฌ๐Ÿฎ๐Ÿฎ-๐Ÿฌ๐Ÿญ.csv

Letโ€™s fetch those year and month data and turn them into SQL columns, with two simple steps (and a bonus tip)

๐Ÿญ) ๐—”๐—ฑ๐—ฑ ๐˜„๐—ถ๐—น๐—ฑ๐—ฐ๐—ฎ๐—ฟ๐—ฑ๐˜€ ๐—ถ๐—ป ๐—ข๐—ฃ๐—˜๐—ก๐—ฅ๐—ข๐—ช๐—ฆ๐—˜๐—ง()

By replacing those dynamic values with wildcards we can ensure we read all the data, but also subsequently that we can use the wildcards as data.

Code example:

FROM OPENROWSET(
BULK 'csv/taxi/yellow_tripdata_*-*.csv'

๐Ÿฎ) ๐—จ๐˜€๐—ฒ ๐˜„๐—ถ๐—น๐—ฑ๐—ฐ๐—ฎ๐—ฟ๐—ฑ๐˜€ ๐—ถ๐—ป ๐—ฆ๐—˜๐—Ÿ๐—˜๐—–๐—ง ๐˜„๐—ถ๐˜๐—ต ๐—ณ๐—ถ๐—น๐—ฒ๐—ฝ๐—ฎ๐˜๐—ต()

The function filepath() allows you to return the entire filepath of your files, but if you add a number inside the () you can refer to any wildcard from step 1.

Filepath(1) will represent the value of the first wildcard, (2) the second, and so on.

Code example:

SELECT
r.filepath() AS filepath
,r.filepath(1) AS [year]
,r.filepath(2) AS [month]

๐Ÿฏ) ๐—•๐—ผ๐—ป๐˜‚๐˜€ ๐˜๐—ถ๐—ฝ - ๐—ช๐—ผ๐—ฟ๐—ธ๐˜€ ๐—ถ๐—ป ๐—ช๐—›๐—˜๐—ฅ๐—˜ ๐˜€๐˜๐—ฎ๐˜๐—ฒ๐—บ๐—ฒ๐—ป๐˜

Using the wildcard references is not limited to SELECT, but also works in WHERE statements. This means you can use the metadata of your filepath or file name to filter your query.
Quite powerful!

Code example:

WHERE
r.filepath(1) IN ('2017')
AND r.filepath(2) IN ('10', '11', '12')

It even works if you use the alias instead, even though the value has been casted.
And thatโ€™s it! A little neat trick with lots of utility in the right hands.

Previous
Previous

Measure/Column dependencies in Power BI

Next
Next

Power BI Data Marts