3

I have a shapefile with some attributes (Hylac_id, Lake_name, ...) and i want to catch the polygone with a specefic Lake_name. The database than I use is available on the website https://www.hydrosheds.org/page/hydrolakes

I try with this method but if you have a completly different method I open to change my coding strategy:

  1. I open the shapefile with the module ogr of python
  2. I give a sql request to the method ExecuteSQL to shortlist my shapefile

this is the code that scheaches for example the lake Baikal

from osgeo import ogr

file2 = "/home/surface4/bernus/Base_Lac/HydroLake/HydroLAKES_polys_v10_shp/HydroLAKES_polys_v10.shp"
driver = ogr.GetDriverByName ("ESRI Shapefile")
ogr_ds = driver.Open(file2)
sql = "SELECT * FROM HydroLAKES_polys_v10 WHERE Lake_name='Baikal'"
layer = ogr_ds.ExecuteSQL(sql)

And when I print print layer[0]["Lake_name"], I get the first line of the database which is the Caspian Sea.

I try other request to see as:

  • "SELECT * FROM HydroLAKES_polys_v10 WHERE Hylac_id>5"

    This do not work, here i do not only have the first line but all the database

  • "SELECT * FROM HydroLAKES_polys_v10 WHERE Hylac_id<5"

    This work well

  • I have: "SELECT * FROM HydroLAKES_polys_v10 ORDER BY Lake_name"

    This work well

So I do not understand what I do in bad way, I try my request in online SQL database to check if my syntax is correct.

6
  • Can you share some test data? It may be all right that query "SELECT * FROM HydroLAKES_polys_v10 WHERE Hylac_id>5" selects many rows, or do you really have only one row with Hylac_id value that is greater than 5? Commented Mar 23, 2021 at 19:41
  • There is my results for each case: -"SELECT * FROM HydroLAKES_polys_v10 WHERE Lake_name='Baikal'" 1 row selected with Lake_name equal to 'Caspian Sea" -"SELECT * FROM HydroLAKES_polys_v10 WHERE Hylac_id>5" I select 1 427 683 rows and my first element has the Hylac_id attribute equal to 1 (normally I shouldn't according to the request) -"SELECT * FROM HydroLAKES_polys_v10 WHERE Hylac_id<5" I select 4 rows with all the entity with attribute Hylac_id below 5 -"SELECT * FROM HydroLAKES_polys_v10 ORDER BY Lake_name" I select all the rows (1 427 688) Commented Mar 23, 2021 at 20:32
  • 2
    If you run your 'Baikal' command with ogrinfo you will see that it works. In Python code GDAL seems to read attributes from the original dbf file so that first selected feature is paired with the first row of the dbf file. Perhaps you would like to ask from the gdal-dev mailing list why that happens. Commented Mar 23, 2021 at 20:38
  • 1
    Meanwhile you can use the SQLite dialect in this style >>> layer = ogr_ds.ExecuteSQL("select STATE_NAME from states where STATE_NAME='Ohio'", dialect='SQLite'). Commented Mar 23, 2021 at 20:44
  • I try ogrinfo with the same request and I have the good row. The option dialect='SQLite' solves this problem, I do not understand why, but it works (maybe with the problem of the dbf file). I will try the others requests before to change the status to resolve. Thanks for your solutions and explanations Commented Mar 23, 2021 at 20:58

1 Answer 1

4

You could use feature = layer.GetNextFeature() to get the first feature and then use feature["Lake_name"] to get your desired field.

Seems like using layer[0] does not take the SQL query into account. Using dialect="SQLITE" does (surprisingly) work, however, it may be better not to use it with large datasets.

Note: You may also want to consider using ogr_ds.ReleaseResultSet(layer) and feature.Destroy() when you finished.


Using SQL Query and OGR

from osgeo import ogr

# data downloadable from here: https://www.hydrosheds.org/page/hydrolakes
shapefile = r"data\HydroLAKES_polys_v10_shp\HydroLAKES_polys_v10.shp"

driver = ogr.GetDriverByName("ESRI Shapefile")
data_source = driver.Open(shapefile, 0) 

layer = data_source.GetLayer()
query = f"SELECT * FROM {layer.GetName()} WHERE Lake_name = 'Baikal'"
result = data_source.ExecuteSQL(query)

feature = result.GetNextFeature()
print(feature["Lake_name"]) # prints Baikal

feature.Destroy()
data_source.ReleaseResultSet(result)

Using an Attribute Filter (instead of SQL) and OGR

from osgeo import ogr

# data downloadable from here: https://www.hydrosheds.org/page/hydrolakes
shapefile = r"data\HydroLAKES_polys_v10_shp\HydroLAKES_polys_v10.shp"

driver = ogr.GetDriverByName("ESRI Shapefile")
data_source = driver.Open(shapefile, 0) # read-only

layer = data_source.GetLayer()
layer.SetAttributeFilter("Lake_name = 'Baikal'")

feature = layer.GetNextFeature()
print(feature["Lake_name"]) # prints Baikal

feature.Destroy()

Using GDAL instead of OGR

There is a note in the API documentation that OGR's data_source.ExecuteSQL(query) is deprecated:

Deprecated Use GDALDatasetExecuteSQL() in GDAL 2.0

See also:

Following an example using GDAL. It's pretty similar.

from osgeo import gdal

# data downloadable from here: https://www.hydrosheds.org/page/hydrolakes
shapefile = r"data\HydroLAKES_polys_v10_shp\HydroLAKES_polys_v10.shp"

data_source = gdal.OpenEx(shapefile, gdal.OF_VECTOR)

layer = data_source.GetLayer()
query = f"SELECT * FROM {layer.GetName()} WHERE Lake_name = 'Baikal'"
result = data_source.ExecuteSQL(query)

feature = result.GetNextFeature()
print(feature["Lake_name"])

feature.Destroy()
data_source.ReleaseResultSet(result)

Using dialect="SQLITE"

It looks like layer[0]["Lake_name"] works when using dialect="SQLITE" but it is slow since it queries the full table at once. I recommend to use result.GetNextFeature() instead of following example.

# data downloadable from here: https://www.hydrosheds.org/page/hydrolakes
shapefile = r"data\HydroLAKES_polys_v10_shp\HydroLAKES_polys_v10.shp"

driver = ogr.GetDriverByName("ESRI Shapefile")
data_source = driver.Open(shapefile, 0) 

layer = data_source.GetLayer()
query = f"SELECT * FROM {layer.GetName()} WHERE Lake_name = 'Baikal'"
result = data_source.ExecuteSQL(query, dialect="SQLITE")

print(result[0]["Lake_name"]) # prints Baikal

for feature in result:
    feature.Destroy()
data_source.ReleaseResultSet(result)

Using Geopandas instead of OGR

Note: Since the shapefile is huge (1GB+), a solution with Geopandas is not recommended though unless you have endless memory and a power machine. ;) Running the code below on my machine with your shapefile required at its peak 6 GB RAM.

However, it may help some other users which work with a smaller shapefile. That's why I leave it here.

import geopandas as gpd

# data downloadable from here: https://www.hydrosheds.org/page/hydrolakes
shapefile = r"data\HydroLAKES_polys_v10_shp\HydroLAKES_polys_v10.shp"

lakes = gpd.read_file(shapefile)
lakes = lakes[lakes["Lake_name"] == "Baikal"]
print(lakes.head(1))
2
  • All the solutions I tried and it works (except with geopandas as you says it is not a good solution with huge dataset). However I do not understand the logics of the ogr object. I thought that the result will only contain the result of the request but we can see that it is not the case by printing result[0]. Why result[0] and result.GetNextFeature() point different object? Commented Mar 24, 2021 at 10:49
  • 1
    @AnthonyB, I am not sure why result[0] is not returning the correct feature but I assume it's a bug especially since it would work with using dialect="SQLITE". Consider @user30184 suggestion (in his comment below your question) to ask the gdal-dev for clarification. ExecuteSQL(query) returns a Layer which is iterable using either a for loop or GetNextFeature() in a loop. I recommend to access the feature(s) this way. Note: I added an extra solution using GDAL since OGR's ExecuteSQL(query) seems to be deprecated. Commented Mar 24, 2021 at 15:25

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.