TypeError: sequence item 1: expected a bytes-like object, str found

Question

I am trying to extract English titles from a wiki titles dump stored in a text file, using regex in Python 3. The dump also contains titles in other languages and some symbols. Below is my code:

with open('/Users/some/directory/title.txt', 'rb')as f:
    text=f.read()
    letters_only = re.sub(b"[^a-zA-Z]", " ", text)
    words = letters_only.lower().split() 
print(words)

But I am getting an error:

TypeError: sequence item 1: expected a bytes-like object, str found

at the line: letters_only = re.sub(b"[^a-zA-Z]", " ", text)

But I am using b'' to make the pattern a bytes object. Below is a sample of the text file:

Destroy-Oh-Boy!!
!!Que_Corra_La_Voz!!
!!_(chess)
!!_(disambiguation)
!'O!Kung
!'O!Kung_language
!'O-!khung_language
!337$P34K
!=
!?
!?!
!?Revolution!?
!?_(chess)
!A_Luchar!
!Action_Pact!
!Action_pact!
!Adios_Amigos!
!Alabadle!
!Alarma!
!Alarma!_(album)
!Alarma!_(disambiguation)
!Alarma!_(magazine)
!Alarma!_Records
!Alarma!_magazine
!Alfaro_Vive,_Carajo!
!All-Time_Quarterback!
!All-Time_Quarterback!_(EP)
!All-Time_Quarterback!_(album)
!Alla_tu!
!Amigos!
!Amigos!_(Arrested_Development_episode)
!Arriba!_La_Pachanga
!Ask_a_Mexican!
!Atame!
!Ay,_Carmela!_(film)
!Ay,_caramba!
!BANG!
!Bang!
!Bang!_TV
!Basta_Ya!
!Bastardos!
!Bastardos!_(album)
!Bastardos_en_Vivo!
!Bienvenido,_Mr._Marshall!
!Ciauetistico!
!Ciautistico!
!DOCTYPE
!Dame!_!Dame!_!Dame!
!Decapitacion!
!Dos!
!Explora!_Science_Center_and_Children's_Museum
!F
!Forward,_Russia!
!Forward_Russia!
!Ga!ne_language
!Ga!nge_language
!Gã!ne
!Gã!ne_language
!Gã!nge_language
!HERO
!Happy_Birthday_Guadaloupe!
!Happy_Birthday_Guadalupe!
!Hello_Friends

I have searched online but could not find a solution. Any help will be appreciated.

@imant I tried this also, but I am getting the error below: TypeError: cannot use a string pattern on a bytes-like object
– Sherlock, Commented Oct 2, 2016 at 15:22

Dimitris Fasarakis Hilliard · Accepted Answer · 2016-10-02 15:52:35Z

9

The problem is the repl argument you supply; it is a str, not a bytes object:

letters_only = re.sub(b"[^a-zA-Z]", " ", b'Hello2World')
# TypeError: sequence item 1: expected a bytes-like object, str found

Instead, supply repl as a bytes instance b" ":

letters_only = re.sub(b"[^a-zA-Z]", b" ", b'Hello2World')
print(letters_only)  # b'Hello World'

Note: Don't prefix your literals with b and don't open the file with rb if you aren't looking for byte sequences.
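
The type rule behind both errors in this thread can be sketched as follows. This is a minimal illustration using only the standard re module; the helper name try_sub is hypothetical and exists only to capture the error message instead of crashing.

```python
import re

def try_sub(pattern, repl, string):
    """Return the result of re.sub, or the TypeError message if the types clash."""
    try:
        return re.sub(pattern, repl, string)
    except TypeError as exc:
        return "TypeError: {}".format(exc)

# Pattern, replacement, and input are all bytes: works.
all_bytes = try_sub(b"[^a-zA-Z]", b" ", b"Hello2World")
# Pattern, replacement, and input are all str: works.
all_str = try_sub("[^a-zA-Z]", " ", "Hello2World")
# bytes pattern and input with a str replacement: the error from the question.
mixed_repl = try_sub(b"[^a-zA-Z]", " ", b"Hello2World")
# str pattern on bytes input: the error from the comment under the question.
mixed_pattern = try_sub("[^a-zA-Z]", " ", b"Hello2World")

print(all_bytes)
print(all_str)
print(mixed_repl)
print(mixed_pattern)
```

The first two calls succeed and the last two report a TypeError, so the practical rule is to pick one type and use it for all three arguments.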

edited Oct 2, 2016 at 15:52

answered Oct 2, 2016 at 15:21

Dimitris Fasarakis Hilliard


4 Comments

Jean-François Fabre Over a year ago

very nice, I did not know it could be done on bytes. However I'm not sure it's the way to go here. Better go text-only and drop the bytes. Well, maybe it avoids encoding problems.

Sherlock Over a year ago

That works; now there is no error. But I am getting a "b" prefix on every extracted word, like this: [b'you', b'and', b'then', b'some']. According to you, I think it should not be there.

Dimitris Fasarakis Hilliard Over a year ago

@Jean-FrançoisFabre you were right ;-). Sherlock, just open the file without specifying b, as @Jean suggests in his answer. Adding b to the mode when opening a file makes its contents come back as bytes objects; if that isn't what you need, drop it :-)

Jean-François Fabre Over a year ago

Let me say I'm pleased with the way things turned out: Jim is the most knowledgeable of us all, and he knew about the ability to use regexes on bytes, while we mere mortals just wanted to use a text file and knew zip about that! So everyone learned something and no one got bashed (I almost deleted my post at some point).

Jean-François Fabre · Accepted Answer · 2016-10-02 15:23:17Z

4

You have to choose between binary and text mode.

Either you open your file as rb and then you can use re.sub(b"[^a-zA-Z]", b" ", text) (text is a bytes object)

Or you open your file as r and then you can use re.sub("[^a-zA-Z]", " ", text) (text is a str object)

The second solution is the more conventional choice when you are processing text.
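
Both options can be tried side by side with a small sketch. The temporary file below is a hypothetical stand-in for title.txt holding three titles from the question, and the utf-8 encoding is an assumption, since the question does not say how the dump is encoded.

```python
import os
import re
import tempfile

# Hypothetical sample standing in for title.txt; three titles from the question.
sample = "!Action_Pact!\n!Gã!ne_language\n!Hello_Friends\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "title.txt")
    with open(path, "w", encoding="utf-8") as f:  # utf-8 is an assumption
        f.write(sample)

    # Binary mode: pattern, replacement, and data are all bytes.
    with open(path, "rb") as f:
        data = f.read()
    words_bytes = re.sub(b"[^a-zA-Z]", b" ", data).lower().split()

    # Text mode: pattern, replacement, and data are all str.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    words_text = re.sub("[^a-zA-Z]", " ", text).lower().split()

# Decoding the bytes results removes the b prefix from each word.
words_decoded = [w.decode("ascii") for w in words_bytes]

print(words_bytes[:2])   # [b'action', b'pact']
print(words_text)        # ['action', 'pact', 'g', 'ne', 'language', 'hello', 'friends']
```

Note that [^a-zA-Z] also discards accented letters such as ã in either mode, which is why "Gã" is reduced to "g" here.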

edited Oct 2, 2016 at 15:23

answered Oct 2, 2016 at 15:18

Jean-François Fabre♦


Comments

Dartmouth · Accepted Answer · 2016-10-02 15:33:32Z

2

You can't use a bytes pattern for your regex when the replacement string isn't bytes as well.
Essentially, you can't mix different types (bytes and str) in most operations. In your code above, the pattern and the text are both bytes, but the replacement is a regular string. All arguments need to be of the same type, so there are two possible solutions: make everything bytes, or make everything str.

Taking the above into account, your code could look like this (this returns regular strings, not bytes objects):

import re

with open('/Users/some/directory/title.txt', 'r') as f:
    text = f.read()

# Replace every non-letter character with a space, then split into lowercase words.
letters_only = re.sub(r"[^a-zA-Z]", " ", text)
words = letters_only.lower().split()
print(words)

Note that the code uses a special type of string for the regex: a raw string, prefixed with r. This means that Python won't interpret backslash escape sequences such as \n inside the literal, which is very useful for regexes. See the docs for more details about raw strings.
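
A short sketch of what the r prefix changes, and what it does not; the input string "ay, caramba!" is just an illustrative sample.

```python
import re

# A regular literal processes the escape: "\n" is a single newline character.
assert len("\n") == 1
# A raw literal keeps the backslash: r"\n" is a backslash followed by "n".
assert len(r"\n") == 2

# For the pattern in this answer the prefix makes no difference,
# because "[^a-zA-Z]" contains no backslashes.
same_pattern = r"[^a-zA-Z]" == "[^a-zA-Z]"

# It matters once the pattern uses regex escapes such as \b (word boundary);
# in a regular literal "\b" would be a backspace character instead.
boundary_words = re.findall(r"\b[a-z]+\b", "ay, caramba!")
print(boundary_words)  # ['ay', 'caramba']
```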

edited Oct 2, 2016 at 15:33

answered Oct 2, 2016 at 15:25

Dartmouth


2 Comments

Jean-François Fabre Over a year ago

actually you CAN do it, see Jim's answer. You should know it, I know it for at least ... 5 minutes :)

Dartmouth Over a year ago

@Jean-FrançoisFabre Hmmmmmmm... I do too now ;)

Suzanne Soy · Accepted Answer · 2021-04-20 20:12:47Z

0

You can also use br'…', which is the byte analog to r'…'. The replacement must also be a byte string.

letters_only = re.sub(br'[^a-zA-Z]', b' ', text)
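
As a quick check of what the br prefix means, here is a sketch on a small bytes sample built from two titles in the question; the sample value is an assumption standing in for the file contents.

```python
import re

# br'...' (also spelled rb'...') is a bytes literal with raw-string escaping.
assert br"\d" == b"\\d"
assert rb"\d" == br"\d"

# Sample bytes standing in for the file contents read with mode 'rb'.
text = b"!337$P34K\n!Hello_Friends\n"
letters_only = re.sub(br"[^a-zA-Z]", b" ", text)
words = letters_only.lower().split()
print(words)  # [b'p', b'k', b'hello', b'friends']
```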

answered Apr 20, 2021 at 20:12

Suzanne Soy

