Skip to content

Conversation

@mraenker
Copy link
Contributor

@mraenker mraenker commented Nov 2, 2022

Reason for the bugfix

  • Customer complained, in emails, where the filenames of attachments are specified using Content-Type: application/octet-stream; charset=UTF-8; name=example_attachment_mail.csv or Content-Disposition: attachment; filename=example_attachment_mail.csv (meaning: a filename or name attribute, where the filename is not quoted), are not considered as valid filesnames and thus, the filetype with an added integer value is used as the filename of the attachment.
  • RFC 2045 defines, not using quotes is allowed, if the filename obliges to the rules described in the comment for the regex in my commit.

Explanation of the approach to fix it

  • First of all, please note, this PR will change the behavior of this function slightly - but, I think, only for good: filenames from now on will never contain quotes or doublequotes.
  • The approach is: use a regex with capture groups (and a proper explanation for the regex) to handle the rules of the RFC in this order: (1) the default case for double quoted filename= or name= attributes, (2) the same with single quotes (the RFC doesn't specify the method of quoting, so we take care of that aswell), (3) the case, where the filename isn't quoted.

Implementationdetails and Caveats

  • Instead of creating a complex regex with lookbehinds and stuff like that, I've opted to simply whitelist all chars the RFC specifies to be compatible to be filename, when no quotes are used.
  • Since the existing code does index into $aMatches[1] to get the result of the match, the content of $aMatches[1] can now be different compared to before, since we are now using capture groups and thus another level of possible results is introduced. But what we can be sure about is: $aMatches[1] will always contain a) the filename with quotes, or b) the filename without quotes. For that reason, we simply use str_replace() later on to remove all types of quotes from the filenames, since nobody wants filenames containing any kind of quotes, anyway. I think, no harm will be done by this slight change in behavior.
  • Here are three examples for the changed contents of $aMatches:
  1. Default case, where the filename is doublequoted, before the patch was
Array
(
    [0] => filename="moz-screenshot.png"
    [1] => moz-screenshot.png
)

and now will be

Array
(
    [0] => filename="moz-screenshot.png"
    [1] => "moz-screenshot.png"
    [2] => moz-screenshot.png
)

(Note the difference in [1] and [1] from before and after.)

  1. Case with single quotes, after the patch (wasn't handled correctly until now) will be
Array
(
    [0] => name='example_attachment_mail.csv'
    [1] => 'example_attachment_mail.csv'
    [2] => 
    [3] => example_attachment_mail.csv
)

(Note the quotes in [1].)

  1. Case without any quotes, after the patch (wasn't handled correctly until now) will be
    [0] => name=example_attachment_mail.csv
    [1] => example_attachment_mail.csv
    [2] => 
    [3] => 
    [4] => example_attachment_mail.csv
    [5] => example_attachment_mail.csv

(Note: No quotes in [1])

Testresults

  • I've created a new testmail test/emailsSample/email_with_and_without_quotes_around_attachment_name.eml containing all cases of a filename and a name attribute a) with double quotes, b) with single quotes, c) without any quotes. It is contained in this PR aswell.
  • I've ran utils/test.php for all testmails before the patch and after and created a diff of the output: Only my testcases and attachments of three other testmails, which don't comply with the standard occur. The explanation of the other testcases occuring are:
  • 1544c1544: In this case, no quotes are used for the filename, but the filename contains a space, which is not permitted as of the RFC.
  • 1618c1618 and 1622c1622: In this cases, the filename until now was wrongfully built, since no quotes, but containing a space resulting in a filename containing the attributes size and creation_date by accident.

All other attachment filenames stay the same. Please see the following screenshot for proof.
image

A word about the caveat

Yes, this PR will slightly change the behavior of the function (no filenames containing quotes are possible anymore), but I don't think, this will cause any harm (quite the opposite, I think). Furthermore, recreating the exact behavior with quoted filenames as before, will result in a way more complicated function, where we'd need to first check, which rule is applicable and then - based on the rule chosen - apply the right indexing into $aMatches. I think, this approach would overcomplicate things in an unnecessary way. - But of course we are free to discuss this.

/cc @larhip

Cheers,
Martin

@Molkobain Molkobain added the bug Something isn't working label Nov 2, 2022
@Molkobain Molkobain assigned Molkobain and unassigned Molkobain Nov 2, 2022
@mraenker
Copy link
Contributor Author

mraenker commented Nov 4, 2022

Hi Combodo,

short update:

  • The three different parts of the regex are now correctly paranthesized
  • Squashed the four commit into one, so you have a better overview

Tests were ran successfully again. Same diffs for against results before the patch and same results for against my "first-after-the-patch"-tests.

Cheers,
Martin

@mraenker
Copy link
Contributor Author

Rebased and Commits squashed and pushed again. @Molkobain : I hope, this is now ready for a review (when you got the time to do so).

Martin

@Molkobain Molkobain self-assigned this Dec 14, 2022
Copy link
Contributor

@Molkobain Molkobain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the code itself, it seems ok to me. Just ran some test with more "complex" filenames having single quotes (happens a lot in french), spaces and all. It all worked.

Only change requested is vocabulary. PR will be review during the next technical review. Thanks for adding the test sample!

@Molkobain
Copy link
Contributor

Discussed during technical review, code itself is accepted.
We just would like you to add a unit test for the regexp itself, so the test documents the test cases which will:

  • Improve robustness for futures changes
  • Ease the understanding of the RegExp itself through the documented examples

If you need some help in adding the unit test (extract the regexp in a dedicated const/method, add the test file, ...), let us know :)

@mraenker
Copy link
Contributor Author

mraenker commented Jan 4, 2023

Thank you for the follow up.

If you need some help in adding the unit test (extract the regexp in a dedicated const/method, add the test file, ...), let us know :)

If you can pinpoint me to an example, where this is done already, I think I can figure it out on my own. But I don't want to prepare something, which isn't helpful the way I did it (e. g. usage of PHPUnit or alike).

@Molkobain
Copy link
Contributor

Sure, you can:

  • Duplicate the combodo-email-synchro/test/TestEmlFiles.php test file (for namespace, parent class, ...)
  • Delete all methods (even the setUp() won't be necessary there)
  • Duplicate testToAcronym() and ToAcronymProvider() methods from the iTop test/application/UtilsTest.php file as an example.
    • ToAcronymProvider() will show you how to provide / document test cases
    • testToAcronym() will show you how to make the test itself

Mind that you will have to extract the RegExp in a class constant or method so that you can use it in both the business logic and the unit test.

@Molkobain
Copy link
Contributor

Moved to functional review to avoid loosing a month

@piRGoif
Copy link

piRGoif commented Jan 10, 2023

Accepted during functional review

@Molkobain
Copy link
Contributor

Moved back to "contributor update" as we are waiting for the unit test

Please note the comment in TestRegexAttachmentNames.php:55. Could
you please have a look, if you've got any idea, why this is happening?
@mraenker
Copy link
Contributor Author

mraenker commented Jan 11, 2023

Hi. Please find the Testcases in the latest commit. I have a problem with the ordinary Char 0x2C (Comma). It is not included in list of allowed chars of the original Regex, but nevertheless slips through in the testcase for not allowed chars. I don't understand this.

image

// Edit:

It works at regex101.com
image

@Molkobain
Copy link
Contributor

Molkobain commented Jan 12, 2023

Hello Martin,

Thanks for the unit test!

That being sad i'm not sure to understand it correctly. I was excepting something like the .eml lines to be tested to better document how filenames are expected to be extracted, but I may be wrong. For example:

	public function AttachmentFilenameProvider()
	{
		return [
			'No delimiter filename' => [
				"Content-Disposition: attachment; filename=example_attachment_mail.csv",
				"example_attachment_mail.csv",
			],
			'Single Quotes delimit filename' => [
				"Content-Disposition: attachment; filename='example_attachment_mail.csv'",
				"example_attachment_mail.csv",
			],
			'Double Quotes delimit filename' => [
				'Content-Disposition: attachment; filename="example_attachment_mail.csv"',
				'example_attachment_mail.csv',
			],
			'Empty String' => [
				'',
				'',
			],
			...
		];
	}

Any comment @piRGoif

@mraenker
Copy link
Contributor Author

mraenker commented Jan 12, 2023

Hello Martin,
That being sad i'm not sure to understand it correctly. I was excepting something like the .eml lines to be tested to better document how filenames are expected to be extracted, but I may be wrong.

As far as I understood, those were the requirements for the unit test:

Improve robustness for futures changes
Ease the understanding of the RegExp itself through the documented examples

From that, I thought, my unit test should be about the test of the Regex itself, not about the whole function GetAttachments(). If this understanding is correct, my testcases try to cover every possible case, by appliying this logic:

  1. Test the base case: All explicitly allowed chars should be considered allowed by the Regex itself and therefore, I list them all, to make sure, there are no chars, which are not allowed, but should per as the RFC, which means: every char in the first array should be considered a valid char and therefore the strings should equal each other.
  2. The negative base case: All implicitly not allowed chars should be considered denied by the Regex itself and therefore, a list of all of them should result in an empty result. Applying this logic also made me to discover, the Comma (0x2C) slips through. (Why? I don't know.)
  3. Edges cases: Delimiters work as expected (Single and Double quotes) + Empty String has no result (as the code in GetAttachments() is doing aswell).

I, surely, could use examples from the mails, like you had in mind, but those would only be examples and not cover the whole range of possibilites, as my cases currently do.

Isn't that more complete of an approach?

@Molkobain
Copy link
Contributor

Alright, fine by me 👍
Just waiting for Pierre's review as well, then it will be ok.

@mraenker
Copy link
Contributor Author

mraenker commented Jan 25, 2023

But you are right, we could/should discuss them more internally before giving any feedback... The only issue is that we don't really have an official dedicated time for PRs, it's just on good will. :/

Well :( I didn't knew that. This explains things. Thank you for all the extra work you are putting in.

It's not a strict process but we discuss/review the technical implementation of each PR together (R&D team) at least once a month (on the first Tuesday). In the meantime we often check them on our end to see if there are things that could be improved before that review so it doesn't take another month before it can move forward to the functional review. Then on the second Tuesday of each month we review PRs on their functional side with both R&D and Product owners.

Yeah, don't get me wrong. I wasn't trying to critizise your processes, whatever they may look like (especially with the info above). And I think you are completely right, requesting "simple" improvements upfront to avoid loosing another month on each PR is a good practice. - But maybe it's worth to shortly discuss those internally upfront, aswell. I think, it could reduce the amount of artificial stress on all sides, coming around with a (for the moment) complete list of requested changes, as the amount of necessary iterations of coding-review-communicate might be reduced on all sides.

Thank you for explaining. I'll work on the requested improvements in the next few days.

@Molkobain
Copy link
Contributor

Exactly we should try to discuss things internally before asking for changes multiple times which can be a pain in the ash for the PR initiator. Especially as you can get the feeling that it's a never ending situation 😅

We'll try to improve this in the future :)

@piRGoif
Copy link

piRGoif commented Jan 25, 2023

Hello,
All my apologies, I missed Guillaume saying it is fine with having only a regexp... and I can't find it anymore :/ (but this PR generated a lot of discussion, and there are lot of them in resolved conversation now...)

The usual process is only 1 Combodo dev is set as assignee to the PR. He reads it, check for Combodo requirements, and then present it to the rest of the team during reviews (technical, functional, as Guillaume said). We then give one consolidated answer.

Your PR already passed both reviews, but we asked for more changes afterwards.
Guillaume asked for my opinion, and I missed your PR reviews.., So actually we are in a grey area :)
Next time I'll have a chat IRL with the PR assignee before answering !
Sorry for this confusion :/

@Molkobain tell me when available so we can agree on this PR status O:)

@Molkobain
Copy link
Contributor

@Molkobain tell me when available so we can agree on this PR status O:)

I'm ok with changing the static property to a constant you are right. We just need to ensure that static:: is used instead of self::

@piRGoif
Copy link

piRGoif commented Jan 26, 2023

We discussed IRL with Guillaume to check every aspect of my review and the past remarks we made.

We are proposing to :

  • set the regexp as a const instead of an attribute
  • keep the existing regexp test as is
  • factorize the 2 current regexp use in a new method (containing : if != '' + preg_match + end)
  • no test to add on that new method

As you already done lot of work, I can make the modifications if you wish ? (and also if you checked the "allow maintainers" PR checkbox)

@mraenker
Copy link
Contributor Author

mraenker commented Jan 27, 2023

As you already done lot of work, I can make the modifications if you wish ? (and also if you checked the "allow maintainers" PR checkbox)

I'd be thankful, if you did the modifications, yes. Thank you. Unfortunately, I seem to be blind. Where can I find the "Allow Maintainers"-Checkbox on the PR? I even searched the page with Ctrl+F and can't find it. 😞

/edit:

All my apologies, I missed Guillaume saying it is fine with having only a regexp... and I can't find it anymore :/ (but this PR generated a lot of discussion, and there are lot of them in resolved conversation now...)

No problem at all. Thank you for the explanation!

@mraenker
Copy link
Contributor Author

@piRGoif : I've added you to our list of maintainers for this repo. I hope you got an invitation and you can then apply your changes.

@piRGoif
Copy link

piRGoif commented Feb 3, 2023

@mraenker Just pushed the modifications in f0521eb
Ok for you ?

@piRGoif
Copy link

piRGoif commented Feb 3, 2023

By the way, the "|allow edits from maintainers](https://docs.github.com/en/github-ae@latest/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork)" option is described in our contributing file

Someone told rencently this option is only available during PR creation, and that is what I understand reading the doc :

When a user creates a pull request from a fork that they own, the user generally has the authority to decide if other users can commit to the pull request's compare branch

@mraenker
Copy link
Contributor Author

mraenker commented Feb 6, 2023

@piRGoif : Looks totally fine for me. Thank you very much! - Thank you aswell for the info on when I need to set the checkbox to allow you to edit PRs.

I just re-synched our branch on our fork to this master. Do I need to do something else before you can continue?

@piRGoif
Copy link

piRGoif commented Feb 6, 2023

Hello,

You're welcome Martin, we are very glad you did this change, and again all our apologies for the confusion on our side (mine in particular O:) )

I see you made a merge commit from master to your fork branch... Not sure this will go well when merging the pr :/ We'll see.

@Molkobain
Copy link
Contributor

Accepted during technical review.
Guillaume will dig into the "coma thing in the unit test" before the functional review.

@piRGoif
Copy link

piRGoif commented Apr 11, 2023

Accepted during functional review.
@Molkobain waiting for your review on the broken test use case

@Molkobain
Copy link
Contributor

I spent some time on this issue (the comma thing in test cases) but I couldn't find why it's behaving like this...
The issue is the same on my setup as for Martin, comma behave differently than other chars. I have no clue why.

The only notable thing is that in the base REGEXP, the char \x2D (the one after comme -\x2C-) is highlighted differently by PHPStorm. Don't know if it means something related or not.
image

@Molkobain
Copy link
Contributor

Actually maybe it has something to do with 2A (*) and 2B (+) being interpreted as regexp tokens, which would explain why 2D (-) is highlighted differently.
If I move both 2A / 2B at the end of the regexp, 2D is no longer highlighted and all test cases pass.

image

image

So maybe this is just a matter of how to add the special regexp tokens as actual chars, not tokens.
I would advise to find a way to escape these tokens or to use regular chars to ease escaping.

Moving 2A / 2B at the end of the regexp is not a solution in itself, it's just moving the issue somewhere else 😁

What do you all think?

@Hipska
Copy link
Collaborator

Hipska commented May 6, 2023

I would also use regular characters.

I think maybe they all need to be double escaped? What do you get when you print the var FILENAME_REGEX?

@mraenker
Copy link
Contributor Author

mraenker commented Jun 6, 2023

Hi all.

To move on with this PR, I'd like to respond to the latest comments and summarize.

  1. Afaik the only open issue with the PR is the screwed up behavior with 2A and/or 2B
  2. We know, just putting them at the end won't solve the issue. So, we still would violate the RFC (which is of importance, if we state, it's and RFC-compliant handling of the case in question.
  3. Hipska suggests to use regular characters.
  4. We don't know, what happens, when we double-escape the two Hex-Representations (could you test this sometime, @Molkobain ?)

Regarding point 3) from Hipska: What I like especially is, that it's a pragmatic approach. I think, I even already tried it in the past, but noticed a big (at least for me) downside: The Regex contains "valid chars", which are easily mistaken for other chars or being not even recognized as chars, since they are somewhat special. But to be even more pracmatic: Does anything prevent us from mixing the representation? The char in question, which creates the problems, is the comma. A comma is easy to not to be mistaken (like space vs. tab or something alike). -> Let's just use the regular "comma"-Char and only use the Hex-Repr for chars, which can be easily mistaken.

What do you think about that?

@Hipska
Copy link
Collaborator

Hipska commented Jun 6, 2023

Did you already try a simple print (to log or with var_dump) of the FILENAME_REGEX var? I think that will explain a lot..

I just tried in 3v4l and it shows they definitely need to be double escaped!!

@Molkobain Molkobain removed their assignment Aug 19, 2024
@jf-cbd
Copy link
Member

jf-cbd commented Apr 25, 2025

Hello @mraenker, is there any news on this PR (since there is no activity since almost 2 years) ?

@Hipska
Copy link
Collaborator

Hipska commented Apr 25, 2025

As proof for my hypothesis: https://3v4l.org/41fBL

So we definitely need it as in Bar.

@jf-cbd jf-cbd moved this from Pending technical review to Pending contributor update in Combodo PRs dashboard Nov 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

Status: Pending contributor update

Development

Successfully merging this pull request may close these issues.

6 participants